From carsonhh at gmail.com Thu Jul 3 09:12:07 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 03 Jul 2014 08:12:07 -0600
Subject: [maker-devel] Some questions regarding ab-initio training
In-Reply-To:
References: <1CD4559D-7A9D-4F8C-92F4-F5228F4E23B8@bils.se> <7FFAF2D2-3D32-40CF-8120-E6F858F74F1C@bils.se>
Message-ID:

The hints used by MAKER are CDSpart, exonpart, intronpart, and intron. You can play around with the extrinsic evidence configuration file if you want, but it's really not well documented, so I won't be able to provide much support.

Thanks,
Carson

From marc.hoeppner at bils.se Tue Jul 1 07:31:33 2014
From: marc.hoeppner at bils.se (Marc Höppner)
Date: Tue, 1 Jul 2014 14:31:33 +0200
Subject: [maker-devel] Some questions regarding ab-initio training
In-Reply-To:
References: <1CD4559D-7A9D-4F8C-92F4-F5228F4E23B8@bils.se> <7FFAF2D2-3D32-40CF-8120-E6F858F74F1C@bils.se>
Message-ID:

Hi,

sorry for resurrecting this topic.
The issue was about the use of ab initio predictions and artefacts in the final Maker gene builds.

I think one potential issue that hasn't been discussed here concerns Maker's use of the extrinsic config file when running Augustus. This file controls the "weights" of different types of hints when running Augustus. I don't think it is made clear anywhere which extrinsic config file Maker reads (from the logs it seems to be extrinsic.MPE.cfg). Nor is it suggested that it would be useful to manipulate this file to improve Augustus performance (and by extension Maker's performance). Finally, I am not entirely sure which sorts of hints Maker creates for Augustus and which hint categories these belong to (i.e. it makes no sense to tweak the intronpart malus factor if Maker does not create such hints). Perhaps it would be good to elaborate on that in the Maker documentation, since it seems to be quite relevant for obtaining better results. Or does such an explanation already exist somewhere?

/Marc

Marc P. Hoeppner, PhD
Team Leader
BILS Genome Annotation Platform
Department for Medical Biochemistry and Microbiology
Uppsala University, Sweden
marc.hoeppner at bils.se

On 05 Jun 2014, at 20:28, Carson Holt wrote:

> One thing you might want to try is adding another predictor like SNAP together with Augustus and then processing the MAKER results using EVM. We actually have a collaboration with the EVM group to produce a MAKER-EVM pipeline (MAKER 3.0). EVM will produce consensus models using the predictions and the evidence in the MAKER GFF3 which are generally better than just SNAP and Augustus with hints, so it might be able to remove some of the artifacts you are worried about.
>
> --Carson
>
> On 6/5/14, 12:24 PM, "Carson Holt" wrote:
>
>> Like I said. The predictors do the best they can, so there is probably something about the region that requires exons to be dropped or added to make the CDS, reading frame, or start/stop work. In several ant genomes we saw something like this caused by incorrect homopolymers in the assembly which force the predictor to slightly alter the intron/exon structure because otherwise the reading frame made no sense (the EST alignments were used to confirm that the assembly homopolymers were incorrect - lots of bad single base pair deletions).
>>
>> The way hints work is as follows. At the simplest level ab initio predictors are calculating the probability of being in different states (intergenic, intron, exon, etc.). The hints increase the probability of being in the intron state where MAKER gives an intron hint or being in an exon/CDS state when MAKER gives an exon/CDS hint. So this bends the likelihood of the ab initio gene predictor to call something similar in structure to the evidence overlapping it. That being said, if there is strong enough signal from something else in the sequence or my hints won't work with the splice sites in the region or the reading frame breaks, then no amount of hints can force augustus to make that model.
>>
>> --Carson
>>
>> On 6/5/14, 2:15 AM, "Marc Höppner" wrote:
>>
>>> Hi,
>>>
>>> thanks for the feedback. I spent some more time on this and am still somewhat unsatisfied with the whole thing...
>>>
>>> A few comments:
>>>
>>> I quite frequently see augustus and by extension Maker including exons that are not supported by EST/protein evidence and are not critical for the gene model (i.e. I can take them out and still get a proper CDS). Maybe I don't know enough about how Maker creates hints and, more importantly, what role these hints play for augustus, but I cannot really see a great effect (any, really) on the gene models even if both ESTs and proteins contradict an augustus gene model and the surplus exon is non-essential.
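Carson's description of how hints work, quoted above, can be illustrated with a toy calculation. This is only a minimal sketch under stated assumptions, not Augustus's actual scoring: the state names follow his wording, while the probabilities, the `bonus` factor, and the `apply_hint` helper are made up for illustration.

```python
def apply_hint(state_probs, hinted_state, bonus):
    """Scale the hinted state's probability by `bonus`, then renormalize."""
    boosted = {
        state: prob * (bonus if state == hinted_state else 1.0)
        for state, prob in state_probs.items()
    }
    total = sum(boosted.values())
    return {state: prob / total for state, prob in boosted.items()}


# Made-up probabilities for one position in the sequence (illustrative only).
ambiguous = {"intergenic": 0.5, "intron": 0.3, "exon": 0.2}
hinted = apply_hint(ambiguous, "exon", bonus=10.0)
print(max(hinted, key=hinted.get))  # the exon state now scores highest

# If the sequence rules a state out entirely (modelled here as probability 0,
# e.g. a broken reading frame), no bonus can bring it back.
impossible = {"intergenic": 0.7, "intron": 0.3, "exon": 0.0}
still_blocked = apply_hint(impossible, "exon", bonus=1e6)
print(still_blocked["exon"])  # stays at 0.0
```

The second case mirrors the caveat in the quoted message: a hint only bends the likelihood, so a model that the sequence itself makes impossible is not rescued by stronger hints.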
>>>
>>> (all evidence is provided as fasta files, protein2genome and est2genome are set to 0)
>>>
>>> As for the repeat library, I suppose this is a critical point. I am using repeats from a closely related species via Repeatmasker, modelled and filtered repeats from RepeatModeler and repeats derived from a high-coverage 454 data set. Not sure what else I can do to improve that.
>>>
>>> As for evidence, I am using the curated reference proteome from a related species (<5 million years), uniprot_swissprot and high-quality ESTs + 454 reads. I don't think it gets a whole lot better, in terms of what data can be used.
>>>
>>> So in summary, I just don't get where I want to using Augustus and Maker - specifically, the gene models are full of weird, unsupported artefacts despite manually curating > 850 models for training. I suppose I was hoping for some secret trick to improve on this - but I guess there is none? Actually, if I only do a pure evidence build (seeing that my input data is very high quality), it looks better - which sort of goes against the premise of Maker :/
>>>
>>> Regards,
>>>
>>> Marc
>>>
>>> Marc P. Hoeppner, PhD
>>> Team Leader
>>> Department for Medical Biochemistry and Microbiology
>>> Uppsala University, Sweden
>>> marc.hoeppner at bils.se
>>>
>>> On 27 May 2014, at 17:25, Carson Holt wrote:
>>>
>>>> Extra exons can be required for predictors to make sense of a region (they do the best they can). This can be due to imperfect assemblies or repeats. For plants the repeat database is the one thing that will most affect the annotation quality. You may need to spend some time building the best repeat library you can. The repeat library is the next most important thing after training the predictor, because repeats confuse the predictor (sometimes a lot), causing it to behave oddly in those regions (because repeats do encode real protein and protein fragments). Also, when running with MAKER now, make sure to include the entire proteome of a related species and not just UniProt, and you will get better performance. Now that you have Augustus trained, using it inside of MAKER with an improved repeat library and additional protein evidence should give it the feedback that will allow it to perform better than it would with just naked ab initio prediction.
>>>>
>>>> Thanks,
>>>> Carson
>>>>
>>>> On 5/27/14, 2:12 AM, "Marc Höppner" wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I wanted to get some feedback regarding the training of ab initio gene finders - it's not strictly Maker related, but I suppose there are many people on this list that have encountered and solved this issue in one way or another.
>>>>>
>>>>> Specifically, I am trying to train Augustus (and possibly SNAP) for a plant genome. This has always been a very frustrating process for me, but while I have a better idea now how to do it, I still don't get the sort of accuracy that I am hoping for. A quick run-through of my process:
>>>>>
>>>>> Evidence build with maker on level 1 and 2 proteins from Uniprot + Sanger-sequenced EST data
>>>>>
>>>>> Filtered for models with an AED <= 0.3
>>>>>
>>>>> Loaded that into WebApollo, together with an existing reference annotation and the evidence tracks
>>>>>
>>>>> Manually curated/selected 750 gene models using the following rules:
>>>>> - Must have start/stop codon
>>>>> - Must have as many exons as possible
>>>>> - Must agree with evidence
>>>>> - Must be >= 2 kb apart from other gene models (provided as flanking regions for augustus to train intergenic sequence)
>>>>>
>>>>> From these models, I created a GBK file, split it into 650 (train) and 100 (test) models and created a new profile using the documented procedure.
>>>>>
>>>>> But:
>>>>>
>>>>> While the naked ab initio models created through maker get a lot of genes "sort of right", I still see too many issues to be really satisfied. Problems include:
>>>>>
>>>>> - random exon calls which are not supported by any line of evidence (~1 per gene model, I would guess)
>>>>> - poor congruency with some gene models (especially ones not used for training/testing)
>>>>>
>>>>> Is there any best-practice guide on how to improve this? The Augustus website is unfortunately quite poor on detail... My impression so far is that ramping up the number of training models isn't really doing too much beyond a certain point (tried 400, 500 and 750).
>>>>>
>>>>> Regards,
>>>>>
>>>>> Marc
>>>>>
>>>>> Marc P. Hoeppner, PhD
>>>>> Team Leader
>>>>> BILS Genome Annotation Platform
>>>>> Department for Medical Biochemistry and Microbiology
>>>>> Uppsala University, Sweden
>>>>> marc.hoeppner at bils.se
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From rajesh.bommareddy at tu-harburg.de Thu Jul 3 09:45:59 2014
From: rajesh.bommareddy at tu-harburg.de (Rajesh Reddy Bommareddy)
Date: Thu, 03 Jul 2014 16:45:59 +0200
Subject: [maker-devel] Maker output
Message-ID: <53B56CA7.80108@tu-harburg.de>

Dear Maker group

I have run the example files provided with maker, but I am unable to understand the output. Where can I find the information about exons, CDS, protein sequence of the predicted CDS or mRNA and the predicted protein name for each contig?

Thanks and Regards
--
M.Sc. Rajesh Reddy Bommareddy
Institute of Bioprocess and Biosystems Engineering
Hamburg University of Technology(TUHH)
Denikestrasse 15, 20171 Hamburg
GERMANY.
e.mail: rajesh.bommareddy at tu-harburg.de
Phone:+4940428784011
Mobile:+4917663673522

From carsonhh at gmail.com Thu Jul 3 09:51:57 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 03 Jul 2014 08:51:57 -0600
Subject: [maker-devel] Maker output
In-Reply-To: <53B56CA7.80108@tu-harburg.de>
References: <53B56CA7.80108@tu-harburg.de>
Message-ID:

See the MAKER 2014 GMOD tutorial --> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014

Also watch the accompanying video --> http://youtu.be/uA96tSSaqLk

Results will be in GFF3 and FASTA format. The GFF3 file contains the location of each structure (exon/CDS/UTR) relative to the assembly. The FASTA file contains the sequence (transcript/protein). There will be separate files for each contig. Use gff3_merge and fasta_merge to generate merged genome-wide GFF3 and FASTA files.

An explanation of GFF3 format is here --> http://www.sequenceontology.org/gff3.shtml

Thanks,
Carson

From dence at genetics.utah.edu Mon Jul 7 09:24:33 2014
From: dence at genetics.utah.edu (Daniel Ence)
Date: Mon, 7 Jul 2014 14:24:33 +0000
Subject: [maker-devel] Help with updating an annotation
In-Reply-To:
References: <2176827D-6E54-4951-B01E-CDAC15DB3A2E@tuebingen.mpg.de> <997CF579-F509-4699-A366-30D53AD6281E@genetics.utah.edu> <90F65756-74A0-4D2F-A49F-42C5EDDB25E9@tuebingen.mpg.de>
Message-ID: <8219A0C0-DBB0-4417-8B4F-39D6D7F93B93@genetics.utah.edu>

Hi Saad,

I think that's correct. As a sub-step for each of the steps you listed, I would also choose one or two large scaffolds out of your assembly to use as a test set, and use that test set to make sure that you are getting output like you'd expect before running MAKER on the whole genome.

Let me know if there's anything else we can do to help.

~Daniel

Daniel Ence
Graduate Student
dence at genetics.utah.edu
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330

On Jul 7, 2014, at 7:08 AM, Saad Arif wrote:

Thanks for this. Would the following protocol be appropriate then, given I want to augment and merge an existing annotation with any novel genes:

i) Run the MAKER pipeline iteratively to generate an HMM for SNAP using my new RNAseq data and protein fastas from closely related organisms (with the est2genome and protein2genome options on).

ii) Turn off the est2genome and protein2genome options and run Maker with my RNAseq evidence, protein fastas, SNAP HMM and my current annotation as model_gff.

Thanks in advance for any input.
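The advice earlier in this message, to test on one or two large scaffolds before running MAKER on the whole genome, could be scripted roughly as follows. This is a minimal sketch assuming a plain multi-FASTA assembly; `read_fasta`, `largest_scaffolds`, and the tiny in-memory example are hypothetical helpers for illustration and are not part of MAKER.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA-formatted text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Use the first whitespace-delimited word of the header as the name.
            name, chunks = (line[1:].split() or [""])[0], []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def largest_scaffolds(handle, n=2):
    """Return the n longest (name, sequence) records, longest first."""
    records = sorted(read_fasta(handle), key=lambda rec: len(rec[1]), reverse=True)
    return records[:n]


# Tiny made-up assembly standing in for a real genome FASTA file.
demo = io.StringIO(
    ">scaffold_1 first\nACGTACGT\n"
    ">scaffold_2\nACGTACGTAC\nGTACGTACGT\n"
    ">scaffold_3\nACGTACGTACGT\n"
)
test_set = largest_scaffolds(demo, n=2)
for name, seq in test_set:
    print(f">{name}\n{seq}")
```

Writing the selected records to a separate FASTA file and pointing a trial run at that file keeps the test cheap; for a real assembly, `open("assembly.fasta")` (a placeholder filename) would replace the in-memory handle.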
best,
Saad

On 20 Jun 2014, at 23:42, Carson Holt wrote:

"I have to specify an ab initio gene model for any locus that I wish to annotate using evidence alignment (i.e. there must be a preexisting model)?"

Not exactly. You need to supply an HMM for SNAP or a species file for Augustus, etc. MAKER doesn't generate gene predictions, SNAP does. You cannot get updated models unless you've provided a way for those models to be updated. MAKER will provide SNAP/Augustus with hints to make them perform better based on the evidence, but those hints won't even be generated and the programs won't even run unless you provide the HMM. Also, if you provide models in GFF3 format to pred_gff, there is no hint feedback (because there is no program to receive the hints - just an immutable GFF3 file).

If you don't have an HMM for SNAP for your organism, you can generate one using the documentation here (from the GMOD 2014 tutorial) --> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Training_ab_initio_Gene_Predictors

--Carson

From: Saad Arif
Date: Wednesday, June 18, 2014 at 10:42 AM
To: Daniel Ence
Subject: Re: [maker-devel] Help with updating an annotation

Thanks Daniel. I think it's clearer to me now. So if I understand correctly: I have to specify an ab initio gene model for any locus that I wish to annotate using evidence alignment (i.e. there must be a preexisting model)? These ab initio gene models can be trained internally in Maker with SNAP using my cufflinks output as EST evidence. Alternatively, I can provide alternative ab initio predictions (for regions not present in my Ensembl reference passed to model_gff) for regions overlapping my cufflinks output via the pred_gff option?

Since I'm interested in unannotated regions, I'm also passing in reference proteomes of closely related species as protein homology evidence. As such I should be able to keep only evidence-supported predictions from my pred_gff (for regions not present in my model_gff, and/or better-supported models for regions that are present) and merge them with the Ensembl annotations from the model_gff?

Let me know if I'm still missing something here. Thanks in advance.

best,
Saad

On 18 Jun 2014, at 17:21, Daniel Ence wrote:

Hi Saad,

Maker doesn't view EST or protein evidence as gene models in themselves. There's a good reason for this. Aligners like blast don't guarantee complete gene models, with accurate start and stop codons and splice sites. With its default settings maker won't make a gene model unless there's evidence that overlaps an ab initio prediction (or something from the pred_gff option).

You can use est2genome to promote everything from the est_gff option to a gene model, but this will probably give you many spurious results. What you're saying with est2genome is, "Everything that this tool found is a complete gene model." I don't think that's true even for cufflinks output.

One of the gene predictors that can run internally is snap. It's really easy to train; here's a link to a tutorial for training it: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Training_ab_initio_Gene_Predictors

Let me know if that helps, or if you have more questions.

~Daniel

Daniel Ence
Graduate Student
dence at genetics.utah.edu
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330

On Jun 18, 2014, at 5:09 AM, Saad Arif wrote:

Thank you for the response. I still have one question though, with these options:

est_gff=cufflinksout.GFF
model_gff=ensembl reference.GFF

What happens to cufflinks-assembled transcripts that are not confined to current gene loci (i.e. novel genes in the cufflinks output)? Would I have to prepare ab initio gene predictions for each of these predicted 'new' genes? Is there a simple way to combine adding new genes with improving an existing annotation?

Any feedback on this would be greatly appreciated.

saad

On 13 Jun 2014, at 17:59, Carson Holt wrote:

Use the cufflinks features instead of the tophat features (tophat tends to be really noisy). Give the existing models to model_gff (they will then always be kept unless something better is found). There is no option to keep models and then just add isoforms. The model_gff input will either be kept as is (unchanged), or replaced with an updated model suggested by the evidence (the updated model may contain multiple isoforms though), and map_forward=1 can be used to pull names forward from the old models onto the new models.

Thanks,
Carson

On 6/13/14, 5:03 AM, "Saad Arif" wrote:

Dear All,

I would like to use the Maker pipeline to expand a current annotation (new isoforms and novel genes with respect to the current annotation) and was wondering if anyone had experience with this and/or suggestions regarding my questions.

Briefly: I have tophat splice junctions from RNAseq data or, alternatively, cufflinks-generated transcript models (FASTA format) that I want to use as my new data (est_gff or est). I want to provide the current Ensembl annotation for gene prediction but I want this annotation to remain unchanged. Hence, I'm not sure if I should provide this annotation as pred_gff or model_gff. Can the model_gff be used for gene prediction or is this just a subset of pred_gff that remains unaltered? Can we provide the same annotation for both options (pred_gff and model_gff)?

Importantly, my main goal is to use the new RNAseq data to add more isoforms and (any) novel genes to the existing Ensembl annotation. Any thoughts or suggestions on how to go about this would be sincerely appreciated.

Thanks in advance,
saad

From n.jue at uconn.edu Mon Jul 7 10:26:05 2014
From: n.jue at uconn.edu (Nathaniel Jue)
Date: Mon, 7 Jul 2014 11:26:05 -0400
Subject: [maker-devel] Couple quick questions about Maker
Message-ID:

Hi,

I'm trying to run maker on a couple of genomes right now and was wondering if folks had any thoughts on ways to speed it up a bit. I'm running it on a 48-processor supercomputer (lots of RAM, usually used for genome assembly). Both these genomes are a little fragmented, so there are lots of contigs, which slows down the whole process. I am looking for ways to speed things up and was wondering about a couple of things:

1) I'm currently just at the first round of maker predictions using EST and protein evidence to build models. I had already done the RepeatMasking, so I thought I'd just input the resulting GFF to speed things up. Maker didn't seem to like the GFF, so two questions: i) any thoughts on why that GFF wasn't acceptable? It's the one that repeatmasker outputs if you ask it to; and ii) providing this GFF should generally allow the program to bypass the RepeatMasking step, correct? Does it also make it bypass the Repeat ORF searching step?

2) I plan to run both SNAP and Augustus on these genomes as well. The two-step SNAP training from the tutorials seems straightforward, but I was wondering about the Augustus step.
From what I can tell, simply providing an Augustus "trained" species name should turn on Augustus and blast/blat-like hints generated within Maker are then used in gene prediction. Any thoughts on if it's either more accurate or faster to do the Augustus predictions outside of the Maker pipeline and then import them using the pred_gff parameter in the maker_opts file? 3) Finally, I noticed that you had a script for converting cegma gff files to zff file for snap training? Currently, I am using predicted transcript for this species and protein sequences from related species to training. Does anyone have any insight into using CEGMA results as well? Do you work iteratively with them? For instance, start with the using hints from more distant taxa (i.e. CEGMA) and then work your way closer? Just throw everything in at once and retrain after that? Thanks in advance for any advice and insight. Cheers, Nate *Nathaniel Jue, Ph.D.* Department of Molecular and Cell Biology University of Connecticut Storrs, CT 06269 [image: LinkedIn] -------------- next part -------------- An HTML attachment was scrubbed... URL: From dence at genetics.utah.edu Mon Jul 7 11:00:45 2014 From: dence at genetics.utah.edu (Daniel Ence) Date: Mon, 7 Jul 2014 16:00:45 +0000 Subject: [maker-devel] Couple quick questions about Maker In-Reply-To: References: Message-ID: <00469BE3-C2F8-45BB-8357-EC47CE0D7376@genetics.utah.edu> Hi Nathaniel, 1) We'll need to see the error messages that MAKER was giving to understand what might have gone wrong with the Repeat Masker gff3 file. If you could run maker on one of your scaffolds with your current settings and send us the complete output, we can start to figure out what happened. 2) MAKER interacts with its gene predictors (augustus, snap, and the other ones listed in the control files) in a way that improves their performance (with the hints and such). 
When you supply predictions through the pred_gff parameter, MAKER can't give that performance improvement, so there's something of a tradeoff there. I think the performance improvement is a key part of MAKER's success, so I would definitely recommend running the ab initio tools internally.

MAKER tries to save you time by keeping results from run to run and only rerunning tools (usually the BLAST tools) whose parameters were changed in the control files. Taking advantage of that will probably be the biggest time saver for you. Something else that could save you almost as much time would be to set a reasonable lower bound on the size of contigs that MAKER will try to annotate, so that contigs shorter than about 5 kbp or 10 kbp (depending on your genome) are skipped. This is set with the min_contig parameter.

I'll have to check with my lab mates about the Repeat ORF searching and how they use CEGMA results. I think you can probably just put them all in there at once, though.

~Daniel

Daniel Ence
Graduate Student
dence at genetics.utah.edu
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330
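Daniel's two suggestions (keep the predictors inside MAKER rather than importing them through pred_gff, and raise min_contig) come down to a few key=value lines in maker_opts.ctl. The following is only a minimal Python sketch of how those lines might be edited in bulk; the helper name set_ctl_options, the abbreviated control-file text, and the values my_species.hmm, my_species and 10000 are all illustrative assumptions, not anything prescribed in this thread.

```python
import re

def set_ctl_options(ctl_text, updates):
    """Return control-file text with the given key=value lines rewritten.

    Trailing '#' comments are kept; a key that is not present raises KeyError.
    """
    remaining = dict(updates)
    out_lines = []
    for line in ctl_text.splitlines():
        match = re.match(r"^(\w+)=([^#]*)(#.*)?$", line)
        if match and match.group(1) in remaining:
            key = match.group(1)
            comment = match.group(3) or ""
            line = f"{key}={remaining.pop(key)} {comment}".rstrip()
        out_lines.append(line)
    if remaining:
        raise KeyError(f"not found in control file: {sorted(remaining)}")
    return "\n".join(out_lines) + "\n"

# Abbreviated stand-in for a real maker_opts.ctl (assumption: only four lines shown).
example_ctl = (
    "snaphmm= #SNAP HMM file\n"
    "augustus_species= #Augustus gene prediction species model\n"
    "pred_gff= #ab-initio predictions from an external GFF3 file\n"
    "min_contig=1 #skip genome contigs below this length\n"
)

updated_ctl = set_ctl_options(example_ctl, {
    "snaphmm": "my_species.hmm",       # hypothetical file name
    "augustus_species": "my_species",  # hypothetical species name
    "min_contig": 10000,               # example threshold only
})
print(updated_ctl, end="")
```

Leaving pred_gff blank while setting snaphmm and augustus_species matches the recommendation to run the ab initio tools internally; the min_contig value should be chosen per genome.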
From carsonhh at gmail.com Mon Jul 7 11:26:43 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 07 Jul 2014 10:26:43 -0600
Subject: [maker-devel] Couple quick questions about Maker
In-Reply-To: <00469BE3-C2F8-45BB-8357-EC47CE0D7376@genetics.utah.edu>
References: <00469BE3-C2F8-45BB-8357-EC47CE0D7376@genetics.utah.edu>

I don't think RepeatMasker produces GFF3. I believe the -gff option gives GFF2, which is a pretty different format. Also, if you provide GFF3 files for repeats, you will still need to turn off repeat masking in the control files by blanking out the options. MAKER also runs a step called RepeatRunner against an internal database of transposable element proteins, which is probably still running (and is slow because it's a search in translated protein space).

For performance, you may want to give a larger max_dna_len for the MAKER run, given that you have a large-RAM machine. Also set all the depth_blast options in maker_bopts.ctl to 15 or 20.

CEGMA is convenient for training predictors because it finds genes that will be in every eukaryote (i.e. high confidence). You can combine these with est2genome/protein2genome results from MAKER if you want. You can then use the resulting HMM for a larger MAKER run with experimental evidence, and then train again on those results. But beware that there is rarely any benefit from training beyond that second round. More training actually tends to make things worse (the overtraining paradox).

--Carson

From n.jue at uconn.edu Mon Jul 7 12:21:50 2014
From: n.jue at uconn.edu (Nathaniel Jue)
Date: Mon, 7 Jul 2014 13:21:50 -0400
Subject: [maker-devel] Couple quick questions about Maker
References: <00469BE3-C2F8-45BB-8357-EC47CE0D7376@genetics.utah.edu>

Thanks for the input, guys. I'm guessing the error probably came from not turning off the repeat prediction, or from the GFF file itself. I have a script for converting RepeatMasker output to GFF, so maybe I'll just try that if I want to follow up on it. Thanks for the tips on parameter adjustments and thoughts on running the program as well.

Just a few more quick follow-up questions for you: is there a preferred method for stopping a job so that it can be restarted while getting the most benefit from the run logs, etc.? Or do I just Ctrl-C it and start over? It seems that if I adjust those parameter values, the run may restart from the very beginning, as changing the opts file sometimes does that. Is that to be expected? If so, should I just bite the bullet and restart from the beginning, or is it best to finish a run and then change options?

Thanks,
Nate

Nathaniel Jue, Ph.D.
Department of Molecular and Cell Biology
University of Connecticut
Storrs, CT 06269
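On the RepeatMasker GFF question discussed above: one way to get GFF3 is to convert RepeatMasker's tabular .out file directly. The sketch below is a rough illustration only. It assumes the usual whitespace-separated .out column layout (score, divergence, deletions, insertions, query name, query begin, query end, "(left)", strand as "+" or "C", repeat name, class/family, ...). The function name rm_out_to_gff3, the generated IDs, and the choice of match/match_part feature types are my assumptions; what MAKER actually accepts for repeat GFF3 input should be checked against its documentation before relying on this.

```python
from urllib.parse import quote

def rm_out_to_gff3(lines, source="repeatmasker"):
    """Yield GFF3 lines for each hit row of a RepeatMasker .out table."""
    yield "##gff-version 3"
    hit = 0
    for line in lines:
        fields = line.split()
        # Data rows start with an integer score; header and blank lines do not.
        if len(fields) < 14 or not fields[0].isdigit():
            continue
        hit += 1
        score, seqid = fields[0], fields[4]
        start, end = int(fields[5]), int(fields[6])
        strand = "-" if fields[8] == "C" else "+"
        # Percent-encode characters that are reserved in GFF3 attributes.
        name = quote(f"{fields[9]}|{fields[10]}", safe="()/|+-_.")
        parent = f"rm_hit_{hit}"  # hypothetical ID scheme
        cols = [seqid, source, "match", str(start), str(end), score, strand, "."]
        yield "\t".join(cols + [f"ID={parent};Name={name}"])
        cols[2] = "match_part"
        yield "\t".join(cols + [f"ID={parent}_part;Parent={parent}"])

# Made-up rows in .out layout, for illustration only.
sample_out = [
    "   SW   perc perc perc  query     position in query    matching  repeat",
    "score   div. del. ins.  sequence  begin end   (left)   repeat    class/family",
    "",
    "  463   1.3  0.6  1.7  contig1   1 1000 (5000) +  (TTAGGG)n  Simple_repeat  1 1000 (0) 1",
    "  239  20.0  0.0  0.0  contig2  50  300  (700) C  L1_example LINE/L1 (10) 900 650 2",
]
gff3_lines = list(rm_out_to_gff3(sample_out))
print("\n".join(gff3_lines))
```

If a converted file is supplied as repeat GFF3, Carson's point still applies: the other repeat-masking options in the control files have to be blanked out, otherwise masking is run again anyway.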
From carsonhh at gmail.com Mon Jul 7 12:26:34 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 07 Jul 2014 11:26:34 -0600
Subject: [maker-devel] Couple quick questions about Maker
References: <00469BE3-C2F8-45BB-8357-EC47CE0D7376@genetics.utah.edu>

Just ^C. If you change options, MAKER will restart at a point determined by what the change affects. Since repeat masking affects everything downstream, everything will start from zero. If it were a change like swapping the HMM or altering depth_blastn, it would be less disruptive and MAKER could reuse all existing raw reports. Unfortunately, that's not the case when altering repeat masking options.

--Carson

From n.jue at uconn.edu Tue Jul 8 10:56:37 2014
From: n.jue at uconn.edu (Nathaniel Jue)
Date: Tue, 8 Jul 2014 11:56:37 -0400
Subject: [maker-devel] Couple quick questions about Maker
References: <00469BE3-C2F8-45BB-8357-EC47CE0D7376@genetics.utah.edu>

Carson, one more question: any suggestions on how to combine the CEGMA and MAKER est2genome/protein2genome results? Can I just concatenate and sort the GFF files, or are there specific formatting issues I need to consider? No overlapping regions or something like that?

Thanks,
Nate

Nathaniel Jue, Ph.D.
Department of Molecular and Cell Biology
University of Connecticut
Storrs, CT 06269

From carsonhh at gmail.com Tue Jul 8 11:31:40 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 08 Jul 2014 10:31:40 -0600
Subject: [maker-devel] Couple quick questions about Maker
References: <00469BE3-C2F8-45BB-8357-EC47CE0D7376@genetics.utah.edu>

Convert them both to ZFF, then concatenate the ZFF and sequence files.
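The concatenation step above can be sketched in a few lines of Python. This assumes each source (the MAKER est2genome/protein2genome models and the CEGMA models) has already been converted into a ZFF annotation file plus a matching sequence file, for example with the maker2zff and cegma2zff scripts. The function name concat_zff and every file name below are hypothetical. I am not sure how SNAP's training tools treat a sequence ID that occurs in both inputs, so the sketch only reports such IDs and leaves the decision to the reader.

```python
import tempfile
from pathlib import Path

def concat_zff(pairs, out_ann, out_dna):
    """Concatenate (annotation, sequence) ZFF file pairs.

    Returns the sequence IDs that appear in more than one annotation header.
    """
    seen, repeated = set(), []
    with open(out_ann, "w") as ann_out, open(out_dna, "w") as dna_out:
        for ann_path, dna_path in pairs:
            with open(ann_path) as ann_in:
                for line in ann_in:
                    line = line.rstrip("\n")
                    if line.startswith(">"):
                        parts = line[1:].split()
                        seqid = parts[0] if parts else ""
                        if seqid in seen:
                            repeated.append(seqid)
                        seen.add(seqid)
                    ann_out.write(line + "\n")
            with open(dna_path) as dna_in:
                for line in dna_in:
                    dna_out.write(line.rstrip("\n") + "\n")
    return repeated

# Tiny made-up inputs so the sketch runs without real MAKER or CEGMA output.
with tempfile.TemporaryDirectory() as tmp_name:
    tmp = Path(tmp_name)
    (tmp / "maker.ann").write_text(">contig1\nEinit\t201\t325\tgene1\nEterm\t410\t500\tgene1\n")
    (tmp / "maker.dna").write_text(">contig1\nACGTACGT\n")
    (tmp / "cegma.ann").write_text(">contig1\nEsngl\t900\t1200\tcore1\n>contig7\nEsngl\t10\t300\tcore2\n")
    (tmp / "cegma.dna").write_text(">contig1\nACGTACGT\n>contig7\nTTGGCCAA")
    repeated_ids = concat_zff(
        [(tmp / "maker.ann", tmp / "maker.dna"), (tmp / "cegma.ann", tmp / "cegma.dna")],
        tmp / "all.ann",
        tmp / "all.dna",
    )
    merged_ann = (tmp / "all.ann").read_text()
    merged_dna = (tmp / "all.dna").read_text()

print("IDs present in more than one input:", repeated_ids)
```

The merged pair would then go through the usual SNAP training steps from the tutorial in place of a single-source pair.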

--Carson

From: Nathaniel Jue
Date: Tuesday, July 8, 2014 at 9:56 AM
To: Carson Holt
Cc: Daniel Ence , ""
Subject: Re: [maker-devel] Couple quick questions about Maker

Carson, one more question: Any suggestions on how to combine the cegma and maker est2genome/protein2genome results? Can I just concatenate and sort the gff files, or are there specific formatting issues I need to consider? No overlapping regions or something like that?

Thanks,
Nate

Nathaniel Jue, Ph.D.
Department of Molecular and Cell Biology
University of Connecticut
Storrs, CT 06269

On Mon, Jul 7, 2014 at 12:26 PM, Carson Holt wrote:
> I don't think RepeatMasker produces GFF3. I believe it is GFF2 with the -gff
> option (which is pretty different). Also, if you provide GFF3 files for
> repeats, you will still need to turn off repeat masking in the control files
> by blanking out the options. Also, MAKER uses a step called RepeatRunner
> against an internal transposable element protein database, which is probably
> still running (and is slow because it's a search in translated protein space).
>
> For performance, you may want to give a larger max_dna_len for the MAKER run
> given that you have a large RAM machine. Also set all the depth_blast options
> in maker_bopts.ctl to 15 or 20.
>
> CEGMA is convenient for training predictors because it finds genes that will
> always be in every eukaryote (i.e. high confidence). You can combine these
> with est2genome/protein2genome results from MAKER if you want. You can then
> use the resulting HMM for a larger MAKER run with experimental evidence, and
> then train again on those results. But beware that there is rarely any
> benefit from training beyond that second round. More training actually tends
> to make things worse (the overtraining paradox).
>
> --Carson
>
> From: Daniel Ence
> Date: Monday, July 7, 2014 at 10:00 AM
> To: Nathaniel Jue
> Cc: ""
> Subject: Re: [maker-devel] Couple quick questions about Maker
>
> Hi Nathaniel,
>
> 1) We'll need to see the error messages that MAKER was giving to understand
> what might have gone wrong with the RepeatMasker gff3 file. If you could run
> maker on one of your scaffolds with your current settings and send us the
> complete output, we can start to figure out what happened.
>
> 2) MAKER interacts with its gene predictors (augustus, snap, and the other
> ones listed in the control files) in a way that improves their performance
> (with the hints and such). When you supply predictions through the pred_gff
> parameter, MAKER can't give that performance improvement, so there's something
> of a tradeoff there. I think the performance improvement is a key part of
> MAKER's success, so I would definitely recommend running the ab-initio tools
> internally.
>
> MAKER tries to save you time by saving results from run to run and only
> rerunning tools (usually blast tools) that had their parameters changed in the
> control files. Taking advantage of that will probably be the biggest time
> saver for you. Something else that could save you almost as much time would be
> to set a reasonable lower bound on the size of contigs that maker will try to
> annotate (usually skipping contigs <5kbp or <10kbp, depending on your genome).
> This is set with the min_contig parameter.
>
> I'll have to check with my lab mates about the Repeat ORF searching and how
> they use CEGMA results. I think you can probably just put them all in there at
> once though.
>
> ~Daniel
>
> Daniel Ence
> Graduate Student
> dence at genetics.utah.edu
> Eccles Institute of Human Genetics
> University of Utah
> 15 North 2030 East, Room 2100
> Salt Lake City, UT 84112-5330
>
> On Jul 7, 2014, at 9:26 AM, Nathaniel Jue wrote:
>
>> Hi,
>>
>> I'm trying to run maker on a couple of genomes right now and was wondering if
>> folks had any thoughts on ways to speed it up a bit. I'm running it on a
>> 48-processor supercomputer (lots of RAM, usually use it for genome assembly).
>> Both these genomes are a little fragmented, so there are lots of contigs,
>> which slows down the whole process. I am looking for ways to speed things up
>> and was wondering about a couple of things:
>>
>> 1) I'm currently just at the first round of maker predictions using EST and
>> protein evidence to build models. Had already done RepeatMasking so thought
>> I'd just input the resulting GFF to speed it up. Didn't seem to like the GFF, so
>> two questions: i) any thoughts on why that GFF wasn't acceptable? It's the
>> one that repeatmasker outputs if you ask it to; and ii) providing this GFF
>> should generally allow the program to bypass the RepeatMasking step, correct?
>> Does it also make it bypass the Repeat ORF searching step?
>>
>> 2) I plan to run both SNAP and Augustus on these genomes as well. The
>> two-step SNAP training from the tutorials seems straightforward, but I was
>> wondering about the Augustus step. From what I can tell, simply providing an
>> Augustus "trained" species name should turn on Augustus, and blast/blat-like
>> hints generated within Maker are then used in gene prediction. Any thoughts
>> on if it's either more accurate or faster to do the Augustus predictions
>> outside of the Maker pipeline and then import them using the pred_gff
>> parameter in the maker_opts file?
>>
>> 3) Finally, I noticed that you had a script for converting cegma gff files to
>> zff files for snap training? Currently, I am using predicted transcripts for
>> this species and protein sequences from related species for training. Does
>> anyone have any insight into using CEGMA results as well? Do you work
>> iteratively with them? For instance, start with using hints from more
>> distant taxa (i.e. CEGMA) and then work your way closer? Just throw
>> everything in at once and retrain after that?
>>
>> Thanks in advance for any advice and insight.
>>
>> Cheers,
>> Nate
>>
>> Nathaniel Jue, Ph.D.
>> Department of Molecular and Cell Biology
>> University of Connecticut
>> Storrs, CT 06269

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From saad.arif at tuebingen.mpg.de Mon Jul 7 08:08:53 2014
From: saad.arif at tuebingen.mpg.de (Saad Arif)
Date: Mon, 7 Jul 2014 15:08:53 +0200
Subject: [maker-devel] Help with updating an annotation
In-Reply-To:
References: <2176827D-6E54-4951-B01E-CDAC15DB3A2E@tuebingen.mpg.de> <997CF579-F509-4699-A366-30D53AD6281E@genetics.utah.edu> <90F65756-74A0-4D2F-A49F-42C5EDDB25E9@tuebingen.mpg.de>
Message-ID:

Thanks for this. Would the following protocol be appropriate then, given that I want to augment and merge an existing annotation with any novel genes:

i) Run the MAKER pipeline iteratively to generate an HMM for SNAP using my new RNAseq data and protein fastas from closely related organisms (with the est2genome and protein2genome options on).

ii) Turn off the est2genome and protein2genome options and run Maker with my RNAseq evidence, protein fastas, SNAP HMM and my current annotation as model_GFF.
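
The two runs described above could be written down as control-file settings roughly like this. It is only a sketch: every file name is a placeholder, and `est`, `protein` and `snaphmm` are option names as I recall them from maker_opts.ctl rather than anything spelled out in this thread, so please check them against your own control file. `est2genome`, `protein2genome`, `model_gff` and `map_forward` are the options mentioned in the earlier replies.

```python
# Run i): build training models directly from the evidence alignments.
TRAINING_RUN = {
    "est": "rnaseq_transcripts.fasta",    # placeholder file name
    "protein": "related_proteins.fasta",  # placeholder file name
    "est2genome": 1,
    "protein2genome": 1,
    "snaphmm": "",                        # no HMM yet in the first round
    "model_gff": "",
}

# Run ii): evidence plus trained SNAP, with the existing annotation passed in.
ANNOTATION_RUN = {
    "est": "rnaseq_transcripts.fasta",
    "protein": "related_proteins.fasta",
    "est2genome": 0,
    "protein2genome": 0,
    "snaphmm": "my_species.hmm",          # placeholder: HMM trained in run i)
    "model_gff": "ensembl_reference.gff3",
    "map_forward": 1,                     # carry old names onto updated models
}


def render_opts(settings):
    """Format a dict of settings as key=value lines, one per option."""
    return "".join(f"{key}={value}\n" for key, value in settings.items())


if __name__ == "__main__":
    print(render_opts(TRAINING_RUN))
    print(render_opts(ANNOTATION_RUN))
```

The output is meant to be compared against (or pasted into) the matching lines of a generated maker_opts.ctl, not used as a complete control file on its own.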

Thanks in advance for any input.

best,
Saad

On 20 Jun 2014, at 23:42, Carson Holt wrote:

> "I have to specify an ab initio gene model for any locus that I wish to annotate using evidence alignment (i.e. there must be a preexisting model)?"
>
> Not exactly. You need to supply an HMM for SNAP or a species file for Augustus, etc. MAKER doesn't generate gene predictions, SNAP does. You cannot get updated models unless you've provided a way for those models to be updated. MAKER will provide SNAP/Augustus with hints to make them perform better based on the evidence, but those hints won't even be generated and the programs won't even run unless you provide the HMM. Also, if you provide models in gff3 format to pred_gff, there is no hint feedback (because there is no program to receive the hints - just an immutable GFF3 file).
>
> If you don't have an HMM for SNAP for your organism, you can generate one using the documentation here (from GMOD 2014 tutorial) --> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Training_ab_initio_Gene_Predictors
>
> --Carson
>
> From: Saad Arif
> Date: Wednesday, June 18, 2014 at 10:42 AM
> To: Daniel Ence
> Cc: ""
> Subject: Re: [maker-devel] Help with updating an annotation
>
> Thanks Daniel. I think it's more clear to me now.
>
> So if I understand correctly now: I have to specify an ab initio gene model for any locus that I wish to annotate using evidence alignment (i.e. there must be a preexisting model)? These ab initio gene models can be trained internally in Maker with SNAP using my cufflinks output as EST evidence. Alternatively, I can provide alternative ab initio predictions (for regions not present in my ensembl ref passed to model_GFF) for regions overlapping my cufflinks output via the pred_GFF option?
>
> Since I'm interested in unannotated regions, I'm also passing in reference proteomes of closely related species as protein homology evidence.
>
> As such I should be able to keep only evidence-supported predictions (for regions not present in my model_GFF, and/or better-supported models for present regions) from my pred_GFF and merge them with Ensembl annotations from the model_GFF?
>
> Let me know if I'm still missing something here.
>
> Thanks in advance.
>
> best,
> Saad
>
> On 18 Jun 2014, at 17:21, Daniel Ence wrote:
>
>> Hi Saad,
>>
>> Maker doesn't view EST or protein evidence as gene models in themselves. There's a good reason for this. Aligners like blast don't guarantee complete gene models, with accurate start and stop codons and splice sites. With its default settings maker won't make a gene model unless there's evidence that overlaps an ab-initio prediction (or something from the pred_gff option).
>>
>> You can use est2genome to promote everything from the est_gff option to a gene model, but this will probably give you many spurious results. What you're saying with est2genome is, "Everything that this tool found is a complete gene model." I don't think that's true even for cufflinks output.
>>
>> One of the gene predictors that can run internally is snap. It's really easy to train; here's a link to a tutorial for training it: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Training_ab_initio_Gene_Predictors
>>
>> Let me know if that helps, or if you have more questions.
>>
>> ~Daniel
>>
>> Daniel Ence
>> Graduate Student
>> dence at genetics.utah.edu
>> Eccles Institute of Human Genetics
>> University of Utah
>> 15 North 2030 East, Room 2100
>> Salt Lake City, UT 84112-5330
>>
>> On Jun 18, 2014, at 5:09 AM, Saad Arif wrote:
>>
>>> Thank you for the response. I still have one question though, with these options:
>>>
>>> est_GFF=cufflinksout.GFF
>>>
>>> model_GFF= ensembl reference.GFF
>>>
>>> What happens to cufflinks assembled transcripts that are not confined to current gene loci (i.e. novel genes in cufflinks output)? Would I have to prepare ab initio gene predictions for each of these predicted 'new' genes?
>>> Is there a simple way to combine adding new genes and improving an existing annotation?
>>>
>>> Any feedback on this would be greatly appreciated.
>>>
>>> saad
>>>
>>> On 13 Jun 2014, at 17:59, Carson Holt wrote:
>>>
>>>> Use the cufflinks instead of the tophat features (tophat tends to be
>>>> really noisy). Give the existing models to model_gff (they will then
>>>> always be kept unless something better is found). There is no option to
>>>> keep models and then just add isoforms. The model_gff input will either
>>>> be kept as is (unchanged), or replaced with an updated model suggested by
>>>> the evidence (the updated model may contain multiple isoforms though), and
>>>> map_forward=1 can be used to pull names forward from the old model onto
>>>> the new models.
>>>>
>>>> Thanks,
>>>> Carson
>>>>
>>>> On 6/13/14, 5:03 AM, "Saad Arif" wrote:
>>>>
>>>>> Dear All,
>>>>>
>>>>> I would like to use the Maker pipeline to expand a current annotation (new
>>>>> isoforms and novel genes with respect to the current annotation) and was
>>>>> wondering if anyone had experience with this and/or suggestions for my
>>>>> questions.
>>>>>
>>>>> Briefly:
>>>>>
>>>>> I have tophat splice junctions from RNAseq data or, alternatively,
>>>>> cufflinks-generated transcript models (fasta format) that I want to use
>>>>> as my new data (est_gff or est).
>>>>>
>>>>> I want to provide the current Ensembl annotation for gene prediction but
>>>>> I want this annotation to remain unchanged. Hence, I'm not sure if I
>>>>> should provide this annotation as pred_gff
>>>>> or model_gff. Can the model_gff be used for gene prediction or is this
>>>>> just a subset of pred_gff that remains unaltered? Can we provide the same
>>>>> annotation for both options (pred_gff and model_gff)?
>>>>>
>>>>> Importantly, my main goal is to use the new RNAseq data to add more
>>>>> isoforms and (any) novel genes to the existing Ensembl annotation. Any
>>>>> thoughts or suggestions on how to go about this would be sincerely
>>>>> appreciated.
>>>>>
>>>>> Thanks in advance,
>>>>> saad

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From carsonhh at gmail.com Thu Jul 10 16:38:52 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 10 Jul 2014 15:38:52 -0600
Subject: [maker-devel] Couple quick questions about Maker
In-Reply-To:
References: <00469BE3-C2F8-45BB-8357-EC47CE0D7376@genetics.utah.edu>
Message-ID:

Just rerun in the same directory and it will reuse the old reports, so it won't have to rerun RepeatMasker etc.

--Carson

From: Nathaniel Jue
Date: Thursday, July 10, 2014 at 3:36 PM
To: Carson Holt
Subject: Re: [maker-devel] Couple quick questions about Maker

Is there a way to bypass the repeat prediction after it's done the first time? Seems like when I went to redo the snap training, it decided to redo the repeat masking as well. Does it always do that? Maybe I'm misinterpreting something? If there is a way to give it a maker-generated gff to bypass that step, could you tell me where to find it?

Thanks,
Nate

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From carsonhh at gmail.com Thu Jul 10 16:44:48 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 10 Jul 2014 15:44:48 -0600
Subject: [maker-devel] Couple quick questions about Maker
In-Reply-To:
References: <00469BE3-C2F8-45BB-8357-EC47CE0D7376@genetics.utah.edu>
Message-ID:

Also, you can use repeat_gff in the control files, but I prefer just to rerun in the same directory as the previous job.

--Carson

From: Carson Holt
Date: Thursday, July 10, 2014 at 3:38 PM
To: Nathaniel Jue
Cc: ""
Subject: Re: [maker-devel] Couple quick questions about Maker

Just rerun in the same directory and it will reuse the old reports, so it won't have to rerun RepeatMasker etc.

--Carson

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From fbarreto at ucsd.edu Thu Jul 10 17:02:57 2014
From: fbarreto at ucsd.edu (Felipe Barreto)
Date: Thu, 10 Jul 2014 15:02:57 -0700
Subject: [maker-devel] Problem installing exonerate during Maker setup
Message-ID:

Hi experts,

I am trying to install Maker on a new machine (running Mac OS 10.7.5), and have succeeded so far except for the "./Build exonerate" step, which gives me the following error:

checking for socklen_t... yes
checking for pkg-config... no
ERROR: Could not find pkg-config ... is glib-2 installed ???

Fink for 64-bit is installed, and via 'fink list', I confirmed that glib2-dev and -shlibs are installed. I uninstalled and re-installed both fink and glib2 several times, hoping it was a configuration problem, but I still get stuck at this step.

I found a few previous questions about this issue in this forum, but the solutions Carson provided were apparently directed at OS 10.6 only, so I did not try them. I have run into the limit of what I know how to do with these compilations.

I tried setting up Exonerate directly but it has trouble finding glib as well.

Any suggestions?

Thank you so much!
--
Felipe Barreto
Post-doctoral Scholar
Scripps Institution of Oceanography
University of California, San Diego
La Jolla, CA 92093

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From fbarreto at ucsd.edu Thu Jul 10 18:41:59 2014
From: fbarreto at ucsd.edu (Felipe Barreto)
Date: Thu, 10 Jul 2014 16:41:59 -0700
Subject: [maker-devel] Problem installing exonerate during Maker setup
In-Reply-To:
References:
Message-ID:

OK, before anyone spends too much of their time trying to help me... I think I was able to solve my issue above.

What I did was to install an additional glib2-related package using fink install. I installed glibmm2.4-dev, which also installs glibmm2.4-shlib. These make up a C++ interface for the glib2 library, according to their description. Once I installed those packages, I re-ran ./Build exonerate and it seemed to work. I tried an exonerate command in Terminal and it was recognized OK.

Hopefully what I did won't cause any issues down the line.

Thanks.

--
Felipe Barreto
Post-doctoral Scholar
Scripps Institution of Oceanography
University of California, San Diego
La Jolla, CA 92093

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From panos.ioannidis at gmail.com Fri Jul 11 06:56:03 2014
From: panos.ioannidis at gmail.com (Panos Ioannidis)
Date: Fri, 11 Jul 2014 13:56:03 +0200
Subject: [maker-devel] (no subject)
Message-ID:

I got back to my annotations this past week and have a couple of questions!
1) Since my organism isn't closely related with any other that's already sequenced, I will have to run maker twice (according to the tutorial). So for the first run I see that some people use only the ESTs and some others use ESTs and a protein database (CEGMA, Uniref50, Swiss-Prot, etc). I guess that the ESTs will give better models, but for the cases where genes aren't covered by an EST, it's okay to have a protein database to detect them as well. Am I right? What do you think? 2) In case I use both ESTs and a protein database how should I set the est2genome and protein2genome parameters in the maker_opts.ctl file? Should they both equal to "1"? 3) I've been thinking of running the BLAST searches separately and giving Maker directly the results. I guess that in this case, I'll have to first convert the BLAST output to a gff3 file and give it to the protein_gff parameter, right? Thanks, Panos -------------- next part -------------- An HTML attachment was scrubbed... URL: From dence at genetics.utah.edu Fri Jul 11 09:08:43 2014 From: dence at genetics.utah.edu (Daniel Ence) Date: Fri, 11 Jul 2014 14:08:43 +0000 Subject: [maker-devel] (no subject) In-Reply-To: References: Message-ID: Hi Panos, 1) You'll only use est2genome and protein2genome for creating models that will be used for training the ab-initio predictors (like SNAP). Sometimes that means one run of MAKER for training; sometimes that means two runs of MAKER. You usually don't gain any accuracy after the second round of training. It's ok to use both EST and protein data for this training step. 2) If you're using both ESTs and protein sequence to train your ab-initio predictors, then both est2genome and protein2genome should be set to 1. 

3) If you want to pass BLAST results to MAKER, you'll need to pass those results as GFF3. But MAKER will install and run BLAST for you, and does a good job of keeping track of all those results and making them accessible to you in the end, so it's going to be a lot of extra work to do those BLAST searches on your own outside of MAKER. I seriously suggest that you use the BLAST internal to MAKER.

Daniel Ence
Graduate Student
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330
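
For point 2 above, here is a rough sketch of flipping both options in a control file from the shell. It assumes the usual "name=value #comment" layout of maker_opts.ctl; the two-line file it edits is only a stand-in with made-up comment text, and set_opt is a hypothetical helper name, not something MAKER ships.

```shell
#!/bin/sh
# Sketch only: enable est2genome and protein2genome for a training run.
# The file written below is a stand-in for a real maker_opts.ctl; the
# text after '#' is illustrative, not copied from MAKER.
ctl=$(mktemp)
cat > "$ctl" <<'EOF'
est2genome=0 #illustrative comment
protein2genome=0 #illustrative comment
EOF

# Replace the value of one option, leaving the trailing comment untouched.
# Written without 'sed -i' so it behaves the same with GNU and BSD sed.
set_opt() {
    # $1 = option name, $2 = new value, $3 = control file
    sed "s/^$1=[^ #]*/$1=$2/" "$3" > "$3.tmp" && mv "$3.tmp" "$3"
}

set_opt est2genome 1 "$ctl"
set_opt protein2genome 1 "$ctl"

# Show the two edited lines.
grep -E '^(est2genome|protein2genome)=' "$ctl"
```

On a real run you would point set_opt at your own maker_opts.ctl (after making a backup) instead of the temporary file.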

From panos.ioannidis at gmail.com Mon Jul 14 02:20:50 2014
From: panos.ioannidis at gmail.com (Panos Ioannidis)
Date: Mon, 14 Jul 2014 09:20:50 +0200
Subject: [maker-devel] (no subject)
In-Reply-To:
References:
Message-ID:

Daniel, thanks for the info.

Regarding (3), the only reason I'm thinking of running the BLAST searches separately is that I'm currently not able to run Maker on our cluster due to a problem in the Perl "forks" library. And it looks like there isn't much I can do about it; I tried Perlbrew but it doesn't work when I try to install versions <5.18 (the problem in forks occurs on 5.18 and later versions). Our admin also tried to change the code in the forks.pm file as per Carson's suggestion in another thread, but that didn't work either... As a result I'm running Maker on my workstation (really slooow) till a solution is found, and since BLAST is a time-consuming step I was thinking of running it separately.

From carsonhh at gmail.com Mon Jul 14 09:46:50 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 14 Jul 2014 08:46:50 -0600
Subject: [maker-devel] (no subject)
In-Reply-To:
References:
Message-ID:

If you do the BLASTs yourself, the results could be dramatically worse. The filtering and polishing done by MAKER is rather significant (direct BLAST is actually worse with homology searches than many people realize).

With respect to forks.pm, your admin most likely edited the wrong forks.pm. There may be more than one on your system. If you let maker install some prerequisites for you (because it requires a specific version of forks.pm), it may be in .../maker/perl/lib/forks.pm. Otherwise you have to identify the exact location of the forks.pm being used. Or if he is editing it as part of the install tarball, his edits may actually be undone during the installation procedure.

Use this command line to identify the location of the forks.pm module that would have to be edited -->
maker --debug 2>&1 | grep "forks.pm"

You can even send me a copy of the file once it has been edited, and I can tell you if it was done correctly.

--Carson

_______________________________________________
maker-devel mailing list
maker-devel at box290.bluehost.com
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From carsonhh at gmail.com Mon Jul 14 09:49:33 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 14 Jul 2014 08:49:33 -0600
Subject: [maker-devel] (no subject)
In-Reply-To:
References:
Message-ID:

Also one more question. What is the exact error text you get for the forks error? Is it a forks.pm error or is it an MPI warn-on-fork error (the two are actually very different)?

--Carson

From panos.ioannidis at gmail.com Tue Jul 15 01:59:18 2014
From: panos.ioannidis at gmail.com (Panos Ioannidis)
Date: Tue, 15 Jul 2014 08:59:18 +0200
Subject: [maker-devel] (no subject)
In-Reply-To:
References:
Message-ID:

I didn't know there could be more than one forks.pm file! We'll give it another try later today.

As for the error, it's just "Segmentation fault"! And we know this segfault is because of forks.pm, because if you remove the "use forks;" line, script execution continues without a segfault (till it crashes later for another reason, of course). In fact, even if you create a script with just the line "use forks;" and try to run it, you'll get a segfault. So it looks like it's something pretty general and serious, and I'm really surprised I can't find anything by googling (except your fix!)...

From carsonhh at gmail.com Tue Jul 15 08:58:20 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 15 Jul 2014 07:58:20 -0600
Subject: [maker-devel] (no subject)
In-Reply-To:
References:
Message-ID:

If you are getting a segfault.
It is more likely an MPI error, especially if you are using OpenMPI or MVAPICH2. They both use OpenFabrics libraries that have bugs on forks and system calls.

If it is OpenMPI, run the following command before launching MAKER -->
export LD_PRELOAD=?/openmpi_location/lib/libmpi.so

Make sure to replace openmpi_location with the location of your OpenMPI.

Also add the following to your MPI command before running MAKER.
--> -mca btl ^openib
Example --> mpiexec -mca btl ^openib -n 40 maker

If you are using MVAPICH2, then you need to switch to OpenMPI.

--Carson

From panos.ioannidis at gmail.com Tue Jul 15 09:03:12 2014
From: panos.ioannidis at gmail.com (Panos Ioannidis)
Date: Tue, 15 Jul 2014 16:03:12 +0200
Subject: [maker-devel] (no subject)
In-Reply-To:
References:
Message-ID:

Carson, many thanks for the info!

I haven't installed Maker with MPI support. Is this segfault only occurring when you install it with MPI support?
From carsonhh at gmail.com Tue Jul 15 09:10:24 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 15 Jul 2014 08:10:24 -0600
Subject: [maker-devel] (no subject)
In-Reply-To:
References:
Message-ID:

If you don't have MPI support, it's not an issue, and your segfault is
likely something else. Your reference to perl 5.18 and forks.pm should not
be a segfault error either, and would not represent your error. The Perl
5.18/forks.pm problem is a different issue where perl actually tells itself
to die because hash reshuffling isn't safe, whereas segfaults are caused by
binary corruption or incorrect memory access (very different issues). I'd
actually recommend a full perl reinstall if you are getting segfaults,
because it suggests a deeper issue.

--Carson

From: Panos Ioannidis
Date: Tuesday, July 15, 2014 at 8:03 AM
To: Carson Holt
Cc: Daniel Ence , maker-devel
Subject: Re: [maker-devel] (no subject)

Carson, many thanks for the info!

I haven't installed Maker with MPI support. Is this segfault only occurring
when you install it with MPI support?

On Tue, Jul 15, 2014 at 3:58 PM, Carson Holt wrote:
> If you are getting a segfault, it is more likely an MPI error, especially if
> you are using OpenMPI or MVAPICH2. They both use OpenFabrics libraries that
> have bugs on forks and system calls.
>
> If it is OpenMPI, run the following command before launching MAKER -->
> export LD_PRELOAD=?/openmpi_location/lib/libmpi.so
>
> Make sure to replace openmpi_location with the location of your OpenMPI.
>
> Also add the following to your MPI command before running MAKER.
> --> -mca btl ^openib
> Example --> mpiexec -mca btl ^openib -n 40 maker
>
> If you are using MVAPICH2, then you need to switch to OpenMPI.
>
> --Carson
>
> From: Panos Ioannidis
> Date: Tuesday, July 15, 2014 at 12:59 AM
> To: Carson Holt
> Cc: Daniel Ence , maker-devel
> Subject: Re: [maker-devel] (no subject)
>
> I didn't know there are more than one forks.pm files! We'll
> give it another try later today.
>
> As for the error, it's just "Segmentation fault"! And we know this segfault is
> because of forks.pm, because if you remove the "use forks;"
> line script execution continues without segfault (till it crashes later for
> another reason, of course). In fact, even if you create a script with just the
> line "use forks;" and try to run it, you'll get a segfault. So it looks like
> it's something pretty general and serious, and I'm really surprised I can't
> find anything by googling (except your fix!)...

From panos.ioannidis at gmail.com Wed Jul 16 07:26:56 2014
From: panos.ioannidis at gmail.com (Panos Ioannidis)
Date: Wed, 16 Jul 2014 14:26:56 +0200
Subject: [maker-devel] (no subject)
In-Reply-To:
References:
Message-ID:

Thanks for the help guys.
All really helpful! Unfortunately, a full Perl reinstall is not an option at
this moment on this machine, since it's used by others in our group and they
need it running... I'll try to find another solution with our admin.

From carsonhh at gmail.com Wed Jul 16 09:04:55 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 16 Jul 2014 08:04:55 -0600
Subject: [maker-devel] (no subject)
In-Reply-To:
References:
Message-ID:

You don't have to do a system-wide install. It is incredibly easy to have
multiple installations of Perl. Perlbrew, for example, makes it easy to
install and switch between multiple versions rapidly (and doesn't affect the
system install) --> http://perlbrew.pl

You can then test. The perl installation used by different programs is
determined by the '#!' header in the executable script and not by the
default location of your system's perl (look at the first line in
.../maker/bin/maker and you will see what I mean). This value gets set
during the initial installation, and whatever perl path you use to run
MAKER's Build.PL script will end up being the one used to run MAKER, even if
the system perl is different.

--Carson

-------------- next part --------------
An HTML attachment was scrubbed...
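The point about the '#!' header can be checked directly. The snippet below is a minimal sketch, not an official MAKER or Perlbrew procedure: the helper name `script_interpreter` is made up for illustration, and the Perl version and all paths in the commented workflow are assumptions that would need to be adapted to the local system.

```shell
# Print the interpreter named on a script's '#!' line, without the '#!'.
# Hypothetical helper for checking which perl a MAKER install is tied to.
script_interpreter() {
    # Only the first line matters; strip '#!' and any spaces right after it.
    head -n 1 "$1" | sed -n 's/^#![[:space:]]*//p'
}

# Assumed workflow, left as comments because versions and paths are site-specific:
#   perlbrew install perl-5.16.3     # any release older than 5.18
#   perlbrew use perl-5.16.3         # affects the current shell only
#   cd /path/to/maker/src && perl Build.PL && ./Build install
#   script_interpreter /path/to/maker/bin/maker
```

If the last command prints the Perlbrew-managed perl rather than the system one, MAKER should run under that perl regardless of what other users on the machine have as their default.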
URL: From nguyenan at mail.nih.gov Wed Jul 16 12:15:10 2014 From: nguyenan at mail.nih.gov (Nguyen, Anh-Dao (NIH/NHGRI) [C]) Date: Wed, 16 Jul 2014 17:15:10 +0000 Subject: [maker-devel] Maker_opts.ctl Message-ID: Hi, I would like to conduct a genome annotation and have the following data: - Two separate RepeatMasker outputs (using -lib and -species options) - ESTs and RACE (fasta) - proteins (fasta) - proteins of related organisms (fasta) - SNAP's .hmm file (ran CEGMA, then used cegma2zff.pl to convert to ZFF format, etc. ) - GeneMark's .hmm file (es.mod file from running gm_es.pl) - FGENESH++ and Augustus gene predictions. I wrote scripts to convert the outputs to .gff3 files. The reason why I ran Augustus gene prediction separately, because the genome has never been trained for Augustus. - Cufflinks and Trinity from RNA-Seq Could you please let me know how can I specify parameters in the maker_opts.ctl file? Or do you have other suggestions to re-do the data listed above? Thanks. Anh-Dao From dence at genetics.utah.edu Wed Jul 16 13:13:46 2014 From: dence at genetics.utah.edu (Daniel Ence) Date: Wed, 16 Jul 2014 12:13:46 -0600 Subject: [maker-devel] Maker_opts.ctl In-Reply-To: References: Message-ID: <5B00AEBB-D242-4C7C-A40F-C1BF3EC48C96@genetics.utah.edu> Hi Anh-Dao, In the maker_opts.ctl file, there are options for est and protein evidence. You?ll put all of your fasta est files together in a command separated list in the ?est" option, and all of your fasta protein files in a command separated list for the ?protein? option. You?ll specify the SNAP and Genemark files in their respective options in the control file and pass the augustus and fgenesh predictions in the ?pred_gff? option. If you have the RepeatMasker output in gff3 format you can give it to maker with the ?rm_gff? option. If you?ve converted the cufflinks output to gff3, you can give it to maker with the ?est_gff? option. 
I?m pretty sure Trinity only gives fasta output, so you would put that in the ?est? option, along with all the other est fasta files. If Augustus isn?t trained for your particular organism, then you can use another organism that augustus is already trained for. The list of species that augustus has parameter files for is in the README.txt that came with Augustus. I really recommend that you run Augustus from inside maker, because then you get all the benefits of maker passing ext-based hints to augustus at runtime, which can really improve Augustus? predictive ability. When you ran the augustus gene prediction separately, did you use another organism?s parameter file? Thanks, Daniel On Jul 16, 2014, at 11:15 AM, Nguyen, Anh-Dao (NIH/NHGRI) [C] wrote: > Hi, > > I would like to conduct a genome annotation and have the following data: > - Two separate RepeatMasker outputs (using -lib and -species options) > - ESTs and RACE (fasta) > - proteins (fasta) > - proteins of related organisms (fasta) > - SNAP's .hmm file (ran CEGMA, then used cegma2zff.pl to convert to ZFF format, etc. ) > - GeneMark's .hmm file (es.mod file from running gm_es.pl) > - FGENESH++ and Augustus gene predictions. I wrote scripts to convert the outputs to .gff3 files. The reason why I ran Augustus gene prediction separately, because the genome has never been trained for Augustus. > - Cufflinks and Trinity from RNA-Seq > > Could you please let me know how can I specify parameters in the maker_opts.ctl file? > Or do you have other suggestions to re-do the data listed above? > > Thanks. 
> Anh-Dao > > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org From nguyenan at mail.nih.gov Wed Jul 16 13:30:10 2014 From: nguyenan at mail.nih.gov (Nguyen, Anh-Dao (NIH/NHGRI) [C]) Date: Wed, 16 Jul 2014 18:30:10 +0000 Subject: [maker-devel] Maker_opts.ctl In-Reply-To: <5B00AEBB-D242-4C7C-A40F-C1BF3EC48C96@genetics.utah.edu> References: <5B00AEBB-D242-4C7C-A40F-C1BF3EC48C96@genetics.utah.edu> Message-ID: Thanks Daniel for your quick response. I did not use the parameter file of other organism when running Augustus. I created the parameter file for the genome following their instructions. There were multiple steps to train and run Augustus (Creating gene structures for training AUGUSTUS with CEGMA => parameter file will be created; Creating Hints for AUGUSTUS from ESTs/cDNA sequences; Incorporating Illumina RNAseq into AUGUSTUS with GSNAP, etc.) As I mentioned the reason why I ran Augustus separately, because Augustus has not trained that genome (no parameter file exists). Otherwise I would run Augustus inside MAKER. You suggested to use rm_gff option to specify RepeatMasker output (sure I will convert them to .gff3 formatted files). Can I submit two RM .gff3 files, separated by comma? Anh-Dao On 7/16/14 2:13 PM, "Daniel Ence" wrote: >Hi Anh-Dao, > >In the maker_opts.ctl file, there are options for est and protein >evidence. You?ll put all of your fasta est files together in a command >separated list in the ?est" option, and all of your fasta protein files >in a command separated list for the ?protein? option. > >You?ll specify the SNAP and Genemark files in their respective options in >the control file and pass the augustus and fgenesh predictions in the >?pred_gff? option. > >If you have the RepeatMasker output in gff3 format you can give it to >maker with the ?rm_gff? option. 
>
>If you've converted the cufflinks output to gff3, you can give it to
>maker with the "est_gff" option. I'm pretty sure Trinity only gives fasta
>output, so you would put that in the "est" option, along with all the
>other est fasta files.
>
>If Augustus isn't trained for your particular organism, then you can use
>another organism that augustus is already trained for. The list of
>species that augustus has parameter files for is in the README.txt that
>came with Augustus. I really recommend that you run Augustus from inside
>maker, because then you get all the benefits of maker passing ext-based
>hints to augustus at runtime, which can really improve Augustus'
>predictive ability.
>
>When you ran the augustus gene prediction separately, did you use another
>organism's parameter file?
>
>Thanks,
>Daniel
>
>
>On Jul 16, 2014, at 11:15 AM, Nguyen, Anh-Dao (NIH/NHGRI) [C]
> wrote:
>
>> Hi,
>>
>> I would like to conduct a genome annotation and have the following data:
>> - Two separate RepeatMasker outputs (using -lib and -species options)
>> - ESTs and RACE (fasta)
>> - proteins (fasta)
>> - proteins of related organisms (fasta)
>> - SNAP's .hmm file (ran CEGMA, then used cegma2zff.pl to convert to ZFF
>>format, etc. )
>> - GeneMark's .hmm file (es.mod file from running gm_es.pl)
>> - FGENESH++ and Augustus gene predictions. I wrote scripts to convert
>>the outputs to .gff3 files. The reason why I ran Augustus gene
>>prediction separately, because the genome has never been trained for
>>Augustus.
>> - Cufflinks and Trinity from RNA-Seq
>>
>> Could you please let me know how can I specify parameters in the
>>maker_opts.ctl file?
>> Or do you have other suggestions to re-do the data listed above?
>>
>> Thanks.
>> Anh-Dao
>>
>> _______________________________________________
>> maker-devel mailing list
>> maker-devel at box290.bluehost.com
>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>

From carsonhh at gmail.com Wed Jul 16 13:36:57 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 16 Jul 2014 12:36:57 -0600
Subject: [maker-devel] Maker_opts.ctl
In-Reply-To:
References: <5B00AEBB-D242-4C7C-A40F-C1BF3EC48C96@genetics.utah.edu>
Message-ID:

When you ran Augustus separately, it should have created the parameters needed to run it. Now you should be able to run it inside of MAKER using the species name you just created.

I'd also recommend letting MAKER run RepeatMasker for you rather than giving it the results as GFF3.

--Carson

On 7/16/14, 12:30 PM, "Nguyen, Anh-Dao (NIH/NHGRI) [C]" wrote:

>Thanks Daniel for your quick response.
>
>I did not use the parameter file of other organism when running Augustus.
>I created the parameter file for the genome following their instructions.
>There were multiple steps to train and run Augustus (Creating gene
>structures for training AUGUSTUS with CEGMA => parameter file will be
>created; Creating Hints for AUGUSTUS from ESTs/cDNA sequences;
>Incorporating Illumina RNAseq into AUGUSTUS with GSNAP, etc.)
>As I mentioned the reason why I ran Augustus separately, because Augustus
>has not trained that genome (no parameter file exists). Otherwise I would
>run Augustus inside MAKER.
>
>You suggested to use rm_gff option to specify RepeatMasker output (sure I
>will convert them to .gff3 formatted files). Can I submit two RM .gff3
>files, separated by comma?
>
>Anh-Dao
>
>
>On 7/16/14 2:13 PM, "Daniel Ence" wrote:
>
>>Hi Anh-Dao,
>>
>>In the maker_opts.ctl file, there are options for est and protein
>>evidence. You'll put all of your fasta est files together in a command
>>separated list in the "est" option, and all of your fasta protein files
>>in a command separated list for the "protein" option.
>>
>>You'll specify the SNAP and Genemark files in their respective options in
>>the control file and pass the augustus and fgenesh predictions in the
>>"pred_gff" option.
>>
>>If you have the RepeatMasker output in gff3 format you can give it to
>>maker with the "rm_gff" option.
>>
>>If you've converted the cufflinks output to gff3, you can give it to
>>maker with the "est_gff" option. I'm pretty sure Trinity only gives fasta
>>output, so you would put that in the "est" option, along with all the
>>other est fasta files.
>>
>>If Augustus isn't trained for your particular organism, then you can use
>>another organism that augustus is already trained for. The list of
>>species that augustus has parameter files for is in the README.txt that
>>came with Augustus. I really recommend that you run Augustus from inside
>>maker, because then you get all the benefits of maker passing ext-based
>>hints to augustus at runtime, which can really improve Augustus'
>>predictive ability.
>>
>>When you ran the augustus gene prediction separately, did you use another
>>organism's parameter file?
>>
>>Thanks,
>>Daniel
>>
>>
>>On Jul 16, 2014, at 11:15 AM, Nguyen, Anh-Dao (NIH/NHGRI) [C]
>> wrote:
>>
>>> Hi,
>>>
>>> I would like to conduct a genome annotation and have the following
>>>data:
>>> - Two separate RepeatMasker outputs (using -lib and -species options)
>>> - ESTs and RACE (fasta)
>>> - proteins (fasta)
>>> - proteins of related organisms (fasta)
>>> - SNAP's .hmm file (ran CEGMA, then used cegma2zff.pl to convert to ZFF
>>>format, etc. )
>>> - GeneMark's .hmm file (es.mod file from running gm_es.pl)
>>> - FGENESH++ and Augustus gene predictions. I wrote scripts to convert
>>>the outputs to .gff3 files. The reason why I ran Augustus gene
>>>prediction separately, because the genome has never been trained for
>>>Augustus.
>>> - Cufflinks and Trinity from RNA-Seq
>>>
>>> Could you please let me know how can I specify parameters in the
>>>maker_opts.ctl file?
>>> Or do you have other suggestions to re-do the data listed above?
>>>
>>> Thanks.
>>> Anh-Dao
>>>
>>> _______________________________________________
>>> maker-devel mailing list
>>> maker-devel at box290.bluehost.com
>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>
>
>
>_______________________________________________
>maker-devel mailing list
>maker-devel at box290.bluehost.com
>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From carsonhh at gmail.com Wed Jul 16 13:41:47 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 16 Jul 2014 12:41:47 -0600
Subject: [maker-devel] Maker_opts.ctl
In-Reply-To:
References: <5B00AEBB-D242-4C7C-A40F-C1BF3EC48C96@genetics.utah.edu>
Message-ID:

If you can provide me the command lines you used to train augustus, I can point you to the proper species parameters to give to MAKER. Normally these are the same as one of the directory names under .../augustus/config/species/.

You can also let MAKER run FGENESH for you. Either way you can pass it in as GFF3, but if you let MAKER run it for you then MAKER can "talk" to the predictor by giving it evidence-based hints as it is running. This improves the overall performance of the algorithm compared to running it outside of MAKER.

Thanks,
Carson

On 7/16/14, 12:36 PM, "Carson Holt" wrote:

>When you ran Augustus separately, it should have created the parameters
>needed to run it. Now you should be able to run it inside of MAKER using
>the species name you just created.
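As a rough illustration of the advice above, the predictor settings might look like this in maker_opts.ctl. This is only a sketch: the species name and the FGENESH parameter path are hypothetical placeholders, not values taken from this thread.

```
#-----Gene Prediction (sketch; values are hypothetical)
augustus_species=my_species #assumed to match a directory name under .../augustus/config/species/
fgenesh_par_file=/path/to/fgenesh_params #hypothetical path; lets MAKER run FGENESH itself
```

With these set, the predictors run inside MAKER, so their externally produced GFF3 output would not also need to be passed through pred_gff.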
>
>I'd also recommend letting MAKER run RepeatMasker for you rather than
>giving it the results as GFF3.
>
>--Carson
>
>
>On 7/16/14, 12:30 PM, "Nguyen, Anh-Dao (NIH/NHGRI) [C]"
> wrote:
>
>>Thanks Daniel for your quick response.
>>
>>I did not use the parameter file of other organism when running Augustus.
>>I created the parameter file for the genome following their instructions.
>>There were multiple steps to train and run Augustus (Creating gene
>>structures for training AUGUSTUS with CEGMA => parameter file will be
>>created; Creating Hints for AUGUSTUS from ESTs/cDNA sequences;
>>Incorporating Illumina RNAseq into AUGUSTUS with GSNAP, etc.)
>>As I mentioned the reason why I ran Augustus separately, because Augustus
>>has not trained that genome (no parameter file exists). Otherwise I would
>>run Augustus inside MAKER.
>>
>>You suggested to use rm_gff option to specify RepeatMasker output (sure I
>>will convert them to .gff3 formatted files). Can I submit two RM .gff3
>>files, separated by comma?
>>
>>Anh-Dao
>>
>>
>>On 7/16/14 2:13 PM, "Daniel Ence" wrote:
>>
>>>Hi Anh-Dao,
>>>
>>>In the maker_opts.ctl file, there are options for est and protein
>>>evidence. You'll put all of your fasta est files together in a command
>>>separated list in the "est" option, and all of your fasta protein files
>>>in a command separated list for the "protein" option.
>>>
>>>You'll specify the SNAP and Genemark files in their respective options
>>>in
>>>the control file and pass the augustus and fgenesh predictions in the
>>>"pred_gff" option.
>>>
>>>If you have the RepeatMasker output in gff3 format you can give it to
>>>maker with the "rm_gff" option.
>>>
>>>If you've converted the cufflinks output to gff3, you can give it to
>>>maker with the "est_gff" option. I'm pretty sure Trinity only gives
>>>fasta
>>>output, so you would put that in the "est" option, along with all the
>>>other est fasta files.
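The mapping of data types to control-file options described in the quoted message could be sketched roughly as follows. All file names here are hypothetical placeholders; only the option names (est, est_gff, protein, snaphmm, gmhmm, pred_gff, rm_gff) come from the thread.

```
#-----EST Evidence
est=ests.fasta,race.fasta,trinity.fasta #comma-separated FASTA files (hypothetical names)
est_gff=cufflinks.gff3 #Cufflinks output converted to GFF3

#-----Protein Homology Evidence
protein=proteins.fasta,related_proteins.fasta #comma-separated FASTA files

#-----Gene Prediction
snaphmm=my_species.hmm #SNAP HMM file
gmhmm=es.mod #GeneMark model file
pred_gff=augustus.gff3,fgenesh.gff3 #predictions made outside MAKER

#-----Repeat Masking
rm_gff=repeats_denovo.gff3,repeats_known.gff3 #pre-computed RepeatMasker results
```

Whether to keep pred_gff and rm_gff at all is a design choice: the replies in this thread recommend letting MAKER run the predictors and RepeatMasker itself instead.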
>>>
>>>If Augustus isn't trained for your particular organism, then you can use
>>>another organism that augustus is already trained for. The list of
>>>species that augustus has parameter files for is in the README.txt that
>>>came with Augustus. I really recommend that you run Augustus from inside
>>>maker, because then you get all the benefits of maker passing ext-based
>>>hints to augustus at runtime, which can really improve Augustus'
>>>predictive ability.
>>>
>>>When you ran the augustus gene prediction separately, did you use
>>>another
>>>organism's parameter file?
>>>
>>>Thanks,
>>>Daniel
>>>
>>>
>>>On Jul 16, 2014, at 11:15 AM, Nguyen, Anh-Dao (NIH/NHGRI) [C]
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I would like to conduct a genome annotation and have the following
>>>>data:
>>>> - Two separate RepeatMasker outputs (using -lib and -species options)
>>>> - ESTs and RACE (fasta)
>>>> - proteins (fasta)
>>>> - proteins of related organisms (fasta)
>>>> - SNAP's .hmm file (ran CEGMA, then used cegma2zff.pl to convert to
>>>>ZFF
>>>>format, etc. )
>>>> - GeneMark's .hmm file (es.mod file from running gm_es.pl)
>>>> - FGENESH++ and Augustus gene predictions. I wrote scripts to convert
>>>>the outputs to .gff3 files. The reason why I ran Augustus gene
>>>>prediction separately, because the genome has never been trained for
>>>>Augustus.
>>>> - Cufflinks and Trinity from RNA-Seq
>>>>
>>>> Could you please let me know how can I specify parameters in the
>>>>maker_opts.ctl file?
>>>> Or do you have other suggestions to re-do the data listed above?
>>>>
>>>> Thanks.
>>>> Anh-Dao
>>>>
>>>> _______________________________________________
>>>> maker-devel mailing list
>>>> maker-devel at box290.bluehost.com
>>>>
>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>>
>>
>>
>>_______________________________________________
>>maker-devel mailing list
>>maker-devel at box290.bluehost.com
>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>
>

From dence at genetics.utah.edu Wed Jul 16 13:42:16 2014
From: dence at genetics.utah.edu (Daniel Ence)
Date: Wed, 16 Jul 2014 12:42:16 -0600
Subject: [maker-devel] Maker_opts.ctl
In-Reply-To:
References: <5B00AEBB-D242-4C7C-A40F-C1BF3EC48C96@genetics.utah.edu>
Message-ID:

Hi Anh-Dao, so as I understand it, the process of training and running augustus will create a set of "param" files that Augustus can use later on. If that's true, then you can just copy those files to the "config/species" folder of your augustus installation and then augustus (when you call it from inside maker) can use those parameters when it runs.

Did you end up with a gff3 file or with files like "exon_prob", "utr_probs" from augustus? Or did you have both?

I'm pretty sure that you can't use a comma-separated list for the rm_gff. You could concatenate the two files and then pass the one file to maker, but you also might need to have it sorted by genomic location. Carson could confirm that for me.

~Daniel

On Jul 16, 2014, at 12:30 PM, Nguyen, Anh-Dao (NIH/NHGRI) [C] wrote:

> Thanks Daniel for your quick response.
>
> I did not use the parameter file of other organism when running Augustus.
> I created the parameter file for the genome following their instructions.
> There were multiple steps to train and run Augustus (Creating gene
> structures for training AUGUSTUS with CEGMA => parameter file will be
> created; Creating Hints for AUGUSTUS from ESTs/cDNA sequences;
> Incorporating Illumina RNAseq into AUGUSTUS with GSNAP, etc.)
> As I mentioned the reason why I ran Augustus separately, because Augustus
> has not trained that genome (no parameter file exists). Otherwise I would
> run Augustus inside MAKER.
>
> You suggested to use rm_gff option to specify RepeatMasker output (sure I
> will convert them to .gff3 formatted files). Can I submit two RM .gff3
> files, separated by comma?
>
> Anh-Dao
>
>
> On 7/16/14 2:13 PM, "Daniel Ence" wrote:
>
>> Hi Anh-Dao,
>>
>> In the maker_opts.ctl file, there are options for est and protein
>> evidence. You'll put all of your fasta est files together in a command
>> separated list in the "est" option, and all of your fasta protein files
>> in a command separated list for the "protein" option.
>>
>> You'll specify the SNAP and Genemark files in their respective options in
>> the control file and pass the augustus and fgenesh predictions in the
>> "pred_gff" option.
>>
>> If you have the RepeatMasker output in gff3 format you can give it to
>> maker with the "rm_gff" option.
>>
>> If you've converted the cufflinks output to gff3, you can give it to
>> maker with the "est_gff" option. I'm pretty sure Trinity only gives fasta
>> output, so you would put that in the "est" option, along with all the
>> other est fasta files.
>>
>> If Augustus isn't trained for your particular organism, then you can use
>> another organism that augustus is already trained for. The list of
>> species that augustus has parameter files for is in the README.txt that
>> came with Augustus. I really recommend that you run Augustus from inside
>> maker, because then you get all the benefits of maker passing ext-based
>> hints to augustus at runtime, which can really improve Augustus'
>> predictive ability.
>>
>> When you ran the augustus gene prediction separately, did you use another
>> organism's parameter file?
>>
>> Thanks,
>> Daniel
>>
>>
>> On Jul 16, 2014, at 11:15 AM, Nguyen, Anh-Dao (NIH/NHGRI) [C]
>> wrote:
>>
>>> Hi,
>>>
>>> I would like to conduct a genome annotation and have the following data:
>>> - Two separate RepeatMasker outputs (using -lib and -species options)
>>> - ESTs and RACE (fasta)
>>> - proteins (fasta)
>>> - proteins of related organisms (fasta)
>>> - SNAP's .hmm file (ran CEGMA, then used cegma2zff.pl to convert to ZFF
>>> format, etc. )
>>> - GeneMark's .hmm file (es.mod file from running gm_es.pl)
>>> - FGENESH++ and Augustus gene predictions. I wrote scripts to convert
>>> the outputs to .gff3 files. The reason why I ran Augustus gene
>>> prediction separately, because the genome has never been trained for
>>> Augustus.
>>> - Cufflinks and Trinity from RNA-Seq
>>>
>>> Could you please let me know how can I specify parameters in the
>>> maker_opts.ctl file?
>>> Or do you have other suggestions to re-do the data listed above?
>>>
>>> Thanks.
>>> Anh-Dao
>>>
>>> _______________________________________________
>>> maker-devel mailing list
>>> maker-devel at box290.bluehost.com
>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>
>

From carsonhh at gmail.com Wed Jul 16 13:43:33 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 16 Jul 2014 12:43:33 -0600
Subject: [maker-devel] Maker_opts.ctl
In-Reply-To:
References: <5B00AEBB-D242-4C7C-A40F-C1BF3EC48C96@genetics.utah.edu>
Message-ID:

You can use comma separated lists.

--Carson

On 7/16/14, 12:42 PM, "Daniel Ence" wrote:

>Hi Anh-Dao, so as I understand it, the process of training and running
>augustus will create a set of "param" file that Augustus can use later
>on. If that's true, then you can just copy those files to the
>"config/species" folder of your augustus installation and then augustus
>(when you call it from inside maker) can use those parameters when it
>runs.
>
>Did you end up with a gff3 file or with files like "exon_prob",
>"utr_probs" from augustus?
Or did you have both?
>
>I'm pretty sure that you can't use a comma-separated list for the rm_gff.
>You could concatenate the two files and then pass the one file to maker,
>but you also might need to have it sorted by genomic location. Carson
>could confirm that for me.
>
>~Daniel
>
>
>On Jul 16, 2014, at 12:30 PM, Nguyen, Anh-Dao (NIH/NHGRI) [C]
> wrote:
>
>> Thanks Daniel for your quick response.
>>
>> I did not use the parameter file of other organism when running
>>Augustus.
>> I created the parameter file for the genome following their
>>instructions.
>> There were multiple steps to train and run Augustus (Creating gene
>> structures for training AUGUSTUS with CEGMA => parameter file will be
>> created; Creating Hints for AUGUSTUS from ESTs/cDNA sequences;
>> Incorporating Illumina RNAseq into AUGUSTUS with GSNAP, etc.)
>> As I mentioned the reason why I ran Augustus separately, because
>>Augustus
>> has not trained that genome (no parameter file exists). Otherwise I
>>would
>> run Augustus inside MAKER.
>>
>> You suggested to use rm_gff option to specify RepeatMasker output (sure
>>I
>> will convert them to .gff3 formatted files). Can I submit two RM .gff3
>> files, separated by comma?
>>
>> Anh-Dao
>>
>>
>> On 7/16/14 2:13 PM, "Daniel Ence" wrote:
>>
>>> Hi Anh-Dao,
>>>
>>> In the maker_opts.ctl file, there are options for est and protein
>>> evidence. You'll put all of your fasta est files together in a command
>>> separated list in the "est" option, and all of your fasta protein files
>>> in a command separated list for the "protein" option.
>>>
>>> You'll specify the SNAP and Genemark files in their respective options
>>>in
>>> the control file and pass the augustus and fgenesh predictions in the
>>> "pred_gff" option.
>>>
>>> If you have the RepeatMasker output in gff3 format you can give it to
>>> maker with the "rm_gff" option.
>>>
>>> If you've converted the cufflinks output to gff3, you can give it to
>>> maker with the "est_gff" option.
I'm pretty sure Trinity only gives
>>>fasta
>>> output, so you would put that in the "est" option, along with all the
>>> other est fasta files.
>>>
>>> If Augustus isn't trained for your particular organism, then you can
>>>use
>>> another organism that augustus is already trained for. The list of
>>> species that augustus has parameter files for is in the README.txt that
>>> came with Augustus. I really recommend that you run Augustus from
>>>inside
>>> maker, because then you get all the benefits of maker passing ext-based
>>> hints to augustus at runtime, which can really improve Augustus'
>>> predictive ability.
>>>
>>> When you ran the augustus gene prediction separately, did you use
>>>another
>>> organism's parameter file?
>>>
>>> Thanks,
>>> Daniel
>>>
>>>
>>> On Jul 16, 2014, at 11:15 AM, Nguyen, Anh-Dao (NIH/NHGRI) [C]
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I would like to conduct a genome annotation and have the following
>>>>data:
>>>> - Two separate RepeatMasker outputs (using -lib and -species options)
>>>> - ESTs and RACE (fasta)
>>>> - proteins (fasta)
>>>> - proteins of related organisms (fasta)
>>>> - SNAP's .hmm file (ran CEGMA, then used cegma2zff.pl to convert to
>>>>ZFF
>>>> format, etc. )
>>>> - GeneMark's .hmm file (es.mod file from running gm_es.pl)
>>>> - FGENESH++ and Augustus gene predictions. I wrote scripts to convert
>>>> the outputs to .gff3 files. The reason why I ran Augustus gene
>>>> prediction separately, because the genome has never been trained for
>>>> Augustus.
>>>> - Cufflinks and Trinity from RNA-Seq
>>>>
>>>> Could you please let me know how can I specify parameters in the
>>>> maker_opts.ctl file?
>>>> Or do you have other suggestions to re-do the data listed above?
>>>>
>>>> Thanks.
>>>> Anh-Dao
>>>>
>>>> _______________________________________________
>>>> maker-devel mailing list
>>>> maker-devel at box290.bluehost.com
>>>>
>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>>
>>
>
>
>_______________________________________________
>maker-devel mailing list
>maker-devel at box290.bluehost.com
>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From nguyenan at mail.nih.gov Wed Jul 16 14:07:45 2014
From: nguyenan at mail.nih.gov (Nguyen, Anh-Dao (NIH/NHGRI) [C])
Date: Wed, 16 Jul 2014 19:07:45 +0000
Subject: [maker-devel] Maker_opts.ctl
In-Reply-To:
References: <5B00AEBB-D242-4C7C-A40F-C1BF3EC48C96@genetics.utah.edu>
Message-ID:

I will run Augustus and FGENESH++ inside of MAKER using the parameter files for Augustus.

I could also run RepeatMasker inside of MAKER. However, I ran RM using two options: -lib (de novo) and -species (known). I got ~45% repeats via the de novo option and ~4% repeats via the known option. As I understand it, RM inside of MAKER uses only the RepBase repeat library and the RepeatRunner protein database.

Anh-Dao

On 7/16/14 2:36 PM, "Carson Holt" wrote:

>When you ran Augustus separately, it should have created the parameters
>needed to run it. Now you should be able to run it inside of MAKER using
>the species name you just created.
>
>I'd also recommend letting MAKER run RepeatMasker for you rather than
>giving it the results as GFF3.
>
>--Carson
>
>
>On 7/16/14, 12:30 PM, "Nguyen, Anh-Dao (NIH/NHGRI) [C]"
> wrote:
>
>>Thanks Daniel for your quick response.
>>
>>I did not use the parameter file of other organism when running Augustus.
>>I created the parameter file for the genome following their instructions.
>>There were multiple steps to train and run Augustus (Creating gene
>>structures for training AUGUSTUS with CEGMA => parameter file will be
>>created; Creating Hints for AUGUSTUS from ESTs/cDNA sequences;
>>Incorporating Illumina RNAseq into AUGUSTUS with GSNAP, etc.)
>>As I mentioned the reason why I ran Augustus separately, because Augustus
>>has not trained that genome (no parameter file exists). Otherwise I would
>>run Augustus inside MAKER.
>>
>>You suggested to use rm_gff option to specify RepeatMasker output (sure I
>>will convert them to .gff3 formatted files). Can I submit two RM .gff3
>>files, separated by comma?
>>
>>Anh-Dao
>>
>>
>>On 7/16/14 2:13 PM, "Daniel Ence" wrote:
>>
>>>Hi Anh-Dao,
>>>
>>>In the maker_opts.ctl file, there are options for est and protein
>>>evidence. You'll put all of your fasta est files together in a command
>>>separated list in the "est" option, and all of your fasta protein files
>>>in a command separated list for the "protein" option.
>>>
>>>You'll specify the SNAP and Genemark files in their respective options
>>>in
>>>the control file and pass the augustus and fgenesh predictions in the
>>>"pred_gff" option.
>>>
>>>If you have the RepeatMasker output in gff3 format you can give it to
>>>maker with the "rm_gff" option.
>>>
>>>If you've converted the cufflinks output to gff3, you can give it to
>>>maker with the "est_gff" option. I'm pretty sure Trinity only gives
>>>fasta
>>>output, so you would put that in the "est" option, along with all the
>>>other est fasta files.
>>>
>>>If Augustus isn't trained for your particular organism, then you can use
>>>another organism that augustus is already trained for. The list of
>>>species that augustus has parameter files for is in the README.txt that
>>>came with Augustus. I really recommend that you run Augustus from inside
>>>maker, because then you get all the benefits of maker passing ext-based
>>>hints to augustus at runtime, which can really improve Augustus'
>>>predictive ability.
>>>
>>>When you ran the augustus gene prediction separately, did you use
>>>another
>>>organism's parameter file?
>>>
>>>Thanks,
>>>Daniel
>>>
>>>
>>>On Jul 16, 2014, at 11:15 AM, Nguyen, Anh-Dao (NIH/NHGRI) [C]
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I would like to conduct a genome annotation and have the following
>>>>data:
>>>> - Two separate RepeatMasker outputs (using -lib and -species options)
>>>> - ESTs and RACE (fasta)
>>>> - proteins (fasta)
>>>> - proteins of related organisms (fasta)
>>>> - SNAP's .hmm file (ran CEGMA, then used cegma2zff.pl to convert to
>>>>ZFF
>>>>format, etc. )
>>>> - GeneMark's .hmm file (es.mod file from running gm_es.pl)
>>>> - FGENESH++ and Augustus gene predictions. I wrote scripts to convert
>>>>the outputs to .gff3 files. The reason why I ran Augustus gene
>>>>prediction separately, because the genome has never been trained for
>>>>Augustus.
>>>> - Cufflinks and Trinity from RNA-Seq
>>>>
>>>> Could you please let me know how can I specify parameters in the
>>>>maker_opts.ctl file?
>>>> Or do you have other suggestions to re-do the data listed above?
>>>>
>>>> Thanks.
>>>> Anh-Dao
>>>>
>>>> _______________________________________________
>>>> maker-devel mailing list
>>>> maker-devel at box290.bluehost.com
>>>>
>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>>
>>
>>
>>_______________________________________________
>>maker-devel mailing list
>>maker-devel at box290.bluehost.com
>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>
>

From nguyenan at mail.nih.gov Wed Jul 16 14:16:43 2014
From: nguyenan at mail.nih.gov (Nguyen, Anh-Dao (NIH/NHGRI) [C])
Date: Wed, 16 Jul 2014 19:16:43 +0000
Subject: [maker-devel] Maker_opts.ctl
In-Reply-To:
References: <5B00AEBB-D242-4C7C-A40F-C1BF3EC48C96@genetics.utah.edu>
Message-ID:

I forgot to mention that I ran RepeatModeler on the genome first, then submitted the RepeatModeler output to RepeatMasker using the -lib option (de novo). For the -species option, I used metazoa.

Anh-Dao

On 7/16/14 3:07 PM, "Nguyen, Anh-Dao (NIH/NHGRI) [C]" wrote:

>I will run Augustus and FGENESH++ inside of MAKER using the parameter
>files for Augustus.
>I could also run RepeatMasker inside of MAKER. However, I ran RM using two
>options: -lib (de novo) and -species (known). I got ~ 45% repeats via de
>novo and ~ 4% repeats via known options. As I understood, RM inside of
>MAKER uses only RepBase repeat library and RepeatRunner protein database.
>
>Anh-Dao
>
>
>On 7/16/14 2:36 PM, "Carson Holt" wrote:
>
>>When you ran Augustus separately, it should have created the parameters
>>needed to run it. Now you should be able to run it inside of MAKER using
>>the species name you just created.
>>
>>I'd also recommend letting MAKER run RepeatMasker for you rather than
>>giving it the results as GFF3.
>>
>>--Carson
>>
>>
>>On 7/16/14, 12:30 PM, "Nguyen, Anh-Dao (NIH/NHGRI) [C]"
>> wrote:
>>
>>>Thanks Daniel for your quick response.
>>>
>>>I did not use the parameter file of other organism when running
>>>Augustus.
>>>I created the parameter file for the genome following their
>>>instructions.
>>>There were multiple steps to train and run Augustus (Creating gene
>>>structures for training AUGUSTUS with CEGMA => parameter file will be
>>>created; Creating Hints for AUGUSTUS from ESTs/cDNA sequences;
>>>Incorporating Illumina RNAseq into AUGUSTUS with GSNAP, etc.)
>>>As I mentioned the reason why I ran Augustus separately, because
>>>Augustus
>>>has not trained that genome (no parameter file exists). Otherwise I
>>>would
>>>run Augustus inside MAKER.
>>>
>>>You suggested to use rm_gff option to specify RepeatMasker output (sure
>>>I
>>>will convert them to .gff3 formatted files). Can I submit two RM .gff3
>>>files, separated by comma?
>>>
>>>Anh-Dao
>>>
>>>
>>>On 7/16/14 2:13 PM, "Daniel Ence" wrote:
>>>
>>>>Hi Anh-Dao,
>>>>
>>>>In the maker_opts.ctl file, there are options for est and protein
>>>>evidence.
>>>>You'll put all of your fasta est files together in a comma
>>>>separated list in the "est" option, and all of your fasta protein files
>>>>in a comma separated list for the "protein" option.
>>>>
>>>>You'll specify the SNAP and Genemark files in their respective options in
>>>>the control file and pass the augustus and fgenesh predictions in the
>>>>"pred_gff" option.
>>>>
>>>>If you have the RepeatMasker output in gff3 format you can give it to
>>>>maker with the "rm_gff" option.
>>>>
>>>>If you've converted the cufflinks output to gff3, you can give it to
>>>>maker with the "est_gff" option. I'm pretty sure Trinity only gives fasta
>>>>output, so you would put that in the "est" option, along with all the
>>>>other est fasta files.
>>>>
>>>>If Augustus isn't trained for your particular organism, then you can use
>>>>another organism that augustus is already trained for. The list of
>>>>species that augustus has parameter files for is in the README.txt that
>>>>came with Augustus. I really recommend that you run Augustus from inside
>>>>maker, because then you get all the benefits of maker passing ext-based
>>>>hints to augustus at runtime, which can really improve Augustus'
>>>>predictive ability.
>>>>
>>>>When you ran the augustus gene prediction separately, did you use another
>>>>organism's parameter file?
>>>>
>>>>Thanks,
>>>>Daniel
>>>>
>>>>
>>>>On Jul 16, 2014, at 11:15 AM, Nguyen, Anh-Dao (NIH/NHGRI) [C] wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I would like to conduct a genome annotation and have the following data:
>>>>> - Two separate RepeatMasker outputs (using -lib and -species options)
>>>>> - ESTs and RACE (fasta)
>>>>> - proteins (fasta)
>>>>> - proteins of related organisms (fasta)
>>>>> - SNAP's .hmm file (ran CEGMA, then used cegma2zff.pl to convert to ZFF format, etc.)
>>>>> - GeneMark's .hmm file (es.mod file from running gm_es.pl)
>>>>> - FGENESH++ and Augustus gene predictions.
>>>>>I wrote scripts to convert the outputs to .gff3 files. The reason why
>>>>>I ran Augustus gene prediction separately, because the genome has never
>>>>>been trained for Augustus.
>>>>> - Cufflinks and Trinity from RNA-Seq
>>>>>
>>>>> Could you please let me know how can I specify parameters in the
>>>>>maker_opts.ctl file?
>>>>> Or do you have other suggestions to re-do the data listed above?
>>>>>
>>>>> Thanks.
>>>>> Anh-Dao

From carsonhh at gmail.com Wed Jul 16 14:17:31 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 16 Jul 2014 13:17:31 -0600
Subject: [maker-devel] Maker_opts.ctl
In-Reply-To:
References: <5B00AEBB-D242-4C7C-A40F-C1BF3EC48C96@genetics.utah.edu>
Message-ID:

No. You can provide both to MAKER. The options are model_org= and rmlib=.
If you let MAKER handle repeat masking, it will differentiate repeat types and use soft masking for some and hard masking for others. This increases the sensitivity of evidence alignments while still maintaining specificity.

--Carson

On 7/16/14, 1:07 PM, "Nguyen, Anh-Dao (NIH/NHGRI) [C]" wrote:

>I will run Augustus and FGENESH++ inside of MAKER using the parameter
>files for Augustus.
>I could also run RepeatMasker inside of MAKER. However, I ran RM using two
>options: -lib (de novo) and -species (known). I got ~ 45% repeats via de
>novo and ~ 4% repeats via known options. As I understood, RM inside of
>MAKER uses only RepBase repeat library and RepeatRunner protein database.
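
The options named in this thread so far can be collected into a maker_opts.ctl fragment. The Python sketch below only renders `key=value` lines with comma-separated file lists, as described above. Every file name and the species name are hypothetical placeholders. The option names `est`, `protein`, `est_gff`, `pred_gff`, `model_org` and `rmlib` come from the messages above; `snaphmm`, `gmhmm` and `augustus_species` are assumed to be the "respective options" Daniel refers to and should be checked against the maker_opts.ctl generated by your MAKER version.

```python
# Sketch only: renders control-file lines for the options discussed in the thread.
# All file names and the species name are hypothetical placeholders.
thread_options = {
    "est": ["ests.fasta", "race.fasta", "trinity.fasta"],
    "protein": ["proteins.fasta", "related_proteins.fasta"],
    "est_gff": ["cufflinks.gff3"],
    "model_org": ["metazoa"],                  # RepBase subset, as in the earlier -species run
    "rmlib": ["repeatmodeler_library.fa"],     # de novo library from RepeatModeler
    "snaphmm": ["my_species.hmm"],             # assumed option name for the SNAP HMM
    "gmhmm": ["es.mod"],                       # assumed option name for the GeneMark model
    "augustus_species": ["my_species"],        # assumed option name; species trained outside MAKER
    "pred_gff": ["fgenesh.gff3"],              # pre-computed predictions passed as GFF3
}


def render_ctl(options):
    """Return one 'key=value' line per option, joining multiple values with commas."""
    return "\n".join(f"{key}={','.join(values)}" for key, values in options.items())


print(render_ctl(thread_options))
```

The printed lines are meant to be compared with, or pasted over, the matching lines of an existing control file; the sketch does not write or validate a complete maker_opts.ctl.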

From nguyenan at mail.nih.gov Wed Jul 16 14:28:33 2014
From: nguyenan at mail.nih.gov (Nguyen, Anh-Dao (NIH/NHGRI) [C])
Date: Wed, 16 Jul 2014 19:28:33 +0000
Subject: [maker-devel] Maker_opts.ctl
In-Reply-To:
References: <5B00AEBB-D242-4C7C-A40F-C1BF3EC48C96@genetics.utah.edu>
Message-ID:

By default, model_org=all. Can I use the de novo repeat library predicted by RepeatModeler for the rmlib option?

Anh-Dao

On 7/16/14 3:17 PM, "Carson Holt" wrote:

>No. You can provide both to MAKER. The options are model_org= and rmlib=.
> By letting MAKER handle repeat masking it will differentiate repeat types
>and use soft masking for some and hard masking for others. This increases
>sensitivity of evidence alignments while still maintaining specificity.
>
>--Carson

From carsonhh at gmail.com Wed Jul 16 14:32:02 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 16 Jul 2014 13:32:02 -0600
Subject: [maker-devel] Maker_opts.ctl
In-Reply-To:
References: <5B00AEBB-D242-4C7C-A40F-C1BF3EC48C96@genetics.utah.edu>
Message-ID:

'all' will use the whole of RepBase, or you can use 'metazoa' as in your previous run. Then provide the RepeatModeler file to rmlib=

--Carson

On 7/16/14, 1:28 PM, "Nguyen, Anh-Dao (NIH/NHGRI) [C]" wrote:

>By default, model_org=all. Can I use the de novo repeat library predicted
>by RepeatModeler for the rmlib option?
>
>Anh-Dao

From nguyenan at mail.nih.gov Thu Jul 17 09:19:34 2014
From: nguyenan at mail.nih.gov (Nguyen, Anh-Dao (NIH/NHGRI) [C])
Date: Thu, 17 Jul 2014 14:19:34 +0000
Subject: [maker-devel] Maker_opts.ctl
In-Reply-To:
References: <5B00AEBB-D242-4C7C-A40F-C1BF3EC48C96@genetics.utah.edu>
Message-ID:

I am not sure which fgenesh executable file I should use.

fgenesh= #location of fgenesh executable

When I run FGENESH++, I need to run the run_pipe.pl script. Surely you also need to specify a list of other executable programs (such as ppd, ppdn+, etc.)

Anh-Dao

On 7/16/14 3:32 PM, "Carson Holt" wrote:

>'all' will use the whole of RepBase, or you can do 'metazoa' like your
>previous run. Then provide the RepeatModeler file to rmlib=
>
>--Carson

From carsonhh at gmail.com Fri Jul 18 12:04:09 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 18 Jul 2014 11:04:09 -0600
Subject: [maker-devel] Maker_opts.ctl
In-Reply-To:
References: <5B00AEBB-D242-4C7C-A40F-C1BF3EC48C96@genetics.utah.edu>
Message-ID:

It should just be 'fgenesh'. If it's not there you can still just give the GFF3.

--Carson

On 7/17/14, 8:19 AM, "Nguyen, Anh-Dao (NIH/NHGRI) [C]" wrote:

>I am not sure which fgenesh executable file should I use.
>
>fgenesh= #location of fgenesh executable
>
>When I run FGENESH++, I need to run the run_pipe.pl script. Sure you need
>to specify a list of other executable programs (such as ppd, ppdn+, etc)
>
>Anh-Dao
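
The advice above amounts to a simple fallback: use an executable literally named `fgenesh` if one is installed, otherwise hand MAKER the already converted predictions as GFF3. A minimal Python sketch of that decision follows. It assumes the executable would be found on PATH, and `fgenesh.gff3` is a hypothetical file name; the `fgenesh=` and `pred_gff=` option names are the ones quoted in this thread.

```python
import shutil


def fgenesh_setting(pred_gff="fgenesh.gff3", exe_name="fgenesh"):
    """Return a control-file line: the executable path if found, else the GFF3 fallback."""
    path = shutil.which(exe_name)  # None when no such executable is on PATH
    if path:
        return f"fgenesh={path}"
    # No plain 'fgenesh' binary available: pass the pre-computed predictions instead.
    return f"pred_gff={pred_gff}"


print(fgenesh_setting())
```

Which control file each line belongs in depends on the MAKER installation, so the output is only a hint about which route is available on a given machine.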
>>>>>>>o >>>>>>>r >>>>>>>g >>>>>> >>>>>> >>>>> >>>> >>>> >>> >> >> > From jp.oeyen at uni-bonn.de Mon Jul 28 07:22:25 2014 From: jp.oeyen at uni-bonn.de (Jan Philip Oeyen) Date: Mon, 28 Jul 2014 14:22:25 +0200 Subject: [maker-devel] Forks.pm error when running maker with dsindex Message-ID: Hi all, we are currently having some unexpected errors when running maker on a genome which is split in several parts. Our cluster admin reported the following error message: Argument "ALRM" isn't numeric in exit at /share/scientific_bin/perlmodu les/lib/site_perl/5.14.2/x86_64-linux-thread-multi/forks.pm line 2188. SIGTERM received SIGTERM received SIGTERM received We were using maker with the '-g' option on a single genome which is split into 20 parts, where 19 parts are equally large and the last contains about 20 sequences more. After that we ran Maker using dsindex to clean up the output. We are currently using maker v2.31 on 4 threads and forks v0.34. If any further info is needed to clarify the problem, please let me know and I will provide as much as possible. Thank you for your help! Best regards, Jan Philip Oeyen ZFMK // ZMB // University of Bonn -------------- next part -------------- An HTML attachment was scrubbed... URL: From mphoeppner at gmail.com Wed Jul 30 05:44:36 2014 From: mphoeppner at gmail.com (=?iso-8859-1?Q?Marc_H=F6ppner?=) Date: Wed, 30 Jul 2014 12:44:36 +0200 Subject: [maker-devel] Maker GFF output with features of 0 length Message-ID: <5C45F418-018B-4ACC-B682-E5659DB7F102@gmail.com> Hi, I?ve - more by accident - found that many of the gene builds I have generated with Maker (2.31.3) contain features with identical start and stop positions. For example: scaffold_2927 maker CDS 13013 13013 . + 1 ID=maker-scaffold_2927-augustus-gene-0.8-mRNA-1:cds;Parent=maker-scaffold_2927-augustus-gene-0.8-mRNA-1 This occurs seemingly randomly for all sorts of feature types and I have only seen this when running Maker on full assemblies. 
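
One way to put a number on this is to scan the merged GFF3 for feature lines whose start and end columns are equal. The following is only a minimal sketch and not part of MAKER: the function name, the sample records, and the shortened IDs are made up for illustration. Note that GFF3 coordinates are 1-based and inclusive, so a feature with identical start and end spans a single base rather than being truly empty.

```python
import io


def find_single_position_features(handle):
    """Yield (line_number, seqid, feature_type, position) for every GFF3
    feature whose start and end columns hold the same coordinate."""
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # an embedded FASTA section follows; no more features
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        columns = line.split("\t")
        if len(columns) != 9:
            continue  # not a well-formed nine-column feature line
        seqid, _source, feature_type, start, end = columns[:5]
        try:
            start_pos, end_pos = int(start), int(end)
        except ValueError:
            continue  # non-numeric coordinates; ignore the line
        if start_pos == end_pos:
            yield line_number, seqid, feature_type, start_pos


# Hypothetical sample: the first feature mirrors the CDS line quoted above
# (with shortened IDs), the second is an ordinary multi-base exon.
SAMPLE_GFF3 = "\n".join([
    "##gff-version 3",
    "scaffold_2927\tmaker\tCDS\t13013\t13013\t.\t+\t1\tID=cds1;Parent=mRNA1",
    "scaffold_2927\tmaker\texon\t12000\t12500\t.\t+\t.\tID=exon1;Parent=mRNA1",
])

hits = list(find_single_position_features(io.StringIO(SAMPLE_GFF3)))
for hit in hits:
    print(hit)
```

For a real run, the same function can be given an open file handle on the merged GFF3 instead of the in-memory sample; counting the hits per feature type would show whether particular types are affected more often than others.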
Before I start turning every stone, any ideas about possible explanations
for this phenomenon? Is this likely some MPI-related communication issue,
or NFS problems with synching data? Maker runs fine on our system, but
that doesn't mean that there aren't any cryptic issues that only on these
occasions rear their head...

Regarding the frequency, out of 450.000 GFF lines, 270 were affected in
the case that I looked into the most. So it is pretty rare, but still...

I am currently using Maker with openmpi-1.7.4 and the file system is
mounted over NFS4 and IPoIB. I now switched to Maker 2.31.6, but have no
strong reason to suspect that this will make a difference.

Regards,

Marc

From marc.hoeppner at bils.se Tue Jul 1 06:31:33 2014
From: marc.hoeppner at bils.se (Marc Höppner)
Date: Tue, 1 Jul 2014 14:31:33 +0200
Subject: [maker-devel] Some questions regarding ab-initio training
In-Reply-To:
References:
<1CD4559D-7A9D-4F8C-92F4-F5228F4E23B8@bils.se>
<7FFAF2D2-3D32-40CF-8120-E6F858F74F1C@bils.se>
Message-ID:

Hi,

sorry for resurrecting this topic. The issue was about the use of
ab-initio predictions and artefacts in the final maker gene builds.

I think one potential issue that hasn't been discussed here concerns
Maker's use of the extrinsic config file when running Augustus. This file
controls the "weights" of different types of hints when running Augustus.
I don't think it is made clear anywhere which extrinsic config file Maker
reads (from the logs it seems to be extrinsic.MPE.cfg). Nor is it
suggested that it would be useful to manipulate this file to improve
augustus performance (and by extension Maker's performance). Finally, I am
not entirely sure which sorts of hints Maker creates for Augustus and
which hint categories these would belong to (i.e. it makes no sense to
tweak the intronpart malus factor if Maker does not create such hints).
Perhaps it would be good to elaborate on that in the Maker documentation,
since it seems to be quite relevant for obtaining better results. Or does
such an explanation already exist somewhere?

/Marc

Marc P. Hoeppner, PhD
Team Leader
BILS Genome Annotation Platform
Department for Medical Biochemistry and Microbiology
Uppsala University, Sweden
marc.hoeppner at bils.se

On 05 Jun 2014, at 20:28, Carson Holt wrote:

> One thing you might want to try is adding another predictor like SNAP
> together with Augustus and then process the MAKER results using EVM. We
> actually have a collaboration with the EVM group to produce a MAKER-EVM
> pipeline (MAKER 3.0). EVM will produce consensus models using the
> predictions and the evidence in the MAKER GFF3 which are generally better
> than just SNAP and Augustus with hints, so it might be able to remove some
> of the artifacts you are worried about.
>
> --Carson
>
>
>
> On 6/5/14, 12:24 PM, "Carson Holt" wrote:
>
>> Like I said.
The predictors do the best they can, so there is probably >> something about the regions to make the CDS, reading frame, or start/stop >> work that requires exons to be dropped or added. In several ant genomes >> we saw something like this caused by incorrect homopolymers in the >> assembly which force the predictor to slightly alter the intron/exon >> structure because otherwise the reading frame made no sense (the EST >> alignments were used to confirmed that the assembly homopolymers were >> incorrect - lots of bad single base pair deletions). >> >> The way hints work is as follows. At the simplest level ab initio >> predictors are calculating the probability of being in different states >> (intergenic, intron, exon, etc.). The hints increase the probability of >> being in the intron state where MAKER gives an intron hint or being in an >> exon/CDS state when MAKER gives an exon/CDS hint. So this bends the >> likelihood of the ab intio gene predictor to call something similar in >> structure to the evidence overlapping it. That being said, if there is >> strong enough signal from something else in the sequence or my hints won't >> work with the splice sites in the region or the reading frame breaks, then >> no amount of hints can force augustus to make that model. >> >> --Carson >> >> >> >> On 6/5/14, 2:15 AM, "Marc H?ppner" wrote: >> >>> Hi, >>> >>> thanks for the feedback. I spent some more time on this and am still >>> somewhat unsatisfied with the whole thing? >>> >>> A few comments: >>> >>> I quite frequently see augustus and in extension Maker including exons >>> that are not supported by EST/Protein evidence and are not critical for >>> the gene model (i.e. I can take them out and still get a proper CDS). 
>>> Maybe I don?t know enough about how Maker creates hints and more >>> importantly what role these hints play for augustus, but I cannot really >>> see a great effect (any, really) on the gene models even if both ESTs and >>> proteins contradict an augustus gene model and the surplus exon is >>> non-essential. >>> >>> (all evidence is provided as fasta files, protein2genome and est2genome >>> are set to 0) >>> >>> As for the repeat library, I suppose this is a critical point. I am using >>> repeats from a closely related species via Repeatmasker, modelled and >>> filtered repeats from RepeatModeler and repeats derived from a >>> high-coverage 454 data set. Not sure what else I can do to improve that. >>> >>> As for evidence, I am using the curated reference proteome from a related >>> species (<5 Mio years), unprot_swissprot and high-quality ESTs + 454 >>> reads. I don?t think it gets a whole lot better, in terms of what data >>> can be used. >>> >>> So in summary, I just don?t get where I want to using Augustus and Maker >>> - specifically, the gene models are full of weird, unsupported artefacts >>> despite manually curating > 850 models for training. I suppose I was >>> hoping for some secret trick to improve on this - but I guess there is >>> none? Actually, if I only do a pure evidence build (seeing that my input >>> data is very high quality), it looks better - which sort of goes against >>> the premise of Maker :/ >>> >>> Regards, >>> >>> Marc >>> >>> >>> >>> >>> Marc P. Hoeppner, PhD >>> Team Leader >>> Department for Medical Biochemistry and Microbiology >>> Uppsala University, Sweden >>> marc.hoeppner at bils.se >>> >>> On 27 May 2014, at 17:25, Carson Holt wrote: >>> >>>> Extra exons can be required for predictors to make sense of a region >>>> (they >>>> do the best they can). This can be due to imperfect assemblies or >>>> repeats. For plants the repeat database is the the one thing that will >>>> most affect the annotation quality. 
You may need to spend some time >>>> building the best repeat library you can. The repeat library is the >>>> next >>>> most important thing next to training the predictor, because they >>>> confuse >>>> the predictor (sometimes a lot) causing it to behave oddly in those >>>> regions (because repeats do encode real protein and protein fragments). >>>> Also when running now with MAKER make sure to include the entire >>>> proteome >>>> of a related species and not just UniProt, and you will get better >>>> performance. Now that you have Augustus trained, using it inside of >>>> MAKER >>>> with an improved repeat library and additional protein evidence should >>>> give it the feedback that will allow it to perform better than it would >>>> with just naked ab initio prediction. >>>> >>>> Thanks, >>>> Carson >>>> >>>> >>>> On 5/27/14, 2:12 AM, "Marc H?ppner" wrote: >>>> >>>>> Hi, >>>>> >>>>> I wanted to get some feedback regarding the training of ab-initio gene >>>>> finders - it?s not strictly Maker related, but I suppose there are >>>>> many >>>>> people on this list that have encountered and solved this issue in one >>>>> way or another. >>>>> >>>>> Specifically, I am trying to train Augustus (and possibly SNAP) for a >>>>> plant genome. This has always been a very frustrating process for me, >>>>> but >>>>> while I have a better idea now how to do it, I still don?t get the >>>>> sort >>>>> of accuracy that I am hoping for. 
A quick run-through of my process; >>>>> >>>>> Evidence build with maker on level 1 and 2 proteins from Uniprot + >>>>> Sanger-sequenced EST data >>>>> >>>>> Filtered for Models with an AED <= 0.3 >>>>> >>>>> Loaded that into WebApollo, together with an existing reference >>>>> annotation and the evidence tracks >>>>> >>>>> Manually curated/selected 750 gene models using the following rules: >>>>> - Must have start/stop codon >>>>> - Most have as many exons as possible >>>>> - Must agree with evidence >>>>> - Must be >= 2kb part from other gene models (provided as flanking >>>>> regions for augustus to train intergenic sequence) >>>>> >>>>> From these models, I created a GBK file, split it into 650 (train) >>>>> and >>>>> 100 (test) models and created a new profile using the documented >>>>> procedure. >>>>> >>>>> But: >>>>> >>>>> While the naked ab-init models created through maker get a lot of >>>>> genes >>>>> ?sort of right?, I still see too many issues to be really satisfied. >>>>> Problems include: >>>>> >>>>> - random exon calls which are not supported by any line of evidence >>>>> (~1 >>>>> per gene model, I would guess) >>>>> - poor congruency with some gene models (especially ones not used for >>>>> training/testing) >>>>> >>>>> Is there any best-practice guide on how to improve this? The Augustus >>>>> website is unfortunately quite poor on detail? My impression so far is >>>>> that ramping up the number of training models isn?t really doing too >>>>> much >>>>> beyond a certain point (tried 400, 500 and 750). >>>>> >>>>> Regards, >>>>> >>>>> Marc >>>>> >>>>> >>>>> Marc P. 
Hoeppner, PhD
>>>>> Team Leader
>>>>> BILS Genome Annotation Platform
>>>>> Department for Medical Biochemistry and Microbiology
>>>>> Uppsala University, Sweden
>>>>> marc.hoeppner at bils.se
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> maker-devel mailing list
>>>>> maker-devel at box290.bluehost.com
>>>>>
>>>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> maker-devel mailing list
>>>> maker-devel at box290.bluehost.com
>>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>>
>>
>>
>
>
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From rajesh.bommareddy at tu-harburg.de Thu Jul 3 08:45:59 2014
From: rajesh.bommareddy at tu-harburg.de (Rajesh Reddy Bommareddy)
Date: Thu, 03 Jul 2014 16:45:59 +0200
Subject: [maker-devel] Maker output
Message-ID: <53B56CA7.80108@tu-harburg.de>

Dear Maker group

I have run the example files provided with maker, but I am unable to
understand the output. Where can I find the information about exons,
CDS, protein sequence of the predicted CDS or mRNA and the predicted
protein name for each contig?

Thanks and Regards
--
M.Sc. Rajesh Reddy Bommareddy
Institute of Bioprocess and Biosystems Engineering
Hamburg University of Technology(TUHH)
Denikestrasse 15, 20171 Hamburg
GERMANY.
e.mail: rajesh.bommareddy at tu-harburg.de
Phone:+4940428784011
Mobile:+4917663673522

From carsonhh at gmail.com Thu Jul 3 08:51:57 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 03 Jul 2014 08:51:57 -0600
Subject: [maker-devel] Maker output
In-Reply-To: <53B56CA7.80108@tu-harburg.de>
References: <53B56CA7.80108@tu-harburg.de>
Message-ID:

See the MAKER 2014 GMOD tutorial -->
http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014

Also watch accompanying video --> http://youtu.be/uA96tSSaqLk

Results will be in GFF3 and FASTA format. The GFF3 file contains the
location of structure relative to the assembly (exon/CDS/UTR). The FASTA
file contains the sequence (transcript/protein). There will be separate
files for each contig. Use gff3_merge and fasta_merge to generate merged
genome wide GFF3 and FASTA files.

An explanation of GFF3 format is here -->
http://www.sequenceontology.org/gff3.shtml

Thanks,
Carson

On 7/3/14, 8:45 AM, "Rajesh Reddy Bommareddy" wrote:

>Dear Maker group
>
>I have run the example files provided with maker. But i am unable to
>understand the output. Where can i find the information about exons,
>CDS, protein sequence of the predicted CDS or mRNA and the predicted
>protein name for each contig?
>
>
>Thanks and Regards
>--
>M.Sc. Rajesh Reddy Bommareddy
>Institute of Bioprocess and Biosystems Engineering
>Hamburg University of Technology(TUHH)
>Denikestrasse 15, 20171 Hamburg
>GERMANY.
>e.mail: rajesh.bommareddy at tu-harburg.de
>Phone:+4940428784011
>Mobile:+4917663673522
>
>_______________________________________________
>maker-devel mailing list
>maker-devel at box290.bluehost.com
>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From dence at genetics.utah.edu Mon Jul 7 08:24:33 2014
From: dence at genetics.utah.edu (Daniel Ence)
Date: Mon, 7 Jul 2014 14:24:33 +0000
Subject: [maker-devel] Help with updating an annotation
In-Reply-To:
References: <2176827D-6E54-4951-B01E-CDAC15DB3A2E@tuebingen.mpg.de>
 <997CF579-F509-4699-A366-30D53AD6281E@genetics.utah.edu>
 <90F65756-74A0-4D2F-A49F-42C5EDDB25E9@tuebingen.mpg.de>
Message-ID: <8219A0C0-DBB0-4417-8B4F-39D6D7F93B93@genetics.utah.edu>

Hi Saad,

I think that's correct. As a sub step for each of the steps you listed, I
would also choose one or two large scaffolds out of your assembly to use
as a test set, and use that test set to make sure that you are getting
the output you'd expect before running MAKER on the whole genome.

Let me know if there's anything else we can do to help.

~Daniel

Daniel Ence
Graduate Student
dence at genetics.utah.edu
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330

On Jul 7, 2014, at 7:08 AM, Saad Arif wrote:

Thanks for this. Would the following protocol be appropriate then, given
I want to augment and merge an existing annotation with any novel genes:

i) Run the MAKER pipeline iteratively to generate an HMM for SNAP using my
new RNAseq data and protein fastas from closely related organisms (with
the est2genome and protein2genome options on).

ii) Turn off the est2genome and protein2genome options and run Maker with
my RNAseq evidence, protein fastas, SNAP HMM and my current annotation as
model_gff.

Thanks in advance for any input.
best,
Saad

On 20 Jun 2014, at 23:42, Carson Holt wrote:

"I have to specify an ab initio gene model for any locus that I wish to
annotate using evidence alignment (i.e. there must be a preexisting
model)?"

Not exactly. You need to supply an HMM for SNAP or species file for
Augustus, etc. MAKER doesn't generate gene predictions, SNAP does. You
cannot get updated models unless you've provided a way for those models
to be updated. MAKER will provide SNAP/Augustus with hints to make them
perform better based on the evidence, but those hints won't even be
generated and the programs won't even run unless you provide the HMM.
Also if you provide models in gff3 format to pred_gff, there is no hint
feedback (because there is no program to receive the hints - just an
immutable GFF3 file).

If you don't have an HMM for SNAP for your organism, you can generate one
using the documentation here (from GMOD 2014 tutorial) -->
http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Training_ab_initio_Gene_Predictors

--Carson

From: Saad Arif
Date: Wednesday, June 18, 2014 at 10:42 AM
To: Daniel Ence
Cc:
Subject: Re: [maker-devel] Help with updating an annotation

Thanks Daniel. I think it's more clear to me now. So if I understand
correctly now: I have to specify an ab initio gene model for any locus
that I wish to annotate using evidence alignment (i.e. there must be a
preexisting model)? These ab initio gene models can be trained internally
in Maker with SNAP using my cufflinks output as EST evidence.
Alternatively, I can provide alternative ab initio predictions (for
regions not present in my ensembl ref passed to model_gff) for regions
overlapping my cufflinks output via the pred_gff option?

Since I'm interested in unannotated regions, I'm also passing in
reference proteomes of closely related species as protein homology
evidence.
As such I should be able to keep only evidence-supported predictions (for
regions not present in my model_gff and/or better supported models for
present regions) from my pred_gff and merge them with Ensembl annotations
from the model_gff?

Let me know if I'm still missing something here. Thanks in advance.

best,
Saad

On 18 Jun 2014, at 17:21, Daniel Ence wrote:

Hi Saad,

Maker doesn't view EST or protein evidence as gene models in themselves.
There's a good reason for this. Aligners like blast don't guarantee
complete gene models, with accurate start and stop codons and splice
sites. With its default settings maker won't make a gene model unless
there's evidence that overlaps an ab-initio prediction (or something from
the pred_gff option). You can use est2genome to promote everything from
the est_gff option to a gene model, but this will probably give you many
spurious results. What you're saying with est2genome is, "Everything that
this tool found is a complete gene model." I don't think that's true even
for cufflinks output.

One of the gene predictors that can run internally is snap. It's really
easy to train; here's a link to a tutorial for training it:
http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Training_ab_initio_Gene_Predictors

Let me know if that helps, or if you have more questions.

~Daniel

Daniel Ence
Graduate Student
dence at genetics.utah.edu
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330

On Jun 18, 2014, at 5:09 AM, Saad Arif wrote:

Thank you for the response. I still have one question though, with these
options:

est_gff=cufflinksout.GFF
model_gff=ensembl reference.GFF

What happens to cufflinks assembled transcripts that are not confined to
current gene loci (i.e. novel genes in the cufflinks output)? Would I have
to prepare ab initio gene predictions for each of these predicted 'new'
genes?
Is there a simple way to combine adding new genes and improving an
existing annotation? Any feedback on this would be greatly appreciated.

saad

On 13 Jun 2014, at 17:59, Carson Holt wrote:

Use the cufflinks features instead of the tophat features (tophat tends
to be really noisy). Give the existing models to model_gff (they will
then always be kept unless something better is found). There is no option
to keep models and then just add isoforms. The model_gff input will
either be kept as is (unchanged), or replaced with an updated model
suggested by the evidence (the updated model may contain multiple
isoforms though), and map_forward=1 can be used to pull names forward
from the old model onto the new models.

Thanks,
Carson

On 6/13/14, 5:03 AM, "Saad Arif" wrote:

Dear All,

I would like to use the Maker pipeline to expand a current annotation
(new isoforms and novel genes with respect to the current annotation) and
was wondering if anyone had experience with this and/or suggestions for
my questions. Briefly: I have tophat splice junctions from RNAseq data or
alternatively cufflinks generated transcript models (FASTA format) that I
want to use as my new data (est_gff or est). I want to provide the
current Ensembl annotation for gene prediction but I want this annotation
to remain unchanged. Hence, I'm not sure if I should provide this
annotation as pred_gff or model_gff. Can the model_gff be used for gene
prediction or is this just a subset of pred_gff that remains unaltered?
Can we provide the same annotation for both options (pred_gff and
model_gff)?

Importantly, my main goal is to use the new RNAseq data to add more
isoforms and (any) novel genes to the existing Ensembl annotation. Any
thoughts or suggestions on how to go about this would be sincerely
appreciated.
Thanks in advance,
saad

_______________________________________________
maker-devel mailing list
maker-devel at box290.bluehost.com
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From n.jue at uconn.edu Mon Jul 7 09:26:05 2014
From: n.jue at uconn.edu (Nathaniel Jue)
Date: Mon, 7 Jul 2014 11:26:05 -0400
Subject: [maker-devel] Couple quick questions about Maker
Message-ID:

Hi,

I'm trying to run maker on a couple of genomes right now and was wondering if folks had any thoughts on ways to speed it up a bit. I'm running it on a 48-processor supercomputer (lots of RAM, usually used for genome assembly). Both of these genomes are a little fragmented, so there are lots of contigs, which slows down the whole process. I am looking for ways to speed things up and was wondering about a couple of things:

1) I'm currently just at the first round of maker predictions, using EST and protein evidence to build models. I had already done RepeatMasking, so I thought I'd just input the resulting GFF to speed it up. Maker didn't seem to like the GFF, so two questions: i) any thoughts on why that GFF wasn't acceptable? It's the one that repeatmasker outputs if you ask it to; and ii) providing this GFF should generally allow the program to bypass the RepeatMasking step, correct? Does it also make it bypass the Repeat ORF searching step?

2) I plan to run both SNAP and Augustus on these genomes as well. The two-step SNAP training from the tutorials seems straightforward, but I was wondering about the Augustus step.
From what I can tell, simply providing an Augustus "trained" species name should turn on Augustus, and blast/blat-like hints generated within Maker are then used in gene prediction. Any thoughts on whether it's either more accurate or faster to do the Augustus predictions outside of the Maker pipeline and then import them using the pred_gff parameter in the maker_opts file?

3) Finally, I noticed that you have a script for converting cegma gff files to zff files for snap training. Currently, I am using predicted transcripts for this species and protein sequences from related species for training. Does anyone have any insight into using CEGMA results as well? Do you work iteratively with them? For instance, start by using hints from more distant taxa (i.e. CEGMA) and then work your way closer? Or just throw everything in at once and retrain after that?

Thanks in advance for any advice and insight.

Cheers,
Nate

Nathaniel Jue, Ph.D.
Department of Molecular and Cell Biology
University of Connecticut
Storrs, CT 06269

From dence at genetics.utah.edu Mon Jul 7 10:00:45 2014
From: dence at genetics.utah.edu (Daniel Ence)
Date: Mon, 7 Jul 2014 16:00:45 +0000
Subject: [maker-devel] Couple quick questions about Maker
In-Reply-To:
References:
Message-ID: <00469BE3-C2F8-45BB-8357-EC47CE0D7376@genetics.utah.edu>

Hi Nathaniel,

1) We'll need to see the error messages that MAKER was giving to understand what might have gone wrong with the RepeatMasker gff3 file. If you could run maker on one of your scaffolds with your current settings and send us the complete output, we can start to figure out what happened.

2) MAKER interacts with its gene predictors (augustus, snap, and the other ones listed in the control files) in a way that improves their performance (with the hints and such).
When you supply predictions through the pred_gff parameter, MAKER can't give that performance improvement, so there's something of a tradeoff there. I think the performance improvement is a key part of MAKER's success, so I would definitely recommend running the ab-initio tools internally.

MAKER tries to save you time by saving results from run to run and only rerunning tools (usually blast tools) that had their parameters changed in the control files. Taking advantage of that will probably be the biggest time saver for you. Something else that could save you almost as much time would be to set a reasonable lower bound on the size of contigs that maker will try to annotate (a cutoff of 5 kbp or 10 kbp is typical, depending on your genome). This is set with the min_contig parameter.

I'll have to check with my lab mates about the Repeat ORF searching and how they use CEGMA results. I think you can probably just put them all in there at once though.

~Daniel

Daniel Ence
Graduate Student
dence at genetics.utah.edu
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330
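To get a feel for what a given min_contig lower bound would cost on a fragmented assembly, a quick survey of contig lengths can help before committing to a value. The following is a minimal Python sketch, not part of MAKER: the function names and the tiny demo FASTA are made up for illustration, and it assumes (as the control file comment suggests) that contigs shorter than the cutoff are the ones skipped.

```python
from typing import Dict, Iterable


def read_fasta_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Return {sequence_id: length} for FASTA text supplied line by line."""
    lengths: Dict[str, int] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The ID is the first whitespace-delimited token after '>'.
            parts = line[1:].split()
            current = parts[0] if parts else ""
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


def summarize_cutoff(lengths: Dict[str, int], min_contig: int) -> Dict[str, float]:
    """Report how many contigs and bases fall below a candidate cutoff."""
    total_bases = sum(lengths.values())
    skipped = [n for n in lengths.values() if n < min_contig]
    return {
        "contigs_total": len(lengths),
        "contigs_skipped": len(skipped),
        "bases_skipped": sum(skipped),
        "fraction_bases_skipped": (sum(skipped) / total_bases) if total_bases else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical toy assembly; a real run would read the genome FASTA instead,
    # e.g. with open("genome.fasta") as handle: read_fasta_lengths(handle)
    demo = [">contig1 demo", "ACGTACGTAC", "ACGT", ">contig2", "ACG", ">contig3", "ACGTACG"]
    lengths = read_fasta_lengths(demo)
    for cutoff in (5, 10):
        print(cutoff, summarize_cutoff(lengths, cutoff))
```

If a cutoff removes many contigs but only a small fraction of the bases, it is probably a cheap way to shorten the run; if it removes a large fraction of the bases, a lower value may be the safer choice.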
From carsonhh at gmail.com Mon Jul 7 10:26:43 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 07 Jul 2014 10:26:43 -0600
Subject: [maker-devel] Couple quick questions about Maker
In-Reply-To: <00469BE3-C2F8-45BB-8357-EC47CE0D7376@genetics.utah.edu>
References: <00469BE3-C2F8-45BB-8357-EC47CE0D7376@genetics.utah.edu>
Message-ID:

I don't think RepeatMasker produces GFF3. I believe it is GFF2 with the -gff option (which is pretty different). Also, if you provide GFF3 files for repeats, you will still need to turn off repeat masking in the control files by blanking out the options. MAKER also uses a step called RepeatRunner, which searches against an internal database of transposable element proteins; it is probably still running (and is slow because it's a search in translated protein space).

For performance, you may want to give a larger max_dna_len for the MAKER run, given that you have a large-RAM machine. Also set all the depth_blast options in maker_bopts.ctl to 15 or 20.

CEGMA is convenient for training predictors because it finds genes that will always be in every eukaryote (i.e. high confidence). You can combine these with est2genome/protein2genome results from MAKER if you want. You can then use the resulting HMM for a larger MAKER run with experimental evidence, and then train again on those results. But beware that there is rarely any benefit from training beyond that second round. More training actually tends to make things worse (the overtraining paradox).

--Carson

From n.jue at uconn.edu Mon Jul 7 11:21:50 2014
From: n.jue at uconn.edu (Nathaniel Jue)
Date: Mon, 7 Jul 2014 13:21:50 -0400
Subject: [maker-devel] Couple quick questions about Maker
In-Reply-To:
References: <00469BE3-C2F8-45BB-8357-EC47CE0D7376@genetics.utah.edu>
Message-ID:

Thanks for the input guys. I'm guessing the error was probably from not turning off the repeat prediction, or from the GFF file. I have a script for converting repeatmasker output to gff, so maybe I'll just try that if I want to follow up on it. Thanks for the tips on parameter adjustments and thoughts on running the program as well.

Just a few more quick follow-up questions for you: is there a preferred method for stopping a job so that it will be able to restart while maximizing the benefits of the run.logs, etc.? Or just Ctrl-C it and start over? It seems like if I adjust those parameter values it may restart from the very beginning, as changing the opts file sometimes does that. Is that to be expected? If so, should I just bite the bullet and restart from the beginning, or is it best to finish a run and then change options?

Thanks,
Nate

Nathaniel Jue, Ph.D.
Department of Molecular and Cell Biology
University of Connecticut
Storrs, CT 06269
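As a rough illustration of the control-file changes discussed in this thread (blanking the repeat masking options when an external repeat GFF3 is supplied, raising max_dna_len, and setting the depth_blast options), the Python sketch below rewrites key=value lines while keeping their trailing comments. The helper name set_ctl_options, the abbreviated sample file contents, the file name repeats.gff3, and the value 300000 are all assumptions for illustration; the option names should be checked against the control files your MAKER version generates.

```python
import re
from typing import Dict


def set_ctl_options(ctl_text: str, updates: Dict[str, str]) -> str:
    """Return control-file text with the given keys set, keeping trailing comments."""
    remaining = dict(updates)
    out = []
    for line in ctl_text.splitlines():
        # Matches lines such as "key=value #comment"; section comments start with '#'.
        m = re.match(r"^(\w+)=([^#]*)(#.*)?$", line)
        if m and m.group(1) in remaining:
            key = m.group(1)
            comment = m.group(3) or ""
            out.append(f"{key}={remaining.pop(key)} {comment}".rstrip())
        else:
            out.append(line)
    if remaining:
        # Refuse to silently ignore a misspelled option name.
        raise KeyError(f"options not found: {sorted(remaining)}")
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Abbreviated, assumed excerpts of maker_opts.ctl and maker_bopts.ctl.
    opts = (
        "#-----Repeat Masking\n"
        "model_org=all #model organism for RepeatMasker\n"
        "rm_gff= #pre-identified repeat elements from an external GFF3 file\n"
        "max_dna_len=100000 #length for dividing up contigs into chunks\n"
    )
    bopts = (
        "depth_blastn=0 #Blastn depth cutoff (0 to disable cutoff)\n"
        "depth_blastx=0 #Blastx depth cutoff (0 to disable cutoff)\n"
        "depth_tblastx=0 #tBlastx depth cutoff (0 to disable cutoff)\n"
    )
    # Blank the RepeatMasker organism, point at an external GFF3, enlarge chunks.
    print(set_ctl_options(opts, {"model_org": "", "rm_gff": "repeats.gff3",
                                 "max_dna_len": "300000"}), end="")
    # Cap BLAST depth at 15 for every search type.
    print(set_ctl_options(bopts, {k: "15" for k in
                                  ("depth_blastn", "depth_blastx", "depth_tblastx")}), end="")
```

Whether to also blank the transposable element protein option used by RepeatRunner is left to the reader, since that step is separate from RepeatMasker and the thread only notes that it keeps running. Bear in mind that changing repeat masking settings invalidates earlier results, so such edits are best made before a long run rather than midway through it.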
URL: From carsonhh at gmail.com Mon Jul 7 11:26:34 2014 From: carsonhh at gmail.com (Carson Holt) Date: Mon, 07 Jul 2014 11:26:34 -0600 Subject: [maker-devel] Couple quick questions about Maker In-Reply-To: References: <00469BE3-C2F8-45BB-8357-EC47CE0D7376@genetics.utah.edu> Message-ID: Just ^C. If you change options, then it will restart at a point determined by what will be affected by the change. Since repeat masking affects everything downstream, everything will start from zero. If it was a step like changing the HMM or altering blastn_depth, then it would be less disruptive and MAKER could reuse all existing raw reports. Unfortunately it's not that way for altering repeat masking options. --Carson From: Nathaniel Jue Date: Monday, July 7, 2014 at 11:21 AM To: Carson Holt Cc: Daniel Ence , "" Subject: Re: [maker-devel] Couple quick questions about Maker Thanks for the input guys. I'm guessing the error was probably from not turning off the repeat prediction or the GFF file. I have a script for converting repeatmasker output to gff so maybe I'll just try that if I want to follow-up on it. Thanks for the tips on parameter adjustments and thoughts on running the program as well. Just a few more quick follow-up questions for you: is there a preferred method for stopping a job so that it will be able to restart while maximizing the benefits of the run.logs, etc.? Or just Crtl-C it and start over? Seems like if I adjust those parameter values it may restart from the very beginning as changing the opts file sometimes does that. It that to be expected? If so, should I just bite the bullet and restart from the beginning or is it best to finish a run and then change options? Thanks, Nate Nathaniel Jue, Ph.D. Department of Molecular and Cell Biology University of Connecticut Storrs, CT 06269 On Mon, Jul 7, 2014 at 12:26 PM, Carson Holt wrote: > I don't think RepeatMasker produces GFF3. I believe it is GFF2 with the -gff > option (which is pretty different). 
Also If you provide GFF# files for > repeats, you will still need to turn of repeat masking in the control files by > blanking out the options. Also MAKER uses a step called RepeatRunner against > an internal transposable element protein databases which is probably still > running (and is slow because it's a search in translated protein space). > > For performance, you may want to give a larger max_dna_len for the MAKER run > given that you have a large RAM machine. Also set all the depth_blast in > maker_bopts.ctl to 15 or 20. > > CEGMA is convenient for training predictors because it finds genes that will > always be in every eukaryote (I.e. high confidence). You can combine these > with est2genome/protein2genome results from MAKER if you want. You can then > use the resulting HMM for a larger MAKER run with experimental evidence, and > then train again on those results. But beware than there is rarely any > benefit from training beyond that second round. More training actually tends > to makes things worse (the overtraining paradox). > > --Carson > > > > From: Daniel Ence > Date: Monday, July 7, 2014 at 10:00 AM > To: Nathaniel Jue > Cc: "" > Subject: Re: [maker-devel] Couple quick questions about Maker > > Hi Nathaniel, > > 1) We'll need to see the error messages that MAKER was giving to understand > what might have gone wrong with the Repeat Masker gff3 file. If you could run > maker on one of your scaffolds with your current settings and send us the > complete output, we can start to figure out what happened. > > 2) MAKER interacts with its gene predictors (augustus, snap, and the other > ones listed in the control files) in a way that improves their performance > (with the hints and such). When you supply predictions through the pred_gff > parameter, MAKER can't give that performance improvement, so there's something > of a tradeoff there. 
I think the performance improvement is a key part of > MAKER's success, so I would definitely recommend running the ab-initio tools > internally. > > MAKER tries to save you time by saving results from run to run and only > rerunning tools (usually blast tools) that had their parameters changed in the > control files. Taking advantage of that will probably be the biggest time > saver for you. Something else that could save you almost as much time would be > to set a reasonable lower-bound on the size of contigs that maker will try to > annotate (usually <5kbp or <10kbp depending on your genome). This parameter is > set with the min_contig parameter. > > I'll have to check with my lab mates about the Repeat ORF searching and how > they use CEGMA results. I think you can probably just put them all in there at > once though. > > ~Daniel > > > > > Daniel Ence > Graduate Student > dence at genetics.utah.edu > Eccles Institute of Human Genetics > University of Utah > 15 North 2030 East, Room 2100 > Salt Lake City, UT 84112-5330 > > On Jul 7, 2014, at 9:26 AM, Nathaniel Jue > wrote: > >> Hi, >> >> I'm trying to run maker on a couple genomes right now and was wondering if >> folks had any thoughts on way to speed it up a bit. I'm running it on a >> 48-processor supercomputer (lots of RAM, usually use it for genome assembly). >> Both these genomes are a little fragmented, so there are lots of contigs, >> which slows down the whole process. I am looking for ways to speed things up >> and was wondering about a couple things: >> >> 1) I'm currently just at the first round of maker predictions using EST and >> protein evidence to build models. Had already done RepeatMasking so thought >> I'd just input subsequent GFF to speed it up. Didn't seem to like the GFF, so >> two questions: i) any thoughts on why that GFF wasn't acceptable? 
It's the >> one that repeatmasker outputs if you ask it to; and ii) Providing this GFF, >> should generally allow the program to bypass the RepeatMasking step, correct? >> Does it also make it bypass the Repeat ORF searching step? >> >> 2) I plan to run both SNAP and Augustus on these genomes as well. The >> two-step SNAP training from the tutorials seems straightforward, but I was >> wondering about the Augustus step. From what I can tell, simply providing an >> Augustus "trained" species name should turn on Augustus and blast/blat-like >> hints generated within Maker are then used in gene prediction. Any thoughts >> on if it's either more accurate or faster to do the Augustus predictions >> outside of the Maker pipeline and then import them using the pred_gff >> parameter in the maker_opts file? >> >> 3) Finally, I noticed that you had a script for converting cegma gff files to >> zff file for snap training? Currently, I am using predicted transcript for >> this species and protein sequences from related species to training. Does >> anyone have any insight into using CEGMA results as well? Do you work >> iteratively with them? For instance, start with the using hints from more >> distant taxa (i.e. CEGMA) and then work your way closer? Just throw >> everything in at once and retrain after that? >> >> Thanks in advance for any advice and insight. >> >> Cheers, >> Nate >> >> >> Nathaniel Jue, Ph.D. 
>> Department of Molecular and Cell Biology >> University of Connecticut >> Storrs, CT 06269 >> >> >> > niel-jue%2F1%2F531%2F176%2F&sn=> >> _______________________________________________ >> maker-devel mailing list >> maker-devel at box290.bluehost.com >> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > _______________________________________________ maker-devel mailing list > maker-devel at box290.bluehost.comhttp://box290.bluehost.com/mailman/listinfo/mak > er-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From n.jue at uconn.edu Tue Jul 8 09:56:37 2014 From: n.jue at uconn.edu (Nathaniel Jue) Date: Tue, 8 Jul 2014 11:56:37 -0400 Subject: [maker-devel] Couple quick questions about Maker In-Reply-To: References: <00469BE3-C2F8-45BB-8357-EC47CE0D7376@genetics.utah.edu> Message-ID: Carson, one more question: Any suggestions on how to combine the cegma and maker est2genome/protein2genome results? Can I just concatenate and sort the gff files or are there specific formating issues I need to consider? No overlapping regions or something like that? Thanks, Nate *Nathaniel Jue, Ph.D.* Department of Molecular and Cell Biology University of Connecticut Storrs, CT 06269 [image: LinkedIn] On Mon, Jul 7, 2014 at 12:26 PM, Carson Holt wrote: > I don't think RepeatMasker produces GFF3. I believe it is GFF2 with the > -gff option (which is pretty different). Also If you provide GFF# files for > repeats, you will still need to turn of repeat masking in the control files > by blanking out the options. Also MAKER uses a step called RepeatRunner > against an internal transposable element protein databases which is > probably still running (and is slow because it's a search in translated > protein space). > > For performance, you may want to give a larger max_dna_len for the MAKER > run given that you have a large RAM machine. Also set all the depth_blast > in maker_bopts.ctl to 15 or 20. 
>
> CEGMA is convenient for training predictors because it finds genes that
> will always be in every eukaryote (i.e. high confidence). You can combine
> these with est2genome/protein2genome results from MAKER if you want. You
> can then use the resulting HMM for a larger MAKER run with experimental
> evidence, and then train again on those results. But beware that there is
> rarely any benefit from training beyond that second round. More training
> actually tends to make things worse (the overtraining paradox).
>
> --Carson
>
> From: Daniel Ence
> Date: Monday, July 7, 2014 at 10:00 AM
> To: Nathaniel Jue
> Cc: ""
> Subject: Re: [maker-devel] Couple quick questions about Maker
>
> Hi Nathaniel,
>
> 1) We'll need to see the error messages that MAKER was giving to
> understand what might have gone wrong with the RepeatMasker GFF3 file. If
> you could run MAKER on one of your scaffolds with your current settings and
> send us the complete output, we can start to figure out what happened.
>
> 2) MAKER interacts with its gene predictors (Augustus, SNAP, and the other
> ones listed in the control files) in a way that improves their performance
> (with the hints and such). When you supply predictions through the pred_gff
> parameter, MAKER can't give that performance improvement, so there's
> something of a tradeoff there. I think the performance improvement is a key
> part of MAKER's success, so I would definitely recommend running the
> ab initio tools internally.
>
> MAKER tries to save you time by saving results from run to run and only
> rerunning tools (usually BLAST tools) that had their parameters changed in
> the control files. Taking advantage of that will probably be the biggest
> time saver for you. Something else that could save you almost as much time
> would be to set a reasonable lower bound on the size of contigs that MAKER
> will try to annotate (usually <5kbp or <10kbp depending on your genome).
> This lower bound is set with the min_contig parameter.
>
> I'll have to check with my lab mates about the Repeat ORF searching and
> how they use CEGMA results. I think you can probably just put them all in
> there at once though.
>
> ~Daniel
>
> Daniel Ence
> Graduate Student
> dence at genetics.utah.edu
> Eccles Institute of Human Genetics
> University of Utah
> 15 North 2030 East, Room 2100
> Salt Lake City, UT 84112-5330
>
> On Jul 7, 2014, at 9:26 AM, Nathaniel Jue wrote:
>
> Hi,
>
> I'm trying to run MAKER on a couple of genomes right now and was wondering if
> folks had any thoughts on ways to speed it up a bit. I'm running it on a
> 48-processor supercomputer (lots of RAM, usually use it for genome
> assembly). Both these genomes are a little fragmented, so there are lots of
> contigs, which slows down the whole process. I am looking for ways to speed
> things up and was wondering about a couple of things:
>
> 1) I'm currently just at the first round of MAKER predictions using EST
> and protein evidence to build models. Had already done RepeatMasking so
> thought I'd just input the resulting GFF to speed it up. It didn't seem to
> like the GFF, so two questions: i) any thoughts on why that GFF wasn't
> acceptable? It's the one that RepeatMasker outputs if you ask it to; and
> ii) providing this GFF should generally allow the program to bypass the
> RepeatMasking step, correct? Does it also make it bypass the Repeat ORF
> searching step?
>
> 2) I plan to run both SNAP and Augustus on these genomes as well. The
> two-step SNAP training from the tutorials seems straightforward, but I was
> wondering about the Augustus step. From what I can tell, simply providing
> an Augustus "trained" species name should turn on Augustus, and
> blast/blat-like hints generated within MAKER are then used in gene
> prediction. Any thoughts on whether it's either more accurate or faster to
> do the Augustus predictions outside of the MAKER pipeline and then import
> them using the pred_gff parameter in the maker_opts file?
>
> 3) Finally, I noticed that you had a script for converting CEGMA GFF files
> to ZFF files for SNAP training? Currently, I am using predicted transcripts
> for this species and protein sequences from related species for training.
> Does anyone have any insight into using CEGMA results as well? Do you work
> iteratively with them? For instance, start with using hints from more
> distant taxa (i.e. CEGMA) and then work your way closer? Or just throw
> everything in at once and retrain after that?
>
> Thanks in advance for any advice and insight.
>
> Cheers,
> Nate
>
> Nathaniel Jue, Ph.D.
> Department of Molecular and Cell Biology
> University of Connecticut
> Storrs, CT 06269
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From carsonhh at gmail.com Tue Jul 8 10:31:40 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 08 Jul 2014 10:31:40 -0600
Subject: [maker-devel] Couple quick questions about Maker
In-Reply-To:
References: <00469BE3-C2F8-45BB-8357-EC47CE0D7376@genetics.utah.edu>
Message-ID:

Convert them both to ZFF, then concatenate the ZFF files and the sequence files.
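
The concatenation step described above can be sketched in a few lines of Python. This is only an illustration, not part of MAKER or SNAP: the function names and the file names in the usage note are hypothetical, and it assumes the CEGMA and MAKER results have already been converted to ZFF, each with a matching sequence file. The one detail it guards against is a file that lacks a trailing newline, which would otherwise glue its last record onto the next file's first header.

```python
from pathlib import Path
from typing import Iterable, Sequence


def concatenate_text(chunks: Iterable[str]) -> str:
    """Join file contents, ensuring every non-empty chunk ends with a newline
    so the last record of one file never runs into the next file's header."""
    parts = []
    for chunk in chunks:
        if not chunk:
            continue  # skip empty files
        parts.append(chunk if chunk.endswith("\n") else chunk + "\n")
    return "".join(parts)


def merge_training_sets(ann_paths: Sequence[str], dna_paths: Sequence[str],
                        out_ann: str, out_dna: str) -> None:
    """Concatenate ZFF annotation files and their sequence files in the SAME
    order, so annotations and sequences stay paired in the merged output."""
    if len(ann_paths) != len(dna_paths):
        raise ValueError("need exactly one sequence file per ZFF file")
    Path(out_ann).write_text(
        concatenate_text(Path(p).read_text() for p in ann_paths))
    Path(out_dna).write_text(
        concatenate_text(Path(p).read_text() for p in dna_paths))
```

A call might look like `merge_training_sets(["cegma.ann", "maker.ann"], ["cegma.dna", "maker.dna"], "combined.ann", "combined.dna")`, where all six file names are placeholders. Whether SNAP is happy with the same contig appearing in both input sets is not addressed in this thread, so that is worth checking on a small test case first.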

--Carson

From: Nathaniel Jue
Date: Tuesday, July 8, 2014 at 9:56 AM
To: Carson Holt
Cc: Daniel Ence , ""
Subject: Re: [maker-devel] Couple quick questions about Maker

Carson, one more question: Any suggestions on how to combine the CEGMA and
MAKER est2genome/protein2genome results? Can I just concatenate and sort the
GFF files, or are there specific formatting issues I need to consider? No
overlapping regions or something like that?

Thanks,
Nate

Nathaniel Jue, Ph.D.
Department of Molecular and Cell Biology
University of Connecticut
Storrs, CT 06269

On Mon, Jul 7, 2014 at 12:26 PM, Carson Holt wrote:
> I don't think RepeatMasker produces GFF3. I believe it is GFF2 with the -gff
> option (which is pretty different). Also, if you provide GFF3 files for
> repeats, you will still need to turn off repeat masking in the control files
> by blanking out the options. Also, MAKER uses a step called RepeatRunner
> against an internal transposable element protein database, which is probably
> still running (and is slow because it's a search in translated protein space).
>
> For performance, you may want to give a larger max_dna_len for the MAKER run
> given that you have a large RAM machine. Also set all the depth_blast*
> options in maker_bopts.ctl to 15 or 20.

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From saad.arif at tuebingen.mpg.de Mon Jul 7 07:08:53 2014
From: saad.arif at tuebingen.mpg.de (Saad Arif)
Date: Mon, 7 Jul 2014 15:08:53 +0200
Subject: [maker-devel] Help with updating an annotation
In-Reply-To:
References: <2176827D-6E54-4951-B01E-CDAC15DB3A2E@tuebingen.mpg.de> <997CF579-F509-4699-A366-30D53AD6281E@genetics.utah.edu> <90F65756-74A0-4D2F-A49F-42C5EDDB25E9@tuebingen.mpg.de>
Message-ID:

Thanks for this.

Would the following protocol be appropriate then, given I want to augment and merge an existing annotation with any novel genes:

i) Run the MAKER pipeline iteratively to generate an HMM for SNAP using my new RNA-seq data and protein FASTA files from closely related organisms (with the est2genome and protein2genome options on).

ii) Turn off the est2genome and protein2genome options and run MAKER with my RNA-seq evidence, protein FASTA files, SNAP HMM and my current annotation as model_gff.
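
To make the two proposed steps concrete, here is a small Python sketch that writes out the maker_opts.ctl settings each round would change. It only restates the proposal above; it is not a confirmed recipe, all file names are placeholders, and a real maker_opts.ctl contains many more options that are left at their defaults here.

```python
def render_ctl(settings: dict) -> str:
    """Render settings as key=value lines in the style of maker_opts.ctl."""
    return "\n".join(f"{key}={value}" for key, value in settings.items())


# Round i: promote evidence alignments directly to gene models,
# which are then used to train SNAP.
training_round = {
    "est": "rnaseq_transcripts.fasta",   # placeholder file name
    "protein": "related_species.fasta",  # placeholder file name
    "est2genome": 1,
    "protein2genome": 1,
}

# Round ii: run SNAP with the trained HMM, keep the existing annotation
# as model_gff, and switch the direct evidence-to-model options off.
update_round = {
    "est": "rnaseq_transcripts.fasta",       # placeholder file name
    "protein": "related_species.fasta",      # placeholder file name
    "snaphmm": "my_species.hmm",             # placeholder: HMM from round i
    "model_gff": "current_annotation.gff3",  # placeholder: existing annotation
    "est2genome": 0,
    "protein2genome": 0,
}

print("# round i\n" + render_ctl(training_round))
print("# round ii\n" + render_ctl(update_round))
```

The printed lines are meant to be compared against, or copied into, the corresponding entries of an existing control file rather than used as a complete file.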

Thanks in advance for any input.

best,
Saad

On 20 Jun 2014, at 23:42, Carson Holt wrote:

> "I have to specify an ab initio gene model for any locus that I wish to annotate using evidence alignment (i.e. there must be a preexisting model)?"
>
> Not exactly. You need to supply an HMM for SNAP or a species file for Augustus, etc. MAKER doesn't generate gene predictions, SNAP does. You cannot get updated models unless you've provided a way for those models to be updated. MAKER will provide SNAP/Augustus with hints to make them perform better based on the evidence, but those hints won't even be generated and the programs won't even run unless you provide the HMM. Also, if you provide models in GFF3 format to pred_gff, there is no hint feedback (because there is no program to receive the hints, just an immutable GFF3 file).
>
> If you don't have an HMM for SNAP for your organism, you can generate one using the documentation here (from the GMOD 2014 tutorial) --> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Training_ab_initio_Gene_Predictors
>
> --Carson
>
> From: Saad Arif
> Date: Wednesday, June 18, 2014 at 10:42 AM
> To: Daniel Ence
> Cc: ""
> Subject: Re: [maker-devel] Help with updating an annotation
>
> Thanks Daniel. I think it's clearer to me now.
>
> So if I understand correctly now: I have to specify an ab initio gene model for any locus that I wish to annotate using evidence alignment (i.e. there must be a preexisting model)? These ab initio gene models can be trained internally in MAKER with SNAP using my Cufflinks output as EST evidence. Alternatively, I can provide alternative ab initio predictions (for regions not present in my Ensembl reference passed to model_gff) for regions overlapping my Cufflinks output via the pred_gff option?
>
> Since I'm interested in unannotated regions, I'm also passing in reference proteomes of closely related species as protein homology evidence.
>
> As such, I should be able to keep only evidence-supported predictions from my pred_gff (for regions not present in my model_gff, and/or better-supported models for regions that are present) and merge them with the Ensembl annotations from the model_gff?
>
> Let me know if I'm still missing something here.
>
> Thanks in advance.
>
> best,
> Saad
>
> On 18 Jun 2014, at 17:21, Daniel Ence wrote:
>
>> Hi Saad,
>>
>> MAKER doesn't view EST or protein evidence as gene models in themselves. There's a good reason for this. Aligners like BLAST don't guarantee complete gene models, with accurate start and stop codons and splice sites. With its default settings MAKER won't make a gene model unless there's evidence that overlaps an ab initio prediction (or something from the pred_gff option).
>>
>> You can use est2genome to promote everything from the est_gff option to a gene model, but this will probably give you many spurious results. What you're saying with est2genome is, "Everything that this tool found is a complete gene model." I don't think that's true even for Cufflinks output.
>>
>> One of the gene predictors that can run internally is SNAP. It's really easy to train; here's a link to a tutorial for training it: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Training_ab_initio_Gene_Predictors
>>
>> Let me know if that helps, or if you have more questions.
>>
>> ~Daniel
>>
>> Daniel Ence
>> Graduate Student
>> dence at genetics.utah.edu
>> Eccles Institute of Human Genetics
>> University of Utah
>> 15 North 2030 East, Room 2100
>> Salt Lake City, UT 84112-5330
>>
>> On Jun 18, 2014, at 5:09 AM, Saad Arif wrote:
>>
>>> Thank you for the response. I still have one question though, with these options:
>>>
>>> est_gff=cufflinksout.GFF
>>>
>>> model_gff=ensembl reference.GFF
>>>
>>> What happens to Cufflinks-assembled transcripts that are not confined to current gene loci (i.e. novel genes in the Cufflinks output)? Would I have to prepare ab initio gene predictions for each of these predicted 'new' genes?
>>> Is there a simple way to combine adding new genes with improving an existing annotation?
>>>
>>> Any feedback on this would be greatly appreciated.
>>>
>>> saad
>>>
>>> On 13 Jun 2014, at 17:59, Carson Holt wrote:
>>>
>>>> Use the Cufflinks features instead of the TopHat features (TopHat tends
>>>> to be really noisy). Give the existing models to model_gff (they will then
>>>> always be kept unless something better is found). There is no option to
>>>> keep models and then just add isoforms. The model_gff input will either
>>>> be kept as is (unchanged), or replaced with an updated model suggested by
>>>> the evidence (the updated model may contain multiple isoforms though), and
>>>> map_forward=1 can be used to pull names forward from the old model onto
>>>> the new models.
>>>>
>>>> Thanks,
>>>> Carson
>>>>
>>>> On 6/13/14, 5:03 AM, "Saad Arif" wrote:
>>>>
>>>>> Dear All,
>>>>>
>>>>> I would like to use the MAKER pipeline to expand a current annotation (new
>>>>> isoforms and novel genes with respect to the current annotation) and was
>>>>> wondering if anyone had experience with this and/or suggestions for my
>>>>> questions.
>>>>>
>>>>> Briefly:
>>>>>
>>>>> I have TopHat splice junctions from RNA-seq data or, alternatively,
>>>>> Cufflinks-generated transcript models (FASTA format) that I want to use
>>>>> as my new data (est_gff or est).
>>>>>
>>>>> I want to provide the current Ensembl annotation for gene prediction but
>>>>> I want this annotation to remain unchanged. Hence, I'm not sure if I
>>>>> should provide this annotation as pred_gff or model_gff. Can the
>>>>> model_gff be used for gene prediction, or is this just a subset of
>>>>> pred_gff that remains unaltered? Can we provide the same annotation for
>>>>> both options (pred_gff and model_gff)?
>>>>>
>>>>> Importantly, my main goal is to use the new RNA-seq data to add more
>>>>> isoforms and (any) novel genes to the existing Ensembl annotation. Any
>>>>> thoughts or suggestions on how to go about this would be sincerely
>>>>> appreciated.
>>>>>
>>>>> Thanks in advance,
>>>>> saad

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From carsonhh at gmail.com Thu Jul 10 15:38:52 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 10 Jul 2014 15:38:52 -0600
Subject: [maker-devel] Couple quick questions about Maker
In-Reply-To:
References: <00469BE3-C2F8-45BB-8357-EC47CE0D7376@genetics.utah.edu>
Message-ID:

Just rerun in the same directory and it will reuse the old reports, so it won't have to rerun RepeatMasker etc.

--Carson

From: Nathaniel Jue
Date: Thursday, July 10, 2014 at 3:36 PM
To: Carson Holt
Subject: Re: [maker-devel] Couple quick questions about Maker

Is there a way to bypass the repeat prediction after it's done the first time? It seems that when I went to redo the SNAP training, it decided to redo the repeat masking as well. Does it always do that? Maybe I'm misinterpreting something? If there is a way to give it a MAKER-generated GFF to bypass that step, could you tell me where to find it?

Thanks,
Nate

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From carsonhh at gmail.com Thu Jul 10 15:44:48 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 10 Jul 2014 15:44:48 -0600
Subject: [maker-devel] Couple quick questions about Maker
In-Reply-To:
References: <00469BE3-C2F8-45BB-8357-EC47CE0D7376@genetics.utah.edu>
Message-ID:

Also, you can use the repeat GFF option (rm_gff) in the control files, but I prefer just to rerun in the same directory as the previous job.

--Carson

-------------- next part --------------
An HTML attachment was scrubbed...
URL: From fbarreto at ucsd.edu Thu Jul 10 16:02:57 2014 From: fbarreto at ucsd.edu (Felipe Barreto) Date: Thu, 10 Jul 2014 15:02:57 -0700 Subject: [maker-devel] Problem installing exonerate during Maker setup Message-ID: Hi experts, I am trying to install Maker in a new machine (running Mac OS 10.7.5), and have succeed so far except for the "./Build exonerate" step, which gives me the following error: checking for socklen_t... yes checking for pkg-config... no ERROR: Could not find pkg-config ... is glib-2 installed ??? Fink for 64-bit is installed, and via 'fink list', I confimed that glib2-dev and -shlibs are installed. I unistalled and re-installed both fink and glib2 several times, hoping it was a configuration problem, but still get stuck at this step. I found a few previous questions about this issue in this forum, but the solutions Carson provided were directed for OS 10.6 only, apparently, so I did not try these. I have run into the limit of what I know how to do with these compilations. I tried setting up Exonerate directly but it has trouble finding glib as well. Any suggestions? Thank you so much! -- Felipe Barreto Post-doctoral Scholar Scripps Institution of Oceanography University of California, San Diego La Jolla, CA 92093 -------------- next part -------------- An HTML attachment was scrubbed... URL: From fbarreto at ucsd.edu Thu Jul 10 17:41:59 2014 From: fbarreto at ucsd.edu (Felipe Barreto) Date: Thu, 10 Jul 2014 16:41:59 -0700 Subject: [maker-devel] Problem installing exonerate during Maker setup In-Reply-To: References: Message-ID: OK, before anyone spends too much of their time trying to help me... I think I was able to solve my issue above. What I did was to install an additional glib2-related package using fink install. I installed glibmm2.4-dev, which also installs glibmm2.4-shlib. These make up a C++ interface for the glib2 library, according to their description. 
Once I installed those packages, I re-ran ./Build exonerate and it seemed to work. I tried an exonerate command in Terminal and it was recognized OK.

Hopefully what I did won't cause any issues down the line.

Thanks.

--
Felipe Barreto
Post-doctoral Scholar
Scripps Institution of Oceanography
University of California, San Diego
La Jolla, CA 92093

From panos.ioannidis at gmail.com Fri Jul 11 05:56:03 2014
From: panos.ioannidis at gmail.com (Panos Ioannidis)
Date: Fri, 11 Jul 2014 13:56:03 +0200
Subject: [maker-devel] (no subject)
Message-ID:

I got back to my annotations this past week and have a couple of questions!
1) Since my organism isn't closely related to any other that's already sequenced, I will have to run maker twice (according to the tutorial). For the first run, I see that some people use only the ESTs and others use ESTs and a protein database (CEGMA, Uniref50, Swiss-Prot, etc). I guess that the ESTs will give better models, but for genes that aren't covered by an EST, it's okay to have a protein database to detect them as well. Am I right? What do you think?

2) In case I use both ESTs and a protein database, how should I set the est2genome and protein2genome parameters in the maker_opts.ctl file? Should they both be set to "1"?

3) I've been thinking of running the BLAST searches separately and giving Maker the results directly. I guess that in this case I'll have to first convert the BLAST output to a gff3 file and give it to the protein_gff parameter, right?

Thanks,
Panos

From dence at genetics.utah.edu Fri Jul 11 08:08:43 2014
From: dence at genetics.utah.edu (Daniel Ence)
Date: Fri, 11 Jul 2014 14:08:43 +0000
Subject: [maker-devel] (no subject)
In-Reply-To:
References:
Message-ID:

Hi Panos,

1) You'll only use est2genome and protein2genome for creating models that will be used for training the ab-initio predictors (like SNAP). Sometimes that means one run of MAKER for training; sometimes that means two runs of MAKER. You usually don't gain any accuracy after the second round of training. It's ok to use both EST and protein data for this training step.

2) If you're using both ESTs and protein sequence to train your ab-initio predictors, then both est2genome and protein2genome should be set to 1.
3) If you want to pass BLAST results to MAKER, you'll need to pass them as GFF3. However, MAKER will install and run BLAST for you, and it does a good job of keeping track of all those results and making them accessible to you in the end, so doing those BLAST searches on your own outside of MAKER is going to be a lot of work. I strongly suggest that you use the BLAST internal to MAKER.

Daniel Ence
Graduate Student
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330
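Point 2 of the reply above amounts to two one-line changes in maker_opts.ctl. The sketch below shows one way to make them from the shell. It runs on a throwaway two-line stand-in for the control file (the real file has many more options, and the comment text in the stand-in is only illustrative), and the set_training_flags name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: set est2genome and protein2genome to 1 in a MAKER
# control file, keeping a .bak copy of the original next to it.
set_training_flags() {
    ctl="$1"
    # -i.bak works with both GNU and BSD sed; only the value after "=" changes.
    sed -i.bak -E 's/^(est2genome|protein2genome)=[01]/\1=1/' "$ctl"
}

# Demo on a stand-in file (NOT a complete maker_opts.ctl).
demo_dir=$(mktemp -d)
cat > "$demo_dir/maker_opts.ctl" <<'EOF'
est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no
protein2genome=0 #infer predictions from protein homology, 1 = yes, 0 = no
EOF

set_training_flags "$demo_dir/maker_opts.ctl"
grep -E '^(est2genome|protein2genome)=' "$demo_dir/maker_opts.ctl"
```

As point 1 of the reply notes, these two flags are meant for the run(s) that produce training models, so the .bak copy makes it easy to restore the previous values afterwards.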
From panos.ioannidis at gmail.com Mon Jul 14 01:20:50 2014
From: panos.ioannidis at gmail.com (Panos Ioannidis)
Date: Mon, 14 Jul 2014 09:20:50 +0200
Subject: [maker-devel] (no subject)
In-Reply-To:
References:
Message-ID:

Daniel, thanks for the info.

Regarding (3), the only reason I'm thinking of running the BLASTs separately is that I'm currently not able to run Maker on our cluster due to a problem in the Perl "forks" library. And it looks like there isn't much I can do about it; I tried Perlbrew, but it doesn't work when I try to install versions <5.18 (the problem in forks occurs on 5.18 and later versions). Our admin also tried to change the code in the forks.pm file as per Carson's suggestion in another thread, but that didn't work either... As a result, I'm running Maker on my workstation (really slooow) until a solution is found, and since BLAST is a time-consuming step, I was thinking of running it separately.

From carsonhh at gmail.com Mon Jul 14 08:46:50 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 14 Jul 2014 08:46:50 -0600
Subject: [maker-devel] (no subject)
In-Reply-To:
References:
Message-ID:

If you do the BLASTs yourself, the results could be dramatically worse. The filtering and polishing done by MAKER is rather significant (direct BLAST is actually worse at homology searches than many people realize).
With respect to forks.pm, your admin most likely edited the wrong forks.pm. There may be more than one on your system. If you let maker install some prerequisites for you (because it requires a specific version of forks.pm), it may be in .../maker/perl/lib/forks.pm. Otherwise you have to identify the exact location of the forks.pm being used. Or, if he is editing it as part of the install tarball, his edits may actually be undone during the installation procedure.

Use this command line to identify the location of the forks.pm module that would have to be edited -->
maker --debug 2>&1 | grep "forks.pm"

You can even send me a copy of the file once it has been edited, and I can tell you if it was done correctly.

--Carson

From carsonhh at gmail.com Mon Jul 14 08:49:33 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 14 Jul 2014 08:49:33 -0600
Subject: [maker-devel] (no subject)
In-Reply-To:
References:
Message-ID:

Also, one more question. What is the exact error text you get for the forks error? Is it a forks.pm error, or is it an MPI warn on fork error? (The two are actually very different.)

--Carson

From panos.ioannidis at gmail.com Tue Jul 15 00:59:18 2014
From: panos.ioannidis at gmail.com (Panos Ioannidis)
Date: Tue, 15 Jul 2014 08:59:18 +0200
Subject: [maker-devel] (no subject)
In-Reply-To:
References:
Message-ID:

I didn't know there could be more than one forks.pm file! We'll give it another try later today.
As for the error, it's just "Segmentation fault"! And we know this segfault is caused by forks.pm, because if you remove the "use forks;" line, script execution continues without a segfault (until it crashes later for another reason, of course). In fact, even if you create a script with just the line "use forks;" and try to run it, you'll get a segfault. So it looks like it's something pretty general and serious, and I'm really surprised I can't find anything by googling (except your fix!)...

From carsonhh at gmail.com Tue Jul 15 07:58:20 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 15 Jul 2014 07:58:20 -0600
Subject: [maker-devel] (no subject)
In-Reply-To:
References:
Message-ID:

If you are getting a segfault.
It is more likely an MPI error, especially if you are using OpenMPI or MVAPICH2. They both use OpenFabrics libraries that have bugs with forks and system calls.

If it is OpenMPI, run the following command before launching MAKER -->
export LD_PRELOAD=.../openmpi_location/lib/libmpi.so

Make sure to replace openmpi_location with the location of your OpenMPI.

Also add the following to your MPI command before running MAKER.
--> -mca btl ^openib
Example --> mpiexec -mca btl ^openib -n 40 maker

If you are using MVAPICH2, then you need to switch to OpenMPI.

--Carson
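The two OpenMPI steps above can be combined into one launch line. The sketch below only prints that line (a dry run) so it can be inspected before use. The OPENMPI_HOME variable, its /opt/openmpi default and the maker_mpi_cmd function name are placeholders, not anything defined by MAKER or OpenMPI; the -mca btl ^openib option and the libmpi.so preload are taken from the message above.

```shell
#!/bin/sh
# Placeholder location: replace with the directory of your own OpenMPI install.
OPENMPI_HOME="${OPENMPI_HOME:-/opt/openmpi}"

# Hypothetical helper: print (do not run) the launch line for N MPI processes.
maker_mpi_cmd() {
    nprocs="$1"
    # LD_PRELOAD is set only for this one command instead of being exported,
    # and "^openib" is quoted so the shell never treats "^" specially.
    printf "LD_PRELOAD=%s/lib/libmpi.so mpiexec -mca btl '^openib' -n %s maker\n" \
        "$OPENMPI_HOME" "$nprocs"
}

maker_mpi_cmd 40
```

The message above exports LD_PRELOAD rather than setting it per command; whether the per-command form reaches every MPI rank on a multi-node run has not been verified here, so the exported form is the safer default on a cluster.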
> > With respect to forks.pm , your admin most likely edited the > wrong forks.pm . There may be more than one on your system. > If you let maker install some prerequisites for you (because it requires a > specific version of forks.pm ), it may be in > .../maker/perl/lib/forks.pm . Otherwise you have to > identify the exact location of the forks.pm being used. Or > if he is editing it as part of the install tarball, his edits may actually be > undone during the installation procedure. > > Use this command line to identify the location of the forks.pm > module that would have to be edited --> > maker --debug 2>&1 | grep "forks.pm " > > You can even send me a copy of the file once it has been edited, and I can > tell you if it was done correctly. > > --Carson > > > > > From: Panos Ioannidis > Date: Monday, July 14, 2014 at 1:20 AM > To: Daniel Ence > Cc: maker-devel > Subject: Re: [maker-devel] (no subject) > > Daniel, thanks for the info. > > Regarding (3), the only reason I think of running BLASTs separately is because > I'm currently not able to run Maker on our cluster due to a problem in the > Perl "forks" library. And it looks like there isn't much I can do about it; I > tried Perlbrew but it doesn't work when I try to install versions <5.18 (the > problem in forks occurs on 5.18 and later versions). Our admin also tried to > change the code in the forks.pm file as per Carson's > suggestion in another thread, but that didn't work either... As a result I'm > running Maker on my workstation (really slooow) till a solution is found and > since BLAST is a time-consuming step I was thinking of running it separately. > > > On Fri, Jul 11, 2014 at 4:08 PM, Daniel Ence wrote: >> Hi Panos, >> >> 1) You'll only use est2genome and protein2genome for creating models that >> will be used for training the ab-initio predictors (like SNAP). Sometimes >> that means one run of MAKER for training; sometimes that means two runs of >> MAKER. 
You usually don't gain any accuracy after the second round of >> training. It's ok to use both EST and protein data for this training step. >> >> 2) If you're using both ESTs and protein sequence to train your ab-initio >> predictors, then both est2genome and protein2genome should be set to 1. >> >> 3) If you want to pass Blast results to MAKER, you'll need to pass those >> results as GFF3, but MAKER will install and run blast for you, and does a >> good job of keeping track of all those results and making them accessible to >> you in the end, so it's going to be a lot of work to do those blasts on your >> own outside of MAKER. I seriously suggest that you use blast internal to >> maker. >> >> Daniel Ence >> Graduate Student >> Eccles Institute of Human Genetics >> University of Utah >> 15 North 2030 East, Room 2100 >> Salt Lake City, UT 84112-5330 >> >> From: maker-devel [maker-devel-bounces at yandell-lab.org] on behalf of Panos >> Ioannidis [panos.ioannidis at gmail.com] >> Sent: Friday, July 11, 2014 5:56 AM >> To: maker-devel >> Subject: [maker-devel] (no subject) >> >> I got back to my annotations this past week and have a couple of questions! >> >> 1) Since my organism isn't closely related with any other that's already >> sequenced, I will have to run maker twice (according to the tutorial). So for >> the first run I see that some people use only the ESTs and some others use >> ESTs and a protein database (CEGMA, Uniref50, Swiss-Prot, etc). I guess that >> the ESTs will give better models, but for the cases where genes aren't >> covered by an EST, it's okay to have a protein database to detect them as >> well. Am I right? What do you think? >> >> 2) In case I use both ESTs and a protein database how should I set the >> est2genome and protein2genome parameters in the maker_opts.ctl file? Should >> they both equal to "1"? >> >> 3) I've been thinking of running the BLAST searches separately and giving >> Maker directly the results. 
>> I guess that in this case, I'll have to first convert the BLAST output to a gff3 file and give it to the protein_gff parameter, right?
>>
>> Thanks,
>> Panos
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From panos.ioannidis at gmail.com Tue Jul 15 08:03:12 2014
From: panos.ioannidis at gmail.com (Panos Ioannidis)
Date: Tue, 15 Jul 2014 16:03:12 +0200
Subject: [maker-devel] (no subject)

Carson, many thanks for the info!

I haven't installed Maker with MPI support. Is this segfault only occurring when you install it with MPI support?

On Tue, Jul 15, 2014 at 3:58 PM, Carson Holt wrote:

> If you are getting a segfault, it is more likely an MPI error, especially if you are using OpenMPI or MVAPICH2. They both use OpenFabrics libraries that have bugs on forks and system calls.
>
> If it is OpenMPI, run the following command before launching MAKER -->
> export LD_PRELOAD=/openmpi_location/lib/libmpi.so
>
> Make sure to replace openmpi_location with the location of your OpenMPI.
>
> Also add the following to your MPI command before running MAKER.
> --> -mca btl ^openib
> Example --> mpiexec -mca btl ^openib -n 40 maker
>
> If you are using MVAPICH2, then you need to switch to OpenMPI.
>
> --Carson
>
> From: Panos Ioannidis
> Date: Tuesday, July 15, 2014 at 12:59 AM
> To: Carson Holt
> Cc: Daniel Ence, maker-devel <maker-devel at yandell-lab.org>
> Subject: Re: [maker-devel] (no subject)
>
> I didn't know there is more than one forks.pm file! We'll give it another try later today.
>
> As for the error, it's just "Segmentation fault"! And we know this segfault is because of forks.pm, because if you remove the "use forks;" line, script execution continues without a segfault (till it crashes later for another reason, of course). In fact, even if you create a script with just the line "use forks;" and try to run it, you'll get a segfault. So it looks like it's something pretty general and serious, and I'm really surprised I can't find anything by googling (except your fix!)...
>
> On Mon, Jul 14, 2014 at 4:49 PM, Carson Holt wrote:
>
>> Also one more question. What is the exact error text you get for the forks error? Is it a forks.pm error or is it an MPI warn on fork error (which are actually very different)?
>>
>> --Carson
>>
>> From: Carson Holt
>> Date: Monday, July 14, 2014 at 8:46 AM
>> To: Panos Ioannidis, Daniel Ence <dence at genetics.utah.edu>
>> Cc: maker-devel
>> Subject: Re: [maker-devel] (no subject)
>>
>> If you do the BLASTs yourself the results could be dramatically worse. The filtering and polishing done by MAKER is rather significant (direct BLAST is actually worse with homology searches than many people realize).
>>
>> With respect to forks.pm, your admin most likely edited the wrong forks.pm. There may be more than one on your system. If you let maker install some prerequisites for you (because it requires a specific version of forks.pm), it may be in .../maker/perl/lib/forks.pm. Otherwise you have to identify the exact location of the forks.pm being used. Or if he is editing it as part of the install tarball, his edits may actually be undone during the installation procedure.
>>
>> Use this command line to identify the location of the forks.pm module that would have to be edited -->
>> maker --debug 2>&1 | grep "forks.pm"
>>
>> You can even send me a copy of the file once it has been edited, and I can tell you if it was done correctly.
>>
>> --Carson
>>
>> From: Panos Ioannidis
>> Date: Monday, July 14, 2014 at 1:20 AM
>> To: Daniel Ence
>> Cc: maker-devel
>> Subject: Re: [maker-devel] (no subject)
>>
>> Daniel, thanks for the info.
>>
>> Regarding (3), the only reason I think of running BLASTs separately is because I'm currently not able to run Maker on our cluster due to a problem in the Perl "forks" library. And it looks like there isn't much I can do about it; I tried Perlbrew but it doesn't work when I try to install versions <5.18 (the problem in forks occurs on 5.18 and later versions). Our admin also tried to change the code in the forks.pm file as per Carson's suggestion in another thread, but that didn't work either... As a result I'm running Maker on my workstation (really slooow) till a solution is found, and since BLAST is a time-consuming step I was thinking of running it separately.
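Carson's OpenMPI workaround from the quoted thread above can be sketched as a small shell helper. This is only an illustration under stated assumptions: `print_openmpi_workaround` is a hypothetical function name, `/openmpi_location` is the placeholder from the original message and has to be replaced with a real install prefix, and the stray `?` in the archived `LD_PRELOAD` line is assumed to be an encoding artifact. The helper prints the two commands instead of running them, so nothing is launched by accident.

```shell
#!/bin/sh
# Hypothetical helper: prints the two commands from the advice above.
# $1 = OpenMPI install prefix (placeholder in the thread: /openmpi_location)
# $2 = number of MPI processes (the example in the thread uses 40)
print_openmpi_workaround() {
    prefix=$1
    nproc=$2
    # Preload libmpi.so before MAKER is launched.
    printf 'export LD_PRELOAD=%s/lib/libmpi.so\n' "$prefix"
    # Exclude the openib transport when starting MAKER under mpiexec.
    printf 'mpiexec -mca btl ^openib -n %s maker\n' "$nproc"
}

print_openmpi_workaround /openmpi_location 40
```

Review the printed lines and run them by hand once the prefix and process count match your cluster.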
From carsonhh at gmail.com Tue Jul 15 08:10:24 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 15 Jul 2014 08:10:24 -0600
Subject: [maker-devel] (no subject)

If you don't have MPI support, it's not an issue, and your segfault is likely something else. Your reference to Perl 5.18 and forks.pm should not be a segfault error either, and would not represent your error. The Perl 5.18/forks.pm problem is a different issue, where perl actually tells itself to die because hash reshuffling isn't safe, whereas segfaults are caused by binary corruption or incorrect memory access (very different issues). I'd actually recommend a full perl reinstall if you are getting segfaults, because it suggests a deeper issue.

--Carson

From panos.ioannidis at gmail.com Wed Jul 16 06:26:56 2014
From: panos.ioannidis at gmail.com (Panos Ioannidis)
Date: Wed, 16 Jul 2014 14:26:56 +0200
Subject: [maker-devel] (no subject)

Thanks for the help guys.
All really helpful!

Unfortunately, a full Perl reinstall is not an option at this moment on this machine, since it's used by others in our group and they need it running... I'll try to find another solution with our admin.

From carsonhh at gmail.com Wed Jul 16 08:04:55 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 16 Jul 2014 08:04:55 -0600
Subject: [maker-devel] (no subject)

You don't have to do a system-wide install. It is incredibly easy to have multiple installations of Perl. Perlbrew, for example, makes it easy to install and switch between multiple versions rapidly (and doesn't affect the system install) --> http://perlbrew.pl

You can then test. The perl installation used by different programs is determined by the '#!' header in the executable script and not by the default location of your system's perl (look at the first line in .../maker/bin/maker and you will see what I mean). This value gets set during the initial installation, and whatever perl path you use to run MAKER's Build.PL script will end up being the one used to run MAKER, even if the system perl is different.

--Carson
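To check which perl an installed MAKER will actually use, you can read the '#!' line of the script, as suggested above. The sketch below is only an illustration: `shebang_interpreter` and the `MAKER_BIN` variable are hypothetical names, and the default path is a stand-in for your own `.../maker/bin/maker`.

```shell
#!/bin/sh
# Print the interpreter named on the '#!' line of the script given as $1.
# Prints nothing if the first line is not a '#!' line.
shebang_interpreter() {
    head -n 1 "$1" | sed -n 's/^#![[:space:]]*\([^[:space:]]*\).*/\1/p'
}

# MAKER_BIN is a hypothetical variable; set it to your own maker script.
MAKER_BIN="${MAKER_BIN:-maker/bin/maker}"
if [ -f "$MAKER_BIN" ]; then
    echo "MAKER runs under: $(shebang_interpreter "$MAKER_BIN")"
    echo "perl first on PATH: $(command -v perl)"
fi
```

If the two paths differ, MAKER is not using the perl you get by typing `perl`. Per the message above, the path on the first line is fixed by whichever perl ran Build.PL at install time.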
URL: From nguyenan at mail.nih.gov Wed Jul 16 11:15:10 2014 From: nguyenan at mail.nih.gov (Nguyen, Anh-Dao (NIH/NHGRI) [C]) Date: Wed, 16 Jul 2014 17:15:10 +0000 Subject: [maker-devel] Maker_opts.ctl Message-ID: Hi, I would like to conduct a genome annotation and have the following data: - Two separate RepeatMasker outputs (using -lib and -species options) - ESTs and RACE (fasta) - proteins (fasta) - proteins of related organisms (fasta) - SNAP's .hmm file (ran CEGMA, then used cegma2zff.pl to convert to ZFF format, etc. ) - GeneMark's .hmm file (es.mod file from running gm_es.pl) - FGENESH++ and Augustus gene predictions. I wrote scripts to convert the outputs to .gff3 files. The reason why I ran Augustus gene prediction separately, because the genome has never been trained for Augustus. - Cufflinks and Trinity from RNA-Seq Could you please let me know how can I specify parameters in the maker_opts.ctl file? Or do you have other suggestions to re-do the data listed above? Thanks. Anh-Dao From dence at genetics.utah.edu Wed Jul 16 12:13:46 2014 From: dence at genetics.utah.edu (Daniel Ence) Date: Wed, 16 Jul 2014 12:13:46 -0600 Subject: [maker-devel] Maker_opts.ctl In-Reply-To: References: Message-ID: <5B00AEBB-D242-4C7C-A40F-C1BF3EC48C96@genetics.utah.edu> Hi Anh-Dao, In the maker_opts.ctl file, there are options for est and protein evidence. You?ll put all of your fasta est files together in a command separated list in the ?est" option, and all of your fasta protein files in a command separated list for the ?protein? option. You?ll specify the SNAP and Genemark files in their respective options in the control file and pass the augustus and fgenesh predictions in the ?pred_gff? option. If you have the RepeatMasker output in gff3 format you can give it to maker with the ?rm_gff? option. If you?ve converted the cufflinks output to gff3, you can give it to maker with the ?est_gff? option. 
I'm pretty sure Trinity only gives fasta output, so you would put that in the "est" option, along with all the other est fasta files.

If Augustus isn't trained for your particular organism, then you can use another organism that augustus is already trained for. The list of species that augustus has parameter files for is in the README.txt that came with Augustus. I really recommend that you run Augustus from inside maker, because then you get all the benefits of maker passing ext-based hints to augustus at runtime, which can really improve Augustus' predictive ability.

When you ran the augustus gene prediction separately, did you use another organism's parameter file?

Thanks,
Daniel

On Jul 16, 2014, at 11:15 AM, Nguyen, Anh-Dao (NIH/NHGRI) [C] wrote:

> Hi,
>
> I would like to conduct a genome annotation and have the following data:
> - Two separate RepeatMasker outputs (using -lib and -species options)
> - ESTs and RACE (fasta)
> - proteins (fasta)
> - proteins of related organisms (fasta)
> - SNAP's .hmm file (ran CEGMA, then used cegma2zff.pl to convert to ZFF format, etc. )
> - GeneMark's .hmm file (es.mod file from running gm_es.pl)
> - FGENESH++ and Augustus gene predictions. I wrote scripts to convert the outputs to .gff3 files. The reason why I ran Augustus gene prediction separately, because the genome has never been trained for Augustus.
> - Cufflinks and Trinity from RNA-Seq
>
> Could you please let me know how can I specify parameters in the maker_opts.ctl file?
> Or do you have other suggestions to re-do the data listed above?
>
> Thanks.
> Anh-Dao
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From nguyenan at mail.nih.gov Wed Jul 16 12:30:10 2014
From: nguyenan at mail.nih.gov (Nguyen, Anh-Dao (NIH/NHGRI) [C])
Date: Wed, 16 Jul 2014 18:30:10 +0000
Subject: [maker-devel] Maker_opts.ctl
In-Reply-To: <5B00AEBB-D242-4C7C-A40F-C1BF3EC48C96@genetics.utah.edu>
References: <5B00AEBB-D242-4C7C-A40F-C1BF3EC48C96@genetics.utah.edu>
Message-ID:

Thanks, Daniel, for your quick response.

I did not use another organism's parameter file when running Augustus. I created the parameter file for the genome following their instructions. There were multiple steps to train and run Augustus (Creating gene structures for training AUGUSTUS with CEGMA => parameter file will be created; Creating Hints for AUGUSTUS from ESTs/cDNA sequences; Incorporating Illumina RNAseq into AUGUSTUS with GSNAP, etc.)
As I mentioned, I ran Augustus separately because Augustus had not been trained for this genome (no parameter file existed). Otherwise I would have run Augustus inside MAKER.

You suggested using the rm_gff option to specify the RepeatMasker output (sure, I will convert the files to .gff3 format). Can I submit two RM .gff3 files, separated by a comma?

Anh-Dao


On 7/16/14 2:13 PM, "Daniel Ence" wrote:

>Hi Anh-Dao,
>
>In the maker_opts.ctl file, there are options for est and protein
>evidence. You'll put all of your fasta est files together in a command
>separated list in the "est" option, and all of your fasta protein files
>in a command separated list for the "protein" option.
>
>You'll specify the SNAP and Genemark files in their respective options in
>the control file and pass the augustus and fgenesh predictions in the
>"pred_gff" option.
>
>If you have the RepeatMasker output in gff3 format you can give it to
>maker with the "rm_gff" option.
>
>If you've converted the cufflinks output to gff3, you can give it to
>maker with the "est_gff" option. I'm pretty sure Trinity only gives fasta
>output, so you would put that in the "est" option, along with all the
>other est fasta files.
>
>If Augustus isn't trained for your particular organism, then you can use
>another organism that augustus is already trained for. The list of
>species that augustus has parameter files for is in the README.txt that
>came with Augustus. I really recommend that you run Augustus from inside
>maker, because then you get all the benefits of maker passing ext-based
>hints to augustus at runtime, which can really improve Augustus'
>predictive ability.
>
>When you ran the augustus gene prediction separately, did you use another
>organism's parameter file?
>
>Thanks,
>Daniel
>
>
>On Jul 16, 2014, at 11:15 AM, Nguyen, Anh-Dao (NIH/NHGRI) [C]
> wrote:
>
>> Hi,
>>
>> I would like to conduct a genome annotation and have the following data:
>> - Two separate RepeatMasker outputs (using -lib and -species options)
>> - ESTs and RACE (fasta)
>> - proteins (fasta)
>> - proteins of related organisms (fasta)
>> - SNAP's .hmm file (ran CEGMA, then used cegma2zff.pl to convert to ZFF format, etc. )
>> - GeneMark's .hmm file (es.mod file from running gm_es.pl)
>> - FGENESH++ and Augustus gene predictions. I wrote scripts to convert
>>the outputs to .gff3 files. The reason why I ran Augustus gene
>>prediction separately, because the genome has never been trained for
>>Augustus.
>> - Cufflinks and Trinity from RNA-Seq
>>
>> Could you please let me know how can I specify parameters in the
>>maker_opts.ctl file?
>> Or do you have other suggestions to re-do the data listed above?
>>
>> Thanks.
>> Anh-Dao
>>
>> _______________________________________________
>> maker-devel mailing list
>> maker-devel at box290.bluehost.com
>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>

From carsonhh at gmail.com Wed Jul 16 12:36:57 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 16 Jul 2014 12:36:57 -0600
Subject: [maker-devel] Maker_opts.ctl
In-Reply-To:
References: <5B00AEBB-D242-4C7C-A40F-C1BF3EC48C96@genetics.utah.edu>
Message-ID:

When you ran Augustus separately, it should have created the parameters needed to run it. Now you should be able to run it inside of MAKER using the species name you just created.

I'd also recommend letting MAKER run RepeatMasker for you rather than giving it the results as GFF3.

--Carson


On 7/16/14, 12:30 PM, "Nguyen, Anh-Dao (NIH/NHGRI) [C]" wrote:

>Thanks Daniel for your quick response.
>
>I did not use the parameter file of other organism when running Augustus.
>I created the parameter file for the genome following their instructions.
>There were multiple steps to train and run Augustus (Creating gene
>structures for training AUGUSTUS with CEGMA => parameter file will be
>created; Creating Hints for AUGUSTUS from ESTs/cDNA sequences;
>Incorporating Illumina RNAseq into AUGUSTUS with GSNAP, etc.)
>As I mentioned the reason why I ran Augustus separately, because Augustus
>has not trained that genome (no parameter file exists). Otherwise I would
>run Augustus inside MAKER.
>
>You suggested to use rm_gff option to specify RepeatMasker output (sure I
>will convert them to .gff3 formatted files). Can I submit two RM .gff3
>files, separated by comma?
>
>Anh-Dao
>
>
>On 7/16/14 2:13 PM, "Daniel Ence" wrote:
>
>>Hi Anh-Dao,
>>
>>In the maker_opts.ctl file, there are options for est and protein
>>evidence. You'll put all of your fasta est files together in a command
>>separated list in the "est" option, and all of your fasta protein files
>>in a command separated list for the "protein" option.
>>
>>You'll specify the SNAP and Genemark files in their respective options in
>>the control file and pass the augustus and fgenesh predictions in the
>>"pred_gff" option.
>>
>>If you have the RepeatMasker output in gff3 format you can give it to
>>maker with the "rm_gff" option.
>>
>>If you've converted the cufflinks output to gff3, you can give it to
>>maker with the "est_gff" option. I'm pretty sure Trinity only gives fasta
>>output, so you would put that in the "est" option, along with all the
>>other est fasta files.
>>
>>If Augustus isn't trained for your particular organism, then you can use
>>another organism that augustus is already trained for. The list of
>>species that augustus has parameter files for is in the README.txt that
>>came with Augustus. I really recommend that you run Augustus from inside
>>maker, because then you get all the benefits of maker passing ext-based
>>hints to augustus at runtime, which can really improve Augustus'
>>predictive ability.
>>
>>When you ran the augustus gene prediction separately, did you use another
>>organism's parameter file?
>>
>>Thanks,
>>Daniel
>>
>>
>>On Jul 16, 2014, at 11:15 AM, Nguyen, Anh-Dao (NIH/NHGRI) [C]
>> wrote:
>>
>>> Hi,
>>>
>>> I would like to conduct a genome annotation and have the following data:
>>> - Two separate RepeatMasker outputs (using -lib and -species options)
>>> - ESTs and RACE (fasta)
>>> - proteins (fasta)
>>> - proteins of related organisms (fasta)
>>> - SNAP's .hmm file (ran CEGMA, then used cegma2zff.pl to convert to ZFF format, etc. )
>>> - GeneMark's .hmm file (es.mod file from running gm_es.pl)
>>> - FGENESH++ and Augustus gene predictions. I wrote scripts to convert
>>>the outputs to .gff3 files. The reason why I ran Augustus gene
>>>prediction separately, because the genome has never been trained for
>>>Augustus.
>>> - Cufflinks and Trinity from RNA-Seq
>>>
>>> Could you please let me know how can I specify parameters in the
>>>maker_opts.ctl file?
>>> Or do you have other suggestions to re-do the data listed above?
>>>
>>> Thanks.
>>> Anh-Dao
>>>
>>> _______________________________________________
>>> maker-devel mailing list
>>> maker-devel at box290.bluehost.com
>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>
>
>
>_______________________________________________
>maker-devel mailing list
>maker-devel at box290.bluehost.com
>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From carsonhh at gmail.com Wed Jul 16 12:41:47 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 16 Jul 2014 12:41:47 -0600
Subject: [maker-devel] Maker_opts.ctl
In-Reply-To:
References: <5B00AEBB-D242-4C7C-A40F-C1BF3EC48C96@genetics.utah.edu>
Message-ID:

If you can provide me the command lines you used to train Augustus, I can point you to the proper species parameters to give to MAKER. Normally these are the same as one of the directory names under .../augustus/config/species/.

You can also let MAKER run FGENESH for you. Either way you can pass it in as GFF3, but if you let MAKER run it for you, then MAKER can "talk" to the predictor by giving it evidence-based hints as it is running. This improves the overall performance of the algorithm compared to running it outside of MAKER.

Thanks,
Carson


On 7/16/14, 12:36 PM, "Carson Holt" wrote:

>When you ran Augustus separately, it should have created the parameters
>needed to run it. Now you should be able to run it inside of MAKER using
>the species name you just created.
>
>I'd also recommend letting MAKER run RepeatMasker for you rather than
>giving it the results as GFF3.
>
>--Carson
>
>
>On 7/16/14, 12:30 PM, "Nguyen, Anh-Dao (NIH/NHGRI) [C]"
> wrote:
>
>>Thanks Daniel for your quick response.
>>
>>I did not use the parameter file of other organism when running Augustus.
>>I created the parameter file for the genome following their instructions.
>>There were multiple steps to train and run Augustus (Creating gene
>>structures for training AUGUSTUS with CEGMA => parameter file will be
>>created; Creating Hints for AUGUSTUS from ESTs/cDNA sequences;
>>Incorporating Illumina RNAseq into AUGUSTUS with GSNAP, etc.)
>>As I mentioned the reason why I ran Augustus separately, because Augustus
>>has not trained that genome (no parameter file exists). Otherwise I would
>>run Augustus inside MAKER.
>>
>>You suggested to use rm_gff option to specify RepeatMasker output (sure I
>>will convert them to .gff3 formatted files). Can I submit two RM .gff3
>>files, separated by comma?
>>
>>Anh-Dao
>>
>>
>>On 7/16/14 2:13 PM, "Daniel Ence" wrote:
>>
>>>Hi Anh-Dao,
>>>
>>>In the maker_opts.ctl file, there are options for est and protein
>>>evidence. You'll put all of your fasta est files together in a command
>>>separated list in the "est" option, and all of your fasta protein files
>>>in a command separated list for the "protein" option.
>>>
>>>You'll specify the SNAP and Genemark files in their respective options in
>>>the control file and pass the augustus and fgenesh predictions in the
>>>"pred_gff" option.
>>>
>>>If you have the RepeatMasker output in gff3 format you can give it to
>>>maker with the "rm_gff" option.
>>>
>>>If you've converted the cufflinks output to gff3, you can give it to
>>>maker with the "est_gff" option. I'm pretty sure Trinity only gives fasta
>>>output, so you would put that in the "est" option, along with all the
>>>other est fasta files.
>>>
>>>If Augustus isn't trained for your particular organism, then you can use
>>>another organism that augustus is already trained for. The list of
>>>species that augustus has parameter files for is in the README.txt that
>>>came with Augustus. I really recommend that you run Augustus from inside
>>>maker, because then you get all the benefits of maker passing ext-based
>>>hints to augustus at runtime, which can really improve Augustus'
>>>predictive ability.
>>>
>>>When you ran the augustus gene prediction separately, did you use another
>>>organism's parameter file?
>>>
>>>Thanks,
>>>Daniel
>>>
>>>
>>>On Jul 16, 2014, at 11:15 AM, Nguyen, Anh-Dao (NIH/NHGRI) [C]
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I would like to conduct a genome annotation and have the following data:
>>>> - Two separate RepeatMasker outputs (using -lib and -species options)
>>>> - ESTs and RACE (fasta)
>>>> - proteins (fasta)
>>>> - proteins of related organisms (fasta)
>>>> - SNAP's .hmm file (ran CEGMA, then used cegma2zff.pl to convert to ZFF format, etc. )
>>>> - GeneMark's .hmm file (es.mod file from running gm_es.pl)
>>>> - FGENESH++ and Augustus gene predictions. I wrote scripts to convert
>>>>the outputs to .gff3 files. The reason why I ran Augustus gene
>>>>prediction separately, because the genome has never been trained for
>>>>Augustus.
>>>> - Cufflinks and Trinity from RNA-Seq
>>>>
>>>> Could you please let me know how can I specify parameters in the
>>>>maker_opts.ctl file?
>>>> Or do you have other suggestions to re-do the data listed above?
>>>>
>>>> Thanks.
>>>> Anh-Dao
>>>>
>>>> _______________________________________________
>>>> maker-devel mailing list
>>>> maker-devel at box290.bluehost.com
>>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>>
>>
>>
>>_______________________________________________
>>maker-devel mailing list
>>maker-devel at box290.bluehost.com
>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>
>

From dence at genetics.utah.edu Wed Jul 16 12:42:16 2014
From: dence at genetics.utah.edu (Daniel Ence)
Date: Wed, 16 Jul 2014 12:42:16 -0600
Subject: [maker-devel] Maker_opts.ctl
In-Reply-To:
References: <5B00AEBB-D242-4C7C-A40F-C1BF3EC48C96@genetics.utah.edu>
Message-ID:

Hi Anh-Dao, as I understand it, the process of training and running Augustus will create a set of "param" files that Augustus can use later on. If that's true, then you can just copy those files to the "config/species" folder of your Augustus installation, and then Augustus (when you call it from inside maker) can use those parameters when it runs.

Did you end up with a gff3 file or with files like "exon_prob", "utr_probs" from Augustus? Or did you have both?

I'm pretty sure that you can't use a comma-separated list for the rm_gff. You could concatenate the two files and then pass the one file to maker, but you also might need to have it sorted by genomic location. Carson could confirm that for me.

~Daniel


On Jul 16, 2014, at 12:30 PM, Nguyen, Anh-Dao (NIH/NHGRI) [C] wrote:

> Thanks Daniel for your quick response.
>
> I did not use the parameter file of other organism when running Augustus.
> I created the parameter file for the genome following their instructions.
> There were multiple steps to train and run Augustus (Creating gene
> structures for training AUGUSTUS with CEGMA => parameter file will be
> created; Creating Hints for AUGUSTUS from ESTs/cDNA sequences;
> Incorporating Illumina RNAseq into AUGUSTUS with GSNAP, etc.)
> As I mentioned the reason why I ran Augustus separately, because Augustus
> has not trained that genome (no parameter file exists). Otherwise I would
> run Augustus inside MAKER.
>
> You suggested to use rm_gff option to specify RepeatMasker output (sure I
> will convert them to .gff3 formatted files). Can I submit two RM .gff3
> files, separated by comma?
>
> Anh-Dao
>
>
> On 7/16/14 2:13 PM, "Daniel Ence" wrote:
>
>> Hi Anh-Dao,
>>
>> In the maker_opts.ctl file, there are options for est and protein
>> evidence. You'll put all of your fasta est files together in a command
>> separated list in the "est" option, and all of your fasta protein files
>> in a command separated list for the "protein" option.
>>
>> You'll specify the SNAP and Genemark files in their respective options in
>> the control file and pass the augustus and fgenesh predictions in the
>> "pred_gff" option.
>>
>> If you have the RepeatMasker output in gff3 format you can give it to
>> maker with the "rm_gff" option.
>>
>> If you've converted the cufflinks output to gff3, you can give it to
>> maker with the "est_gff" option. I'm pretty sure Trinity only gives fasta
>> output, so you would put that in the "est" option, along with all the
>> other est fasta files.
>>
>> If Augustus isn't trained for your particular organism, then you can use
>> another organism that augustus is already trained for. The list of
>> species that augustus has parameter files for is in the README.txt that
>> came with Augustus. I really recommend that you run Augustus from inside
>> maker, because then you get all the benefits of maker passing ext-based
>> hints to augustus at runtime, which can really improve Augustus'
>> predictive ability.
>>
>> When you ran the augustus gene prediction separately, did you use another
>> organism's parameter file?
>>
>> Thanks,
>> Daniel
>>
>>
>> On Jul 16, 2014, at 11:15 AM, Nguyen, Anh-Dao (NIH/NHGRI) [C]
>> wrote:
>>
>>> Hi,
>>>
>>> I would like to conduct a genome annotation and have the following data:
>>> - Two separate RepeatMasker outputs (using -lib and -species options)
>>> - ESTs and RACE (fasta)
>>> - proteins (fasta)
>>> - proteins of related organisms (fasta)
>>> - SNAP's .hmm file (ran CEGMA, then used cegma2zff.pl to convert to ZFF
>>> format, etc. )
>>> - GeneMark's .hmm file (es.mod file from running gm_es.pl)
>>> - FGENESH++ and Augustus gene predictions. I wrote scripts to convert
>>> the outputs to .gff3 files. The reason why I ran Augustus gene
>>> prediction separately, because the genome has never been trained for
>>> Augustus.
>>> - Cufflinks and Trinity from RNA-Seq
>>>
>>> Could you please let me know how can I specify parameters in the
>>> maker_opts.ctl file?
>>> Or do you have other suggestions to re-do the data listed above?
>>>
>>> Thanks.
>>> Anh-Dao
>>>
>>> _______________________________________________
>>> maker-devel mailing list
>>> maker-devel at box290.bluehost.com
>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>
>

From carsonhh at gmail.com Wed Jul 16 12:43:33 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 16 Jul 2014 12:43:33 -0600
Subject: [maker-devel] Maker_opts.ctl
In-Reply-To:
References: <5B00AEBB-D242-4C7C-A40F-C1BF3EC48C96@genetics.utah.edu>
Message-ID:

You can use comma separated lists.

--Carson


On 7/16/14, 12:42 PM, "Daniel Ence" wrote:

>Hi Anh-Dao, so as I understand it, the process of training and running
>augustus will create a set of "param" file that Augustus can use later
>on. If that's true, then you can just copy those files to the
>"config/species" folder of your augustus installation and then augustus
>(when you call it from inside maker) can use those parameters when it
>runs.
>
>Did you end up with a gff3 file or with files like "exon_prob",
>"utr_probs" from augustus?
>Or did you have both?
>
>I'm pretty sure that you can't use a comma-separated list for the rm_gff.
>You could concatenate the two files and then pass the one file to maker,
>but you also might need to have it sorted by genomic location. Carson
>could confirm that for me.
>
>~Daniel
>
>
>On Jul 16, 2014, at 12:30 PM, Nguyen, Anh-Dao (NIH/NHGRI) [C]
> wrote:
>
>> Thanks Daniel for your quick response.
>>
>> I did not use the parameter file of other organism when running Augustus.
>> I created the parameter file for the genome following their instructions.
>> There were multiple steps to train and run Augustus (Creating gene
>> structures for training AUGUSTUS with CEGMA => parameter file will be
>> created; Creating Hints for AUGUSTUS from ESTs/cDNA sequences;
>> Incorporating Illumina RNAseq into AUGUSTUS with GSNAP, etc.)
>> As I mentioned the reason why I ran Augustus separately, because Augustus
>> has not trained that genome (no parameter file exists). Otherwise I would
>> run Augustus inside MAKER.
>>
>> You suggested to use rm_gff option to specify RepeatMasker output (sure I
>> will convert them to .gff3 formatted files). Can I submit two RM .gff3
>> files, separated by comma?
>>
>> Anh-Dao
>>
>>
>> On 7/16/14 2:13 PM, "Daniel Ence" wrote:
>>
>>> Hi Anh-Dao,
>>>
>>> In the maker_opts.ctl file, there are options for est and protein
>>> evidence. You'll put all of your fasta est files together in a command
>>> separated list in the "est" option, and all of your fasta protein files
>>> in a command separated list for the "protein" option.
>>>
>>> You'll specify the SNAP and Genemark files in their respective options in
>>> the control file and pass the augustus and fgenesh predictions in the
>>> "pred_gff" option.
>>>
>>> If you have the RepeatMasker output in gff3 format you can give it to
>>> maker with the "rm_gff" option.
>>>
>>> If you've converted the cufflinks output to gff3, you can give it to
>>> maker with the "est_gff" option.
>>> I'm pretty sure Trinity only gives fasta
>>> output, so you would put that in the "est" option, along with all the
>>> other est fasta files.
>>>
>>> If Augustus isn't trained for your particular organism, then you can use
>>> another organism that augustus is already trained for. The list of
>>> species that augustus has parameter files for is in the README.txt that
>>> came with Augustus. I really recommend that you run Augustus from inside
>>> maker, because then you get all the benefits of maker passing ext-based
>>> hints to augustus at runtime, which can really improve Augustus'
>>> predictive ability.
>>>
>>> When you ran the augustus gene prediction separately, did you use another
>>> organism's parameter file?
>>>
>>> Thanks,
>>> Daniel
>>>
>>>
>>> On Jul 16, 2014, at 11:15 AM, Nguyen, Anh-Dao (NIH/NHGRI) [C]
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I would like to conduct a genome annotation and have the following data:
>>>> - Two separate RepeatMasker outputs (using -lib and -species options)
>>>> - ESTs and RACE (fasta)
>>>> - proteins (fasta)
>>>> - proteins of related organisms (fasta)
>>>> - SNAP's .hmm file (ran CEGMA, then used cegma2zff.pl to convert to ZFF
>>>> format, etc. )
>>>> - GeneMark's .hmm file (es.mod file from running gm_es.pl)
>>>> - FGENESH++ and Augustus gene predictions. I wrote scripts to convert
>>>> the outputs to .gff3 files. The reason why I ran Augustus gene
>>>> prediction separately, because the genome has never been trained for
>>>> Augustus.
>>>> - Cufflinks and Trinity from RNA-Seq
>>>>
>>>> Could you please let me know how can I specify parameters in the
>>>> maker_opts.ctl file?
>>>> Or do you have other suggestions to re-do the data listed above?
>>>>
>>>> Thanks.
>>>> Anh-Dao
>>>>
>>>> _______________________________________________
>>>> maker-devel mailing list
>>>> maker-devel at box290.bluehost.com
>>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>>
>>
>
>
>_______________________________________________
>maker-devel mailing list
>maker-devel at box290.bluehost.com
>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From nguyenan at mail.nih.gov Wed Jul 16 13:07:45 2014
From: nguyenan at mail.nih.gov (Nguyen, Anh-Dao (NIH/NHGRI) [C])
Date: Wed, 16 Jul 2014 19:07:45 +0000
Subject: [maker-devel] Maker_opts.ctl
In-Reply-To:
References: <5B00AEBB-D242-4C7C-A40F-C1BF3EC48C96@genetics.utah.edu>
Message-ID:

I will run Augustus and FGENESH++ inside of MAKER using the parameter files for Augustus.
I could also run RepeatMasker inside of MAKER. However, I ran RM using two options: -lib (de novo) and -species (known). I got ~45% repeats with the de novo library and ~4% repeats with the known library. As I understand it, RM inside of MAKER uses only the RepBase repeat library and the RepeatRunner protein database.

Anh-Dao


On 7/16/14 2:36 PM, "Carson Holt" wrote:

>When you ran Augustus separately, it should have created the parameters
>needed to run it. Now you should be able to run it inside of MAKER using
>the species name you just created.
>
>I'd also recommend letting MAKER run RepeatMasker for you rather than
>giving it the results as GFF3.
>
>--Carson
>
>
>On 7/16/14, 12:30 PM, "Nguyen, Anh-Dao (NIH/NHGRI) [C]"
> wrote:
>
>>Thanks Daniel for your quick response.
>>
>>I did not use the parameter file of other organism when running Augustus.
>>I created the parameter file for the genome following their instructions.
>>There were multiple steps to train and run Augustus (Creating gene
>>structures for training AUGUSTUS with CEGMA => parameter file will be
>>created; Creating Hints for AUGUSTUS from ESTs/cDNA sequences;
>>Incorporating Illumina RNAseq into AUGUSTUS with GSNAP, etc.)
>>As I mentioned the reason why I ran Augustus separately, because Augustus
>>has not trained that genome (no parameter file exists). Otherwise I would
>>run Augustus inside MAKER.
>>
>>You suggested to use rm_gff option to specify RepeatMasker output (sure I
>>will convert them to .gff3 formatted files). Can I submit two RM .gff3
>>files, separated by comma?
>>
>>Anh-Dao
>>
>>
>>On 7/16/14 2:13 PM, "Daniel Ence" wrote:
>>
>>>Hi Anh-Dao,
>>>
>>>In the maker_opts.ctl file, there are options for est and protein
>>>evidence. You'll put all of your fasta est files together in a command
>>>separated list in the "est" option, and all of your fasta protein files
>>>in a command separated list for the "protein" option.
>>>
>>>You'll specify the SNAP and Genemark files in their respective options in
>>>the control file and pass the augustus and fgenesh predictions in the
>>>"pred_gff" option.
>>>
>>>If you have the RepeatMasker output in gff3 format you can give it to
>>>maker with the "rm_gff" option.
>>>
>>>If you've converted the cufflinks output to gff3, you can give it to
>>>maker with the "est_gff" option. I'm pretty sure Trinity only gives fasta
>>>output, so you would put that in the "est" option, along with all the
>>>other est fasta files.
>>>
>>>If Augustus isn't trained for your particular organism, then you can use
>>>another organism that augustus is already trained for. The list of
>>>species that augustus has parameter files for is in the README.txt that
>>>came with Augustus. I really recommend that you run Augustus from inside
>>>maker, because then you get all the benefits of maker passing ext-based
>>>hints to augustus at runtime, which can really improve Augustus'
>>>predictive ability.
>>>
>>>When you ran the augustus gene prediction separately, did you use another
>>>organism's parameter file?
>>>
>>>Thanks,
>>>Daniel
>>>
>>>
>>>On Jul 16, 2014, at 11:15 AM, Nguyen, Anh-Dao (NIH/NHGRI) [C]
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I would like to conduct a genome annotation and have the following data:
>>>> - Two separate RepeatMasker outputs (using -lib and -species options)
>>>> - ESTs and RACE (fasta)
>>>> - proteins (fasta)
>>>> - proteins of related organisms (fasta)
>>>> - SNAP's .hmm file (ran CEGMA, then used cegma2zff.pl to convert to ZFF format, etc. )
>>>> - GeneMark's .hmm file (es.mod file from running gm_es.pl)
>>>> - FGENESH++ and Augustus gene predictions. I wrote scripts to convert
>>>>the outputs to .gff3 files. The reason why I ran Augustus gene
>>>>prediction separately, because the genome has never been trained for
>>>>Augustus.
>>>> - Cufflinks and Trinity from RNA-Seq
>>>>
>>>> Could you please let me know how can I specify parameters in the
>>>>maker_opts.ctl file?
>>>> Or do you have other suggestions to re-do the data listed above?
>>>>
>>>> Thanks.
>>>> Anh-Dao
>>>>
>>>> _______________________________________________
>>>> maker-devel mailing list
>>>> maker-devel at box290.bluehost.com
>>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>>
>>
>>
>>_______________________________________________
>>maker-devel mailing list
>>maker-devel at box290.bluehost.com
>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>
>

From nguyenan at mail.nih.gov Wed Jul 16 13:16:43 2014
From: nguyenan at mail.nih.gov (Nguyen, Anh-Dao (NIH/NHGRI) [C])
Date: Wed, 16 Jul 2014 19:16:43 +0000
Subject: [maker-devel] Maker_opts.ctl
In-Reply-To:
References: <5B00AEBB-D242-4C7C-A40F-C1BF3EC48C96@genetics.utah.edu>
Message-ID:

I forgot to mention that I ran RepeatModeler on the genome first, then submitted the RepeatModeler output to RepeatMasker using the -lib option (de novo).
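The two repeat-masking passes described in this message could be laid out roughly as follows. This is only a sketch under assumptions: the Python helper assembles the command lines without executing anything, the file and database names are hypothetical, `consensi.fa.classified` is assumed to be the library RepeatModeler wrote, the `-gff` and `-dir` flags are additions for illustration, and `metazoa` is the -species value given in this message.

```python
import shlex


def repeat_commands(genome, db_name, denovo_lib, species):
    """Return the command lines for a de novo pass and a known-repeat pass.

    Nothing is executed; the commands are only assembled so they can be
    reviewed or written to a job script.
    """
    return [
        ["BuildDatabase", "-name", db_name, genome],
        ["RepeatModeler", "-database", db_name],
        ["RepeatMasker", "-lib", denovo_lib, "-gff", "-dir", "rm_denovo", genome],
        ["RepeatMasker", "-species", species, "-gff", "-dir", "rm_known", genome],
    ]


commands = repeat_commands(
    genome="genome.fasta",                # hypothetical assembly file
    db_name="my_genome",                  # hypothetical database name
    denovo_lib="consensi.fa.classified",  # assumed RepeatModeler output name
    species="metazoa",
)
for argv in commands:
    print(shlex.join(argv))
```

Keeping the two passes in separate output directories keeps the de novo and known-repeat results apart, so each can be converted to GFF3 on its own.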
For the -species option, I used metazoa.

Anh-Dao

On 7/16/14 3:07 PM, "Nguyen, Anh-Dao (NIH/NHGRI) [C]" wrote:

>I will run Augustus and FGENESH++ inside of MAKER using the parameter
>files for Augustus.
>I could also run RepeatMasker inside of MAKER. However, I ran RM using two
>options: -lib (de novo) and -species (known). I got ~ 45% repeats via de
>novo and ~ 4% repeats via known options. As I understood, RM inside of
>MAKER uses only RepBase repeat library and RepeatRunner protein database.
>
>Anh-Dao
>
>
>On 7/16/14 2:36 PM, "Carson Holt" wrote:
>
>>When you ran Augustus separately, it should have created the parameters
>>needed to run it. Now you should be able to run it inside of MAKER using
>>the species name you just created.
>>
>>I'd also recommend letting MAKER run RepeatMasker for you rather than
>>giving it the results as GFF3.
>>
>>--Carson
>>
>>
>>On 7/16/14, 12:30 PM, "Nguyen, Anh-Dao (NIH/NHGRI) [C]" wrote:
>>
>>>Thanks Daniel for your quick response.
>>>
>>>I did not use the parameter file of other organism when running Augustus.
>>>I created the parameter file for the genome following their instructions.
>>>There were multiple steps to train and run Augustus (Creating gene
>>>structures for training AUGUSTUS with CEGMA => parameter file will be
>>>created; Creating Hints for AUGUSTUS from ESTs/cDNA sequences;
>>>Incorporating Illumina RNAseq into AUGUSTUS with GSNAP, etc.)
>>>As I mentioned the reason why I ran Augustus separately, because Augustus
>>>has not trained that genome (no parameter file exists). Otherwise I would
>>>run Augustus inside MAKER.
>>>
>>>You suggested to use rm_gff option to specify RepeatMasker output (sure I
>>>will convert them to .gff3 formatted files). Can I submit two RM .gff3
>>>files, separated by comma?
>>>
>>>Anh-Dao

From carsonhh at gmail.com Wed Jul 16 13:17:31 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 16 Jul 2014 13:17:31 -0600
Subject: [maker-devel] Maker_opts.ctl
In-Reply-To:
References: <5B00AEBB-D242-4C7C-A40F-C1BF3EC48C96@genetics.utah.edu>
Message-ID:

No. You can provide both to MAKER. The options are model_org= and rmlib=.
By letting MAKER handle repeat masking it will differentiate repeat types
and use soft masking for some and hard masking for others. This increases
sensitivity of evidence alignments while still maintaining specificity.

--Carson

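
To make the soft/hard masking distinction mentioned in Carson's reply concrete, here is a minimal Python sketch. It is not MAKER's code. Soft masking lowercases a repeat so the bases are still present, while hard masking replaces them with N so nothing can align there. The function name mask_repeats, the 0-based half-open intervals and the toy sequence are assumptions made up for this illustration; which repeat types MAKER soft-masks and which it hard-masks is decided inside MAKER and is not modeled here.

```python
def mask_repeats(seq, intervals, hard=False):
    """Mask repeat intervals in a DNA sequence.

    intervals are 0-based, half-open (start, end) pairs (an assumption of
    this sketch). Soft masking lowercases the bases; hard masking replaces
    them with 'N'.
    """
    bases = list(seq.upper())
    for start, end in intervals:
        # Clamp to the sequence so out-of-range intervals are harmless.
        for i in range(max(start, 0), min(end, len(bases))):
            bases[i] = "N" if hard else bases[i].lower()
    return "".join(bases)


seq = "ACGTACGTACGT"
repeat = [(4, 8)]  # toy repeat covering the second ACGT

soft = mask_repeats(seq, repeat)             # bases kept, case lowered
hard = mask_repeats(seq, repeat, hard=True)  # bases replaced by N
print(soft)
print(hard)
```

Both results keep the original length, so coordinates of evidence alignments stay valid in either case.
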
From nguyenan at mail.nih.gov Wed Jul 16 13:28:33 2014
From: nguyenan at mail.nih.gov (Nguyen, Anh-Dao (NIH/NHGRI) [C])
Date: Wed, 16 Jul 2014 19:28:33 +0000
Subject: [maker-devel] Maker_opts.ctl
In-Reply-To:
References: <5B00AEBB-D242-4C7C-A40F-C1BF3EC48C96@genetics.utah.edu>
Message-ID:

By default, model_org=all. Can I use the de novo repeat library predicted
by RepeatModeler for the rmlib option?

Anh-Dao

From carsonhh at gmail.com Wed Jul 16 13:32:02 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 16 Jul 2014 13:32:02 -0600
Subject: [maker-devel] Maker_opts.ctl
In-Reply-To:
References: <5B00AEBB-D242-4C7C-A40F-C1BF3EC48C96@genetics.utah.edu>
Message-ID:

'all' will use the whole of RepBase, or you can do 'metazoa' like your
previous run. Then provide the RepeatModeler file to rmlib=

--Carson

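
A rough sketch of the two settings Carson suggests, written as a small Python helper that rewrites key=value lines in control-file text and keeps any trailing #comment. Only the option names model_org and rmlib and the value metazoa come from the thread. The helper set_ctl_option, the sample text with its comments, and the library file name consensi.fa.classified are placeholders assumed for this example, so substitute the file your own RepeatModeler run produced.

```python
import re


def set_ctl_option(ctl_text, key, value):
    """Return ctl_text with the line 'key=...' set to value.

    Hypothetical helper: a trailing '#comment' on the line is preserved.
    Raises KeyError if no line starts with 'key='.
    """
    pattern = re.compile(
        rf"^({re.escape(key)}=)([^#\n]*?)([ \t]*(?:#.*)?)$",
        flags=re.MULTILINE,
    )
    new_text, count = pattern.subn(
        lambda m: m.group(1) + value + m.group(3), ctl_text, count=1
    )
    if count == 0:
        raise KeyError(key)
    return new_text


# Made-up stand-in for the relevant lines of a control file.
sample = (
    "model_org=all #RepBase model organism\n"
    "rmlib= #custom repeat library (fasta)\n"
)

updated = set_ctl_option(sample, "model_org", "metazoa")
updated = set_ctl_option(updated, "rmlib", "consensi.fa.classified")
print(updated)
```

Leaving model_org at all instead would simply mean skipping the first call.
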
From nguyenan at mail.nih.gov Thu Jul 17 08:19:34 2014
From: nguyenan at mail.nih.gov (Nguyen, Anh-Dao (NIH/NHGRI) [C])
Date: Thu, 17 Jul 2014 14:19:34 +0000
Subject: [maker-devel] Maker_opts.ctl
In-Reply-To:
References: <5B00AEBB-D242-4C7C-A40F-C1BF3EC48C96@genetics.utah.edu>
Message-ID:

I am not sure which fgenesh executable file I should use.

fgenesh= #location of fgenesh executable

When I run FGENESH++, I need to run the run_pipe.pl script, and I also need
to specify a list of other executable programs (such as ppd, ppdn+, etc.).

Anh-Dao

From carsonhh at gmail.com Fri Jul 18 11:04:09 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 18 Jul 2014 11:04:09 -0600
Subject: [maker-devel] Maker_opts.ctl
In-Reply-To:
References: <5B00AEBB-D242-4C7C-A40F-C1BF3EC48C96@genetics.utah.edu>
Message-ID:

It should just be 'fgenesh'. If it's not there you can still just give the
GFF3.

--Carson

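
Carson's two alternatives can be sketched as a simple fallback: point fgenesh= at the executable if one named fgenesh is found on the PATH, otherwise hand MAKER the already converted predictions as GFF3. This is a hypothetical Python helper, not part of MAKER. The function name choose_fgenesh_input and the file name fgenesh_predictions.gff3 are invented here, and using pred_gff for the GFF3 follows Daniel's earlier advice in this thread.

```python
import shutil


def choose_fgenesh_input(gff3_path, exe_name="fgenesh"):
    """Pick the control-file setting to use for FGENESH results.

    If an executable called exe_name is on the PATH, return the fgenesh
    setting pointing at it; otherwise fall back to passing the precomputed
    predictions through pred_gff.
    """
    exe = shutil.which(exe_name)
    if exe is not None:
        return ("fgenesh", exe)
    return ("pred_gff", gff3_path)


# fgenesh_predictions.gff3 is a placeholder for your converted output.
key, value = choose_fgenesh_input("fgenesh_predictions.gff3")
print(f"{key}={value}")
```
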
>>>>>>>o >>>>>>>r >>>>>>>g >>>>>> >>>>>> >>>>> >>>> >>>> >>> >> >> >

From jp.oeyen at uni-bonn.de Mon Jul 28 06:22:25 2014
From: jp.oeyen at uni-bonn.de (Jan Philip Oeyen)
Date: Mon, 28 Jul 2014 14:22:25 +0200
Subject: [maker-devel] Forks.pm error when running maker with dsindex
Message-ID:

Hi all,

we are currently having some unexpected errors when running maker on a genome which is split into several parts. Our cluster admin reported the following error message:

Argument "ALRM" isn't numeric in exit at /share/scientific_bin/perlmodules/lib/site_perl/5.14.2/x86_64-linux-thread-multi/forks.pm line 2188.
SIGTERM received
SIGTERM received
SIGTERM received

We were using maker with the '-g' option on a single genome which is split into 20 parts, where 19 parts are equally large and the last contains about 20 sequences more. After that we ran Maker using dsindex to clean up the output. We are currently using maker v2.31 on 4 threads and forks v0.34.

If any further info is needed to clarify the problem, please let me know and I will provide as much as possible.

Thank you for your help!

Best regards,

Jan Philip Oeyen
ZFMK // ZMB // University of Bonn

From mphoeppner at gmail.com Wed Jul 30 04:44:36 2014
From: mphoeppner at gmail.com (Marc Höppner)
Date: Wed, 30 Jul 2014 12:44:36 +0200
Subject: [maker-devel] Maker GFF output with features of 0 length
Message-ID: <5C45F418-018B-4ACC-B682-E5659DB7F102@gmail.com>

Hi,

I've - more by accident - found that many of the gene builds I have generated with Maker (2.31.3) contain features with identical start and stop positions. For example:

scaffold_2927 maker CDS 13013 13013 . + 1 ID=maker-scaffold_2927-augustus-gene-0.8-mRNA-1:cds;Parent=maker-scaffold_2927-augustus-gene-0.8-mRNA-1

This occurs seemingly randomly for all sorts of feature types and I have only seen this when running Maker on full assemblies.
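A quick way to count such records is sketched below. This is only an illustration, not part of MAKER: the function name is made up, the second example record is invented for contrast, and the input is assumed to be ordinary tab-separated GFF3. Note that GFF3 coordinates are 1-based and inclusive, so start == end strictly describes a feature one base long rather than one of zero length.

```python
def single_base_features(gff_lines):
    """Return (line_number, seqid, feature_type, position) for features with start == end."""
    hits = []
    for lineno, line in enumerate(gff_lines, start=1):
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # an embedded FASTA section follows; no more feature lines
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a well-formed nine-column feature line
        seqid, ftype, start, end = cols[0], cols[2], cols[3], cols[4]
        if start.isdigit() and start == end:
            hits.append((lineno, seqid, ftype, int(start)))
    return hits


# First record: the CDS line quoted in the message. Second record: invented for contrast.
example = [
    "##gff-version 3",
    "\t".join(["scaffold_2927", "maker", "CDS", "13013", "13013", ".", "+", "1",
               "ID=maker-scaffold_2927-augustus-gene-0.8-mRNA-1:cds;"
               "Parent=maker-scaffold_2927-augustus-gene-0.8-mRNA-1"]),
    "\t".join(["scaffold_2927", "maker", "exon", "12800", "13013", ".", "+", ".",
               "ID=hypothetical-exon;Parent=maker-scaffold_2927-augustus-gene-0.8-mRNA-1"]),
]

hits = single_base_features(example)
print(hits)
```

To scan a real file, the same function can be given an open file handle, since it only iterates over lines.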
Before I start turning over every stone, any ideas about possible explanations for this phenomenon? Is this likely some MPI-related communication issue, or NFS problems with syncing data? Maker runs fine on our system, but that doesn't mean that there aren't any cryptic issues that only rear their heads on these occasions.

Regarding the frequency, out of 450.000 GFF lines, 270 were affected in the case that I looked into the most. So it is pretty rare, but still...

I am currently using Maker with openmpi-1.7.4 and the file system is mounted over NFS4 and IPoIB. I now switched to Maker 2.31.6, but have no strong reason to suspect that this will make a difference.

Regards,

Marc

From carsonhh at gmail.com Thu Jul 3 08:12:07 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 03 Jul 2014 08:12:07 -0600
Subject: [maker-devel] Some questions regarding ab-initio training
In-Reply-To:
References: <1CD4559D-7A9D-4F8C-92F4-F5228F4E23B8@bils.se> <7FFAF2D2-3D32-40CF-8120-E6F858F74F1C@bils.se>
Message-ID:

The hints used by MAKER are CDSpart, exonpart, intronpart, and intron. You can play around with the extrinsic evidence configuration file if you want, but it's really not well documented, so I won't be able to provide much support.

Thanks,
Carson

On 7/1/14, 6:31 AM, "Marc Höppner" wrote:

>Hi,
>
>sorry for resurrecting this topic. The issue was about the use of
>ab-initio predictions and artefacts in the final maker gene builds.
>
>I think one potential issue that hasn't been discussed here concerns
>Maker's use of the extrinsic config file when running Augustus. This file
>controls the "weights" of different types of hints when running Augustus.
>I don't think it is made clear anywhere which extrinsic config file Maker
>reads (from the logs it seems to be extrinsic.MPE.cfg).
Nor is it >suggested that it would be useful to manipulate this file to improve >augustus performance (and in extension Makers performance). Finally, I am >not entirely sure which sorts of hints Maker creates for Augustus and to >which hint categories these would belong to (i.e. it makes no sense to >tweak the intronpart malus factor if Maker does not create such hints). >Perhaps it would be good to elaborate on that in the Maker documentation, >since it seems to be quite relevant for obtaining better results. Or does >such an explanation already exist somewhere? > > >/Marc > >Marc P. Hoeppner, PhD >Team Leader >BILS Genome Annotation Platform >Department for Medical Biochemistry and Microbiology >Uppsala University, Sweden >marc.hoeppner at bils.se > >On 05 Jun 2014, at 20:28, Carson Holt wrote: > >> One thing you might want to try is adding another predictor like SNAP >> together with Augustus and then process the MAKER results using EVM. We >> actually have a collaboration with the EVM group to produce a MAKER-EVM >> pipeline (MAKER 3.0). EVM will produce consensus models using the >> predictions and the evidence in the MAKER GFF3 which are generally >>better >> than just SNAP and Augustus with hints, so it might be able to remove >>some >> of the artifacts you are worried about. >> >> --Carson >> >> >> >> On 6/5/14, 12:24 PM, "Carson Holt" wrote: >> >>> Like I said. The predictors do the best they can, so there is probably >>> something about the regions to make the CDS, reading frame, or >>>start/stop >>> work that requires exons to be dropped or added. In several ant >>>genomes >>> we saw something like this caused by incorrect homopolymers in the >>> assembly which force the predictor to slightly alter the intron/exon >>> structure because otherwise the reading frame made no sense (the EST >>> alignments were used to confirmed that the assembly homopolymers were >>> incorrect - lots of bad single base pair deletions). 
>>> >>> The way hints work is as follows. At the simplest level ab initio >>> predictors are calculating the probability of being in different states >>> (intergenic, intron, exon, etc.). The hints increase the probability >>>of >>> being in the intron state where MAKER gives an intron hint or being in >>>an >>> exon/CDS state when MAKER gives an exon/CDS hint. So this bends the >>> likelihood of the ab intio gene predictor to call something similar in >>> structure to the evidence overlapping it. That being said, if there is >>> strong enough signal from something else in the sequence or my hints >>>won't >>> work with the splice sites in the region or the reading frame breaks, >>>then >>> no amount of hints can force augustus to make that model. >>> >>> --Carson >>> >>> >>> >>> On 6/5/14, 2:15 AM, "Marc H?ppner" wrote: >>> >>>> Hi, >>>> >>>> thanks for the feedback. I spent some more time on this and am still >>>> somewhat unsatisfied with the whole thing? >>>> >>>> A few comments: >>>> >>>> I quite frequently see augustus and in extension Maker including exons >>>> that are not supported by EST/Protein evidence and are not critical >>>>for >>>> the gene model (i.e. I can take them out and still get a proper CDS). >>>> Maybe I don?t know enough about how Maker creates hints and more >>>> importantly what role these hints play for augustus, but I cannot >>>>really >>>> see a great effect (any, really) on the gene models even if both ESTs >>>>and >>>> proteins contradict an augustus gene model and the surplus exon is >>>> non-essential. >>>> >>>> (all evidence is provided as fasta files, protein2genome and >>>>est2genome >>>> are set to 0) >>>> >>>> As for the repeat library, I suppose this is a critical point. I am >>>>using >>>> repeats from a closely related species via Repeatmasker, modelled and >>>> filtered repeats from RepeatModeler and repeats derived from a >>>> high-coverage 454 data set. Not sure what else I can do to improve >>>>that. 
>>>> >>>> As for evidence, I am using the curated reference proteome from a >>>>related >>>> species (<5 Mio years), unprot_swissprot and high-quality ESTs + 454 >>>> reads. I don?t think it gets a whole lot better, in terms of what data >>>> can be used. >>>> >>>> So in summary, I just don?t get where I want to using Augustus and >>>>Maker >>>> - specifically, the gene models are full of weird, unsupported >>>>artefacts >>>> despite manually curating > 850 models for training. I suppose I was >>>> hoping for some secret trick to improve on this - but I guess there is >>>> none? Actually, if I only do a pure evidence build (seeing that my >>>>input >>>> data is very high quality), it looks better - which sort of goes >>>>against >>>> the premise of Maker :/ >>>> >>>> Regards, >>>> >>>> Marc >>>> >>>> >>>> >>>> >>>> Marc P. Hoeppner, PhD >>>> Team Leader >>>> Department for Medical Biochemistry and Microbiology >>>> Uppsala University, Sweden >>>> marc.hoeppner at bils.se >>>> >>>> On 27 May 2014, at 17:25, Carson Holt wrote: >>>> >>>>> Extra exons can be required for predictors to make sense of a region >>>>> (they >>>>> do the best they can). This can be due to imperfect assemblies or >>>>> repeats. For plants the repeat database is the the one thing that >>>>>will >>>>> most affect the annotation quality. You may need to spend some time >>>>> building the best repeat library you can. The repeat library is the >>>>> next >>>>> most important thing next to training the predictor, because they >>>>> confuse >>>>> the predictor (sometimes a lot) causing it to behave oddly in those >>>>> regions (because repeats do encode real protein and protein >>>>>fragments). >>>>> Also when running now with MAKER make sure to include the entire >>>>> proteome >>>>> of a related species and not just UniProt, and you will get better >>>>> performance. 
Now that you have Augustus trained, using it inside of >>>>> MAKER >>>>> with an improved repeat library and additional protein evidence >>>>>should >>>>> give it the feedback that will allow it to perform better than it >>>>>would >>>>> with just naked ab initio prediction. >>>>> >>>>> Thanks, >>>>> Carson >>>>> >>>>> >>>>> On 5/27/14, 2:12 AM, "Marc H?ppner" wrote: >>>>> >>>>>> Hi, >>>>>> >>>>>> I wanted to get some feedback regarding the training of ab-initio >>>>>>gene >>>>>> finders - it?s not strictly Maker related, but I suppose there are >>>>>> many >>>>>> people on this list that have encountered and solved this issue in >>>>>>one >>>>>> way or another. >>>>>> >>>>>> Specifically, I am trying to train Augustus (and possibly SNAP) for >>>>>>a >>>>>> plant genome. This has always been a very frustrating process for >>>>>>me, >>>>>> but >>>>>> while I have a better idea now how to do it, I still don?t get the >>>>>> sort >>>>>> of accuracy that I am hoping for. A quick run-through of my process; >>>>>> >>>>>> Evidence build with maker on level 1 and 2 proteins from Uniprot + >>>>>> Sanger-sequenced EST data >>>>>> >>>>>> Filtered for Models with an AED <= 0.3 >>>>>> >>>>>> Loaded that into WebApollo, together with an existing reference >>>>>> annotation and the evidence tracks >>>>>> >>>>>> Manually curated/selected 750 gene models using the following rules: >>>>>> - Must have start/stop codon >>>>>> - Most have as many exons as possible >>>>>> - Must agree with evidence >>>>>> - Must be >= 2kb part from other gene models (provided as flanking >>>>>> regions for augustus to train intergenic sequence) >>>>>> >>>>>> From these models, I created a GBK file, split it into 650 (train) >>>>>> and >>>>>> 100 (test) models and created a new profile using the documented >>>>>> procedure. 
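The AED filter mentioned in the run-through above (models with AED <= 0.3) could be sketched roughly as follows. This is an illustration only: it assumes the mRNA features carry an `_AED` attribute in column 9, as MAKER's GFF3 output normally does, and the function names, IDs and records in the example are invented.

```python
def parse_attributes(field):
    """Split a GFF3 attribute column (key=value;key=value) into a dict."""
    attrs = {}
    for pair in field.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key] = value
    return attrs


def mrna_ids_passing_aed(gff_lines, max_aed=0.3):
    """Return IDs of mRNA features whose _AED value is at or below max_aed."""
    keep = []
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue  # only transcript-level records carry the score used here
        attrs = parse_attributes(cols[8])
        if "_AED" in attrs and float(attrs["_AED"]) <= max_aed:
            keep.append(attrs.get("ID"))
    return keep


# Invented records for illustration; real MAKER lines carry more attributes.
example = [
    "##gff-version 3",
    "\t".join(["scaffold_1", "maker", "gene", "100", "900", ".", "+", ".", "ID=gene-A"]),
    "\t".join(["scaffold_1", "maker", "mRNA", "100", "900", ".", "+", ".",
               "ID=mRNA-A;Parent=gene-A;_AED=0.12"]),
    "\t".join(["scaffold_1", "maker", "mRNA", "2000", "3500", ".", "-", ".",
               "ID=mRNA-B;Parent=gene-B;_AED=0.45"]),
]

passing = mrna_ids_passing_aed(example)
print(passing)
```

The threshold is a parameter so that stricter or looser cutoffs can be compared before settling on a training set.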
>>>>>> >>>>>> But: >>>>>> >>>>>> While the naked ab-init models created through maker get a lot of >>>>>> genes >>>>>> ?sort of right?, I still see too many issues to be really satisfied. >>>>>> Problems include: >>>>>> >>>>>> - random exon calls which are not supported by any line of evidence >>>>>> (~1 >>>>>> per gene model, I would guess) >>>>>> - poor congruency with some gene models (especially ones not used >>>>>>for >>>>>> training/testing) >>>>>> >>>>>> Is there any best-practice guide on how to improve this? The >>>>>>Augustus >>>>>> website is unfortunately quite poor on detail? My impression so far >>>>>>is >>>>>> that ramping up the number of training models isn?t really doing too >>>>>> much >>>>>> beyond a certain point (tried 400, 500 and 750). >>>>>> >>>>>> Regards, >>>>>> >>>>>> Marc >>>>>> >>>>>> >>>>>> Marc P. Hoeppner, PhD >>>>>> Team Leader >>>>>> BILS Genome Annotation Platform >>>>>> Department for Medical Biochemistry and Microbiology >>>>>> Uppsala University, Sweden >>>>>> marc.hoeppner at bils.se >>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> maker-devel mailing list >>>>>> maker-devel at box290.bluehost.com >>>>>> >>>>>> >>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.o >>>>>>rg >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> maker-devel mailing list >>>>> maker-devel at box290.bluehost.com >>>>> >>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.or >>>>>g >>>> >>> >>> >> >> >> >> _______________________________________________ >> maker-devel mailing list >> maker-devel at box290.bluehost.com >> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > From marc.hoeppner at bils.se Tue Jul 1 06:31:33 2014 From: marc.hoeppner at bils.se (=?windows-1252?Q?Marc_H=F6ppner?=) Date: Tue, 1 Jul 2014 14:31:33 +0200 Subject: [maker-devel] Some questions regarding ab-initio training In-Reply-To: References: 
<1CD4559D-7A9D-4F8C-92F4-F5228F4E23B8@bils.se> <7FFAF2D2-3D32-40CF-8120-E6F858F74F1C@bils.se>
Message-ID:

Hi,

sorry for resurrecting this topic. The issue was about the use of ab-initio predictions and artefacts in the final maker gene builds.

I think one potential issue that hasn't been discussed here concerns Maker's use of the extrinsic config file when running Augustus. This file controls the "weights" of different types of hints when running Augustus. I don't think it is made clear anywhere which extrinsic config file Maker reads (from the logs it seems to be extrinsic.MPE.cfg). Nor is it suggested that it would be useful to manipulate this file to improve augustus performance (and by extension Maker's performance). Finally, I am not entirely sure which sorts of hints Maker creates for Augustus and to which hint categories these would belong (i.e. it makes no sense to tweak the intronpart malus factor if Maker does not create such hints). Perhaps it would be good to elaborate on that in the Maker documentation, since it seems to be quite relevant for obtaining better results. Or does such an explanation already exist somewhere?

/Marc

Marc P. Hoeppner, PhD
Team Leader
BILS Genome Annotation Platform
Department for Medical Biochemistry and Microbiology
Uppsala University, Sweden
marc.hoeppner at bils.se

On 05 Jun 2014, at 20:28, Carson Holt wrote:

> One thing you might want to try is adding another predictor like SNAP
> together with Augustus and then process the MAKER results using EVM. We
> actually have a collaboration with the EVM group to produce a MAKER-EVM
> pipeline (MAKER 3.0). EVM will produce consensus models using the
> predictions and the evidence in the MAKER GFF3 which are generally better
> than just SNAP and Augustus with hints, so it might be able to remove some
> of the artifacts you are worried about.
>
> --Carson
>
>
>
> On 6/5/14, 12:24 PM, "Carson Holt" wrote:
>
>> Like I said.
The predictors do the best they can, so there is probably >> something about the regions to make the CDS, reading frame, or start/stop >> work that requires exons to be dropped or added. In several ant genomes >> we saw something like this caused by incorrect homopolymers in the >> assembly which force the predictor to slightly alter the intron/exon >> structure because otherwise the reading frame made no sense (the EST >> alignments were used to confirmed that the assembly homopolymers were >> incorrect - lots of bad single base pair deletions). >> >> The way hints work is as follows. At the simplest level ab initio >> predictors are calculating the probability of being in different states >> (intergenic, intron, exon, etc.). The hints increase the probability of >> being in the intron state where MAKER gives an intron hint or being in an >> exon/CDS state when MAKER gives an exon/CDS hint. So this bends the >> likelihood of the ab intio gene predictor to call something similar in >> structure to the evidence overlapping it. That being said, if there is >> strong enough signal from something else in the sequence or my hints won't >> work with the splice sites in the region or the reading frame breaks, then >> no amount of hints can force augustus to make that model. >> >> --Carson >> >> >> >> On 6/5/14, 2:15 AM, "Marc H?ppner" wrote: >> >>> Hi, >>> >>> thanks for the feedback. I spent some more time on this and am still >>> somewhat unsatisfied with the whole thing? >>> >>> A few comments: >>> >>> I quite frequently see augustus and in extension Maker including exons >>> that are not supported by EST/Protein evidence and are not critical for >>> the gene model (i.e. I can take them out and still get a proper CDS). 
>>> Maybe I don?t know enough about how Maker creates hints and more >>> importantly what role these hints play for augustus, but I cannot really >>> see a great effect (any, really) on the gene models even if both ESTs and >>> proteins contradict an augustus gene model and the surplus exon is >>> non-essential. >>> >>> (all evidence is provided as fasta files, protein2genome and est2genome >>> are set to 0) >>> >>> As for the repeat library, I suppose this is a critical point. I am using >>> repeats from a closely related species via Repeatmasker, modelled and >>> filtered repeats from RepeatModeler and repeats derived from a >>> high-coverage 454 data set. Not sure what else I can do to improve that. >>> >>> As for evidence, I am using the curated reference proteome from a related >>> species (<5 Mio years), unprot_swissprot and high-quality ESTs + 454 >>> reads. I don?t think it gets a whole lot better, in terms of what data >>> can be used. >>> >>> So in summary, I just don?t get where I want to using Augustus and Maker >>> - specifically, the gene models are full of weird, unsupported artefacts >>> despite manually curating > 850 models for training. I suppose I was >>> hoping for some secret trick to improve on this - but I guess there is >>> none? Actually, if I only do a pure evidence build (seeing that my input >>> data is very high quality), it looks better - which sort of goes against >>> the premise of Maker :/ >>> >>> Regards, >>> >>> Marc >>> >>> >>> >>> >>> Marc P. Hoeppner, PhD >>> Team Leader >>> Department for Medical Biochemistry and Microbiology >>> Uppsala University, Sweden >>> marc.hoeppner at bils.se >>> >>> On 27 May 2014, at 17:25, Carson Holt wrote: >>> >>>> Extra exons can be required for predictors to make sense of a region >>>> (they >>>> do the best they can). This can be due to imperfect assemblies or >>>> repeats. For plants the repeat database is the the one thing that will >>>> most affect the annotation quality. 
You may need to spend some time >>>> building the best repeat library you can. The repeat library is the >>>> next >>>> most important thing next to training the predictor, because they >>>> confuse >>>> the predictor (sometimes a lot) causing it to behave oddly in those >>>> regions (because repeats do encode real protein and protein fragments). >>>> Also when running now with MAKER make sure to include the entire >>>> proteome >>>> of a related species and not just UniProt, and you will get better >>>> performance. Now that you have Augustus trained, using it inside of >>>> MAKER >>>> with an improved repeat library and additional protein evidence should >>>> give it the feedback that will allow it to perform better than it would >>>> with just naked ab initio prediction. >>>> >>>> Thanks, >>>> Carson >>>> >>>> >>>> On 5/27/14, 2:12 AM, "Marc H?ppner" wrote: >>>> >>>>> Hi, >>>>> >>>>> I wanted to get some feedback regarding the training of ab-initio gene >>>>> finders - it?s not strictly Maker related, but I suppose there are >>>>> many >>>>> people on this list that have encountered and solved this issue in one >>>>> way or another. >>>>> >>>>> Specifically, I am trying to train Augustus (and possibly SNAP) for a >>>>> plant genome. This has always been a very frustrating process for me, >>>>> but >>>>> while I have a better idea now how to do it, I still don?t get the >>>>> sort >>>>> of accuracy that I am hoping for. 
A quick run-through of my process; >>>>> >>>>> Evidence build with maker on level 1 and 2 proteins from Uniprot + >>>>> Sanger-sequenced EST data >>>>> >>>>> Filtered for Models with an AED <= 0.3 >>>>> >>>>> Loaded that into WebApollo, together with an existing reference >>>>> annotation and the evidence tracks >>>>> >>>>> Manually curated/selected 750 gene models using the following rules: >>>>> - Must have start/stop codon >>>>> - Most have as many exons as possible >>>>> - Must agree with evidence >>>>> - Must be >= 2kb part from other gene models (provided as flanking >>>>> regions for augustus to train intergenic sequence) >>>>> >>>>> From these models, I created a GBK file, split it into 650 (train) >>>>> and >>>>> 100 (test) models and created a new profile using the documented >>>>> procedure. >>>>> >>>>> But: >>>>> >>>>> While the naked ab-init models created through maker get a lot of >>>>> genes >>>>> ?sort of right?, I still see too many issues to be really satisfied. >>>>> Problems include: >>>>> >>>>> - random exon calls which are not supported by any line of evidence >>>>> (~1 >>>>> per gene model, I would guess) >>>>> - poor congruency with some gene models (especially ones not used for >>>>> training/testing) >>>>> >>>>> Is there any best-practice guide on how to improve this? The Augustus >>>>> website is unfortunately quite poor on detail? My impression so far is >>>>> that ramping up the number of training models isn?t really doing too >>>>> much >>>>> beyond a certain point (tried 400, 500 and 750). >>>>> >>>>> Regards, >>>>> >>>>> Marc >>>>> >>>>> >>>>> Marc P. 
Hoeppner, PhD >>>>> Team Leader >>>>> BILS Genome Annotation Platform >>>>> Department for Medical Biochemistry and Microbiology >>>>> Uppsala University, Sweden >>>>> marc.hoeppner at bils.se >>>>> >>>>> >>>>> _______________________________________________ >>>>> maker-devel mailing list >>>>> maker-devel at box290.bluehost.com >>>>> >>>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org >>>> >>>> >>>> >>>> _______________________________________________ >>>> maker-devel mailing list >>>> maker-devel at box290.bluehost.com >>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org >>> >> >> > > > > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org From rajesh.bommareddy at tu-harburg.de Thu Jul 3 08:45:59 2014 From: rajesh.bommareddy at tu-harburg.de (Rajesh Reddy Bommareddy) Date: Thu, 03 Jul 2014 16:45:59 +0200 Subject: [maker-devel] Maker output Message-ID: <53B56CA7.80108@tu-harburg.de> Dear Maker group I have run the example files provided with maker. But i am unable to understand the output. Where can i find the information about exons, CDS, protein sequence of the predicted CDS or mRNA and the predicted protein name for each contig? Thanks and Regards -- M.Sc. Rajesh Reddy Bommareddy Institute of Bioprocess and Biosystems Engineering Hamburg University of Technology(TUHH) Denikestrasse 15, 20171 Hamburg GERMANY. 
e.mail: rajesh.bommareddy at tu-harburg.de
Phone:+4940428784011
Mobile:+4917663673522

From carsonhh at gmail.com Thu Jul 3 08:51:57 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 03 Jul 2014 08:51:57 -0600
Subject: [maker-devel] Maker output
In-Reply-To: <53B56CA7.80108@tu-harburg.de>
References: <53B56CA7.80108@tu-harburg.de>
Message-ID:

See the MAKER 2014 GMOD tutorial --> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014

Also watch accompanying video --> http://youtu.be/uA96tSSaqLk

Results will be in GFF3 and FASTA format. The GFF3 file contains the location of structure relative to the assembly (exon/CDS/UTR). The FASTA file contains the sequence (transcript/protein). There will be separate files for each contig. Use gff3_merge and fasta_merge to generate merged genome wide GFF3 and FASTA files.

An explanation of GFF3 format is here --> http://www.sequenceontology.org/gff3.shtml

Thanks,
Carson

On 7/3/14, 8:45 AM, "Rajesh Reddy Bommareddy" wrote:

>Dear Maker group
>
>I have run the example files provided with maker. But i am unable to
>understand the output. Where can i find the information about exons,
>CDS, protein sequence of the predicted CDS or mRNA and the predicted
>protein name for each contig?
>
>
>Thanks and Regards
>--
>M.Sc. Rajesh Reddy Bommareddy
>Institute of Bioprocess and Biosystems Engineering
>Hamburg University of Technology(TUHH)
>Denikestrasse 15, 20171 Hamburg
>GERMANY.
>e.mail: rajesh.bommareddy at tu-harburg.de
>Phone:+4940428784011
>Mobile:+4917663673522

From dence at genetics.utah.edu Mon Jul 7 08:24:33 2014
From: dence at genetics.utah.edu (Daniel Ence)
Date: Mon, 7 Jul 2014 14:24:33 +0000
Subject: [maker-devel] Help with updating an annotation
In-Reply-To:
References: <2176827D-6E54-4951-B01E-CDAC15DB3A2E@tuebingen.mpg.de> <997CF579-F509-4699-A366-30D53AD6281E@genetics.utah.edu> <90F65756-74A0-4D2F-A49F-42C5EDDB25E9@tuebingen.mpg.de>
Message-ID: <8219A0C0-DBB0-4417-8B4F-39D6D7F93B93@genetics.utah.edu>

Hi Saad,

I think that's correct. As a sub-step for each of the steps you listed, I would also choose one or two large scaffolds out of your assembly to use as a test set, and use that test set to make sure that you are getting output like you'd expect before running MAKER on the whole genome.

Let me know if there's anything else we can do to help.

~Daniel

Daniel Ence
Graduate Student
dence at genetics.utah.edu
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330

On Jul 7, 2014, at 7:08 AM, Saad Arif wrote:

Thanks for this. Would the following protocol be appropriate then, given i want to augment and merge an existing annotation with any novel genes:

i) Run MAKER pipeline iteratively to generate an HMM for SNAP using my new RNAseq data and protein fastas from closely related organisms (with est2genome and protein2genome options on).

ii) Turn off est2genome and protein2genome options and run Maker with my RNAseq evidence, protein fastas, SNAP HMM and my current annotation as model_gff.

Thanks in advance for any input.
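The two passes described above might translate into control-file settings along these lines. This is a minimal sketch under stated assumptions: the option names (genome, est, protein, est2genome, protein2genome, snaphmm, model_gff) are the ones maker_opts.ctl uses, but every file name is hypothetical, and a real control file holds many more options that would be left at their defaults or edited by hand.

```python
def render_opts(options):
    """Render a dict of settings as maker_opts.ctl style key=value lines."""
    return "\n".join(f"{key}={value}" for key, value in options.items())


# Pass (i): evidence-only models, used to train SNAP. File names are hypothetical.
pass1 = {
    "genome": "assembly.fasta",
    "est": "rnaseq_transcripts.fasta",
    "protein": "related_species_proteins.fasta",
    "est2genome": 1,
    "protein2genome": 1,
}

# Pass (ii): SNAP runs inside MAKER with hints; the existing annotation goes to model_gff.
pass2 = dict(pass1)
pass2.update({
    "est2genome": 0,
    "protein2genome": 0,
    "snaphmm": "my_species.hmm",
    "model_gff": "ensembl_reference.gff3",
})

print(render_opts(pass1))
print()
print(render_opts(pass2))
```

Printing both passes side by side makes it easy to check that only the intended options change between the training run and the final run.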
best,
Saad

On 20 Jun 2014, at 23:42, Carson Holt wrote:

"I have to specify an ab initio gene model for any locus that I wish to annotate using evidence alignment (i.e. there must be a preexisting model)?"

Not exactly. You need to supply an HMM for SNAP or species file for Augustus, etc. MAKER doesn't generate gene predictions, SNAP does. You cannot get updated models unless you've provided a way for those models to be updated. MAKER will provide SNAP/Augustus with hints to make them perform better based on the evidence, but those hints won't even be generated and the programs won't even run unless you provide the HMM. Also if you provide models in gff3 format to pred_gff, there is no hint feedback (because there is no program to receive the hints - just an immutable GFF3 file).

If you don't have an HMM for SNAP for your organism, you can generate one using the documentation here (from GMOD 2014 tutorial) --> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Training_ab_initio_Gene_Predictors

--Carson

From: Saad Arif
Date: Wednesday, June 18, 2014 at 10:42 AM
To: Daniel Ence
Cc:
Subject: Re: [maker-devel] Help with updating an annotation

Thanks Daniel. I think it's more clear to me now. So if I understand correctly now:

I have to specify an ab initio gene model for any locus that I wish to annotate using evidence alignment (i.e. there must be a preexisting model)? These ab initio gene models can be trained internally in Maker with SNAP using my cufflinks output as EST evidence. Alternatively, I can provide alternative ab initio predictions (for regions not present in my ensembl ref passed to model_gff) for regions overlapping my cufflinks output via the pred_gff option? Since i'm interested in unannotated regions, i'm also passing in reference proteomes of closely related species as protein homology evidence.
As such, I should be able to keep only evidence-supported predictions (for regions not present in my model_gff, and/or better-supported models for regions that are present) from my pred_gff and merge them with the Ensembl annotations from the model_gff? Let me know if I'm still missing something here. Thanks in advance.

best,
Saad

On 18 Jun 2014, at 17:21, Daniel Ence wrote:

Hi Saad,

Maker doesn't view EST or protein evidence as gene models in themselves. There's a good reason for this. Aligners like blast don't guarantee complete gene models with accurate start and stop codons and splice sites. With its default settings, maker won't make a gene model unless there's evidence that overlaps an ab-initio prediction (or something from the pred_gff option). You can use est2genome to promote everything from the est_gff option to a gene model, but this will probably give you many spurious results. What you're saying with est2genome is, "Everything that this tool found is a complete gene model." I don't think that's true even for cufflinks output.

One of the gene predictors that can run internally is snap. It's really easy to train; here's a link to a tutorial for training it: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Training_ab_initio_Gene_Predictors

Let me know if that helps, or if you have more questions.

~Daniel

Daniel Ence
Graduate Student
dence at genetics.utah.edu
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330

On Jun 18, 2014, at 5:09 AM, Saad Arif wrote:

Thank you for the response. I still have one question though, with these options:

est_gff=cufflinksout.GFF
model_gff=ensembl reference.GFF

What happens to cufflinks assembled transcripts that are not confined to current gene loci (i.e. novel genes in the cufflinks output)? Would I have to prepare ab initio gene predictions for each of these predicted 'new' genes?
Is there a simple way to combine adding new genes and improving an existing annotation? Any feedback on this would be greatly appreciated.

saad

On 13 Jun 2014, at 17:59, Carson Holt wrote:

Use the cufflinks features instead of the tophat features (tophat tends to be really noisy). Give the existing models to model_gff (they will then always be kept unless something better is found). There is no option to keep models and then just add isoforms. The model_gff input will either be kept as is (unchanged), or replaced with an updated model suggested by the evidence (the updated model may contain multiple isoforms though), and map_forward=1 can be used to pull names forward from the old models onto the new models.

Thanks,
Carson

On 6/13/14, 5:03 AM, "Saad Arif" wrote:

Dear All,

I would like to use the Maker pipeline to expand a current annotation (new isoforms and novel genes with respect to the current annotation) and was wondering if anyone had experience with this and/or suggestions regarding my questions. Briefly: I have tophat splice junctions from RNAseq data, or alternatively cufflinks-generated transcript models (fasta format), that I want to use as my new data (est_gff or est). I want to provide the current Ensembl annotation for gene prediction, but I want this annotation to remain unchanged. Hence, I'm not sure if I should provide this annotation as pred_gff or model_gff. Can the model_gff be used for gene prediction, or is it just a subset of pred_gff that remains unaltered? Can we provide the same annotation for both options (pred_gff and model_gff)? Importantly, my main goal is to use the new RNAseq data to add more isoforms and (any) novel genes to the existing Ensembl annotation. Any thoughts or suggestions on how to go about this would be sincerely appreciated.
Thanks in advance,
saad

_______________________________________________
maker-devel mailing list
maker-devel at box290.bluehost.com
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From n.jue at uconn.edu Mon Jul 7 09:26:05 2014
From: n.jue at uconn.edu (Nathaniel Jue)
Date: Mon, 7 Jul 2014 11:26:05 -0400
Subject: [maker-devel] Couple quick questions about Maker
Message-ID:

Hi,

I'm trying to run maker on a couple of genomes right now and was wondering if folks had any thoughts on ways to speed it up a bit. I'm running it on a 48-processor supercomputer (lots of RAM, usually used for genome assembly). Both these genomes are a little fragmented, so there are lots of contigs, which slows down the whole process. I am looking for ways to speed things up and was wondering about a couple of things:

1) I'm currently just at the first round of maker predictions using EST and protein evidence to build models. I had already done RepeatMasking, so I thought I'd just input the resulting GFF to speed it up. Maker didn't seem to like the GFF, so two questions: i) any thoughts on why that GFF wasn't acceptable? It's the one that repeatmasker outputs if you ask it to; and ii) providing this GFF should generally allow the program to bypass the RepeatMasking step, correct? Does it also make it bypass the Repeat ORF searching step?

2) I plan to run both SNAP and Augustus on these genomes as well. The two-step SNAP training from the tutorials seems straightforward, but I was wondering about the Augustus step.
From what I can tell, simply providing an Augustus "trained" species name should turn on Augustus, and blast/blat-like hints generated within Maker are then used in gene prediction. Any thoughts on whether it's either more accurate or faster to do the Augustus predictions outside of the Maker pipeline and then import them using the pred_gff parameter in the maker_opts file?

3) Finally, I noticed that you had a script for converting cegma gff files to zff files for snap training? Currently, I am using predicted transcripts for this species and protein sequences from related species for training. Does anyone have any insight into using CEGMA results as well? Do you work iteratively with them? For instance, start with hints from more distant taxa (i.e. CEGMA) and then work your way closer? Or just throw everything in at once and retrain after that?

Thanks in advance for any advice and insight.

Cheers,
Nate

Nathaniel Jue, Ph.D.
Department of Molecular and Cell Biology
University of Connecticut
Storrs, CT 06269

From dence at genetics.utah.edu Mon Jul 7 10:00:45 2014
From: dence at genetics.utah.edu (Daniel Ence)
Date: Mon, 7 Jul 2014 16:00:45 +0000
Subject: [maker-devel] Couple quick questions about Maker
In-Reply-To:
References:
Message-ID: <00469BE3-C2F8-45BB-8357-EC47CE0D7376@genetics.utah.edu>

Hi Nathaniel,

1) We'll need to see the error messages that MAKER was giving to understand what might have gone wrong with the RepeatMasker gff3 file. If you could run maker on one of your scaffolds with your current settings and send us the complete output, we can start to figure out what happened.

2) MAKER interacts with its gene predictors (augustus, snap, and the other ones listed in the control files) in a way that improves their performance (with the hints and such). When you supply predictions through the pred_gff parameter, MAKER can't give that performance improvement, so there's something of a tradeoff there. I think the performance improvement is a key part of MAKER's success, so I would definitely recommend running the ab-initio tools internally.

MAKER tries to save you time by saving results from run to run and only rerunning tools (usually blast tools) whose parameters were changed in the control files. Taking advantage of that will probably be the biggest time saver for you. Something else that could save you almost as much time would be to set a reasonable lower bound on the size of contigs that maker will try to annotate (usually skipping contigs shorter than 5 kbp or 10 kbp, depending on your genome). This lower bound is set with the min_contig parameter.

I'll have to check with my lab mates about the Repeat ORF searching and how they use CEGMA results. I think you can probably just put them all in there at once though.

~Daniel

Daniel Ence
Graduate Student
dence at genetics.utah.edu
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330
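Daniel's suggestion of a lower bound on contig size can be explored before committing to a run. The following is a minimal sketch, not part of MAKER: it reads FASTA-formatted lines and reports how many contigs and bases would fall below a candidate min_contig value. The function names are hypothetical, and the sketch assumes that contigs strictly shorter than the threshold are the ones skipped.

```python
from typing import Dict, Iterable, Iterator, Tuple


def fasta_lengths(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (sequence_id, length) for each record in FASTA-formatted lines."""
    name = None
    length = 0
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if name is not None:
                yield name, length
            parts = line[1:].split()
            name = parts[0] if parts else ""  # ID is the first word of the header
            length = 0
        elif name is not None:
            length += len(line)
    if name is not None:
        yield name, length


def summarize_min_contig(lines: Iterable[str], min_contig: int) -> Dict[str, int]:
    """Count contigs and bases kept or skipped for a candidate threshold.

    Assumption: a contig is skipped when its length is strictly below min_contig.
    """
    summary = {"kept": 0, "skipped": 0, "kept_bp": 0, "skipped_bp": 0}
    for _name, length in fasta_lengths(lines):
        if length < min_contig:
            summary["skipped"] += 1
            summary["skipped_bp"] += length
        else:
            summary["kept"] += 1
            summary["kept_bp"] += length
    return summary
```

Passing an open file handle for the assembly FASTA together with, say, 5000 or 10000 gives a quick sense of how much of a fragmented assembly each threshold would leave out of annotation.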
From carsonhh at gmail.com Mon Jul 7 10:26:43 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 07 Jul 2014 10:26:43 -0600
Subject: [maker-devel] Couple quick questions about Maker
In-Reply-To: <00469BE3-C2F8-45BB-8357-EC47CE0D7376@genetics.utah.edu>
References: <00469BE3-C2F8-45BB-8357-EC47CE0D7376@genetics.utah.edu>
Message-ID:

I don't think RepeatMasker produces GFF3. I believe it is GFF2 with the -gff option (which is pretty different). Also, if you provide GFF3 files for repeats, you will still need to turn off repeat masking in the control files by blanking out the options. MAKER also uses a step called RepeatRunner against an internal transposable element protein database, which is probably still running (and is slow because it's a search in translated protein space).

For performance, you may want to give a larger max_dna_len for the MAKER run, given that you have a large RAM machine. Also set all the depth_blast options in maker_bopts.ctl to 15 or 20.

CEGMA is convenient for training predictors because it finds genes that are expected to be in every eukaryote (i.e. high confidence). You can combine these with est2genome/protein2genome results from MAKER if you want. You can then use the resulting HMM for a larger MAKER run with experimental evidence, and then train again on those results. But beware that there is rarely any benefit from training beyond that second round. More training actually tends to make things worse (the overtraining paradox).

--Carson

From n.jue at uconn.edu Mon Jul 7 11:21:50 2014
From: n.jue at uconn.edu (Nathaniel Jue)
Date: Mon, 7 Jul 2014 13:21:50 -0400
Subject: [maker-devel] Couple quick questions about Maker
In-Reply-To:
References: <00469BE3-C2F8-45BB-8357-EC47CE0D7376@genetics.utah.edu>
Message-ID:

Thanks for the input guys. I'm guessing the error was probably from not turning off the repeat prediction, or from the GFF file itself. I have a script for converting repeatmasker output to gff, so maybe I'll just try that if I want to follow up on it. Thanks for the tips on parameter adjustments and thoughts on running the program as well.

Just a few more quick follow-up questions for you: is there a preferred method for stopping a job so that it will be able to restart while maximizing the benefits of the run.logs, etc.? Or just Ctrl-C it and start over? It seems that if I adjust those parameter values it may restart from the very beginning, as changing the opts file sometimes does that. Is that to be expected? If so, should I just bite the bullet and restart from the beginning, or is it best to finish a run and then change options?

Thanks,
Nate

Nathaniel Jue, Ph.D.
Department of Molecular and Cell Biology
University of Connecticut
Storrs, CT 06269
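For the RepeatMasker conversion mentioned in this exchange, a minimal sketch of turning RepeatMasker's tabular .out report into GFF3 lines might look like the following. It is not the script referred to above, which was not posted. The column positions follow the standard .out layout; the `match` feature type, the `rm_hit_N` IDs, and the `repeat_class` attribute are assumptions chosen for illustration, and the feature layout a given MAKER version expects for repeat GFF3 input should be checked against its documentation.

```python
from typing import Iterable, Iterator
from urllib.parse import quote


def repeatmasker_out_to_gff3(
    lines: Iterable[str],
    source: str = "repeatmasker",
    feature_type: str = "match",  # assumed feature type, adjust as needed
) -> Iterator[str]:
    """Convert lines of a RepeatMasker .out report into GFF3 lines.

    Expected .out columns (whitespace separated): 0 SW score, 4 query
    sequence, 5 begin, 6 end, 8 strand ('+' or 'C' for complement),
    9 repeat name, 10 repeat class/family.
    """
    yield "##gff-version 3"
    hit_number = 0
    for raw in lines:
        fields = raw.split()
        # Header and blank lines do not start with a numeric score.
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        hit_number += 1
        score, seqid, start, end = fields[0], fields[4], fields[5], fields[6]
        strand = "-" if fields[8] == "C" else "+"
        # Percent-encode characters that are reserved in GFF3 attributes.
        name = quote(fields[9], safe="()/")
        family = quote(fields[10], safe="()/")
        attributes = f"ID=rm_hit_{hit_number};Name={name};repeat_class={family}"
        yield "\t".join(
            [seqid, source, feature_type, start, end, score, strand, ".", attributes]
        )
```

Writing the yielded lines to a file, one per line, produces a GFF3 file with one feature per RepeatMasker hit; whether that is sufficient for a given pipeline is something to verify on a single scaffold first.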
From carsonhh at gmail.com Mon Jul 7 11:26:34 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 07 Jul 2014 11:26:34 -0600
Subject: [maker-devel] Couple quick questions about Maker
In-Reply-To:
References: <00469BE3-C2F8-45BB-8357-EC47CE0D7376@genetics.utah.edu>
Message-ID:

Just ^C. If you change options, MAKER will restart at a point determined by what is affected by the change. Since repeat masking affects everything downstream, everything will start from zero. If it were a change like swapping the HMM or altering depth_blastn, it would be less disruptive and MAKER could reuse all existing raw reports. Unfortunately it's not that way for altering repeat masking options.

--Carson

From n.jue at uconn.edu Tue Jul 8 09:56:37 2014
From: n.jue at uconn.edu (Nathaniel Jue)
Date: Tue, 8 Jul 2014 11:56:37 -0400
Subject: [maker-devel] Couple quick questions about Maker
In-Reply-To:
References: <00469BE3-C2F8-45BB-8357-EC47CE0D7376@genetics.utah.edu>
Message-ID:

Carson, one more question: any suggestions on how to combine the cegma and maker est2genome/protein2genome results? Can I just concatenate and sort the gff files, or are there specific formatting issues I need to consider? No overlapping regions or something like that?

Thanks,
Nate

Nathaniel Jue, Ph.D.
Department of Molecular and Cell Biology
University of Connecticut
Storrs, CT 06269

On Mon, Jul 7, 2014 at 12:26 PM, Carson Holt wrote:

> I don't think RepeatMasker produces GFF3. I believe it is GFF2 with the
> -gff option (which is pretty different). Also if you provide GFF3 files for
> repeats, you will still need to turn off repeat masking in the control files
> by blanking out the options. Also MAKER uses a step called RepeatRunner
> against an internal transposable element protein database, which is
> probably still running (and is slow because it's a search in translated
> protein space).
>
> For performance, you may want to give a larger max_dna_len for the MAKER
> run given that you have a large RAM machine. Also set all the depth_blast
> in maker_bopts.ctl to 15 or 20.
> > CEGMA is convenient for training predictors because it finds genes that > will always be in every eukaryote (I.e. high confidence). You can combine > these with est2genome/protein2genome results from MAKER if you want. You > can then use the resulting HMM for a larger MAKER run with experimental > evidence, and then train again on those results. But beware than there is > rarely any benefit from training beyond that second round. More training > actually tends to makes things worse (the overtraining paradox). > > --Carson > > > > From: Daniel Ence > Date: Monday, July 7, 2014 at 10:00 AM > To: Nathaniel Jue > Cc: "" > Subject: Re: [maker-devel] Couple quick questions about Maker > > Hi Nathaniel, > > 1) We'll need to see the error messages that MAKER was giving to > understand what might have gone wrong with the Repeat Masker gff3 file. If > you could run maker on one of your scaffolds with your current settings and > send us the complete output, we can start to figure out what happened. > > 2) MAKER interacts with its gene predictors (augustus, snap, and the other > ones listed in the control files) in a way that improves their performance > (with the hints and such). When you supply predictions through the pred_gff > parameter, MAKER can't give that performance improvement, so there's > something of a tradeoff there. I think the performance improvement is a key > part of MAKER's success, so I would definitely recommend running the > ab-initio tools internally. > > MAKER tries to save you time by saving results from run to run and only > rerunning tools (usually blast tools) that had their parameters changed in > the control files. Taking advantage of that will probably be the biggest > time saver for you. Something else that could save you almost as much time > would be to set a reasonable lower-bound on the size of contigs that maker > will try to annotate (usually <5kbp or <10kbp depending on your genome). 
> This parameter is set with the min_contig parameter. > > I'll have to check with my lab mates about the Repeat ORF searching and > how they use CEGMA results. I think you can probably just put them all in > there at once though. > > ~Daniel > > > > > Daniel Ence > Graduate Student > dence at genetics.utah.edu > Eccles Institute of Human Genetics > University of Utah > 15 North 2030 East, Room 2100 > Salt Lake City, UT 84112-5330 > > On Jul 7, 2014, at 9:26 AM, Nathaniel Jue > wrote: > > Hi, > > I'm trying to run maker on a couple genomes right now and was wondering if > folks had any thoughts on way to speed it up a bit. I'm running it on a > 48-processor supercomputer (lots of RAM, usually use it for genome > assembly). Both these genomes are a little fragmented, so there are lots of > contigs, which slows down the whole process. I am looking for ways to speed > things up and was wondering about a couple things: > > 1) I'm currently just at the first round of maker predictions using EST > and protein evidence to build models. Had already done RepeatMasking so > thought I'd just input subsequent GFF to speed it up. Didn't seem to like > the GFF, so two questions: i) any thoughts on why that GFF wasn't > acceptable? It's the one that repeatmasker outputs if you ask it to; and > ii) Providing this GFF, should generally allow the program to bypass the > RepeatMasking step, correct? Does it also make it bypass the Repeat ORF > searching step? > > 2) I plan to run both SNAP and Augustus on these genomes as well. The > two-step SNAP training from the tutorials seems straightforward, but I was > wondering about the Augustus step. From what I can tell, simply providing > an Augustus "trained" species name should turn on Augustus and > blast/blat-like hints generated within Maker are then used in gene > prediction. 
Any thoughts on if it's either more accurate or faster to do > the Augustus predictions outside of the Maker pipeline and then import them > using the pred_gff parameter in the maker_opts file? > > 3) Finally, I noticed that you had a script for converting cegma gff files > to zff file for snap training? Currently, I am using predicted transcript > for this species and protein sequences from related species to training. > Does anyone have any insight into using CEGMA results as well? Do you work > iteratively with them? For instance, start with the using hints from more > distant taxa (i.e. CEGMA) and then work your way closer? Just throw > everything in at once and retrain after that? > > Thanks in advance for any advice and insight. > > Cheers, > Nate > > > *Nathaniel Jue, Ph.D.* > Department of Molecular and Cell Biology > University of Connecticut > Storrs, CT 06269 > > [image: LinkedIn] > > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > > _______________________________________________ maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Tue Jul 8 10:31:40 2014 From: carsonhh at gmail.com (Carson Holt) Date: Tue, 08 Jul 2014 10:31:40 -0600 Subject: [maker-devel] Couple quick questions about Maker In-Reply-To: References: <00469BE3-C2F8-45BB-8357-EC47CE0D7376@genetics.utah.edu> Message-ID: Convert them both to ZFF, then concatenate the ZFF and sequence files. 
--Carson From: Nathaniel Jue Date: Tuesday, July 8, 2014 at 9:56 AM To: Carson Holt Cc: Daniel Ence , "" Subject: Re: [maker-devel] Couple quick questions about Maker Carson, one more question: Any suggestions on how to combine the cegma and maker est2genome/protein2genome results? Can I just concatenate and sort the gff files or are there specific formating issues I need to consider? No overlapping regions or something like that? Thanks, Nate Nathaniel Jue, Ph.D. Department of Molecular and Cell Biology University of Connecticut Storrs, CT 06269 On Mon, Jul 7, 2014 at 12:26 PM, Carson Holt wrote: > I don't think RepeatMasker produces GFF3. I believe it is GFF2 with the -gff > option (which is pretty different). Also If you provide GFF# files for > repeats, you will still need to turn of repeat masking in the control files by > blanking out the options. Also MAKER uses a step called RepeatRunner against > an internal transposable element protein databases which is probably still > running (and is slow because it's a search in translated protein space). > > For performance, you may want to give a larger max_dna_len for the MAKER run > given that you have a large RAM machine. Also set all the depth_blast in > maker_bopts.ctl to 15 or 20. > > CEGMA is convenient for training predictors because it finds genes that will > always be in every eukaryote (I.e. high confidence). You can combine these > with est2genome/protein2genome results from MAKER if you want. You can then > use the resulting HMM for a larger MAKER run with experimental evidence, and > then train again on those results. But beware than there is rarely any > benefit from training beyond that second round. More training actually tends > to makes things worse (the overtraining paradox). 
> > --Carson > > > > From: Daniel Ence > Date: Monday, July 7, 2014 at 10:00 AM > To: Nathaniel Jue > Cc: "" > Subject: Re: [maker-devel] Couple quick questions about Maker > > Hi Nathaniel, > > 1) We'll need to see the error messages that MAKER was giving to understand > what might have gone wrong with the Repeat Masker gff3 file. If you could run > maker on one of your scaffolds with your current settings and send us the > complete output, we can start to figure out what happened. > > 2) MAKER interacts with its gene predictors (augustus, snap, and the other > ones listed in the control files) in a way that improves their performance > (with the hints and such). When you supply predictions through the pred_gff > parameter, MAKER can't give that performance improvement, so there's something > of a tradeoff there. I think the performance improvement is a key part of > MAKER's success, so I would definitely recommend running the ab-initio tools > internally. > > MAKER tries to save you time by saving results from run to run and only > rerunning tools (usually blast tools) that had their parameters changed in the > control files. Taking advantage of that will probably be the biggest time > saver for you. Something else that could save you almost as much time would be > to set a reasonable lower-bound on the size of contigs that maker will try to > annotate (usually <5kbp or <10kbp depending on your genome). This parameter is > set with the min_contig parameter. > > I'll have to check with my lab mates about the Repeat ORF searching and how > they use CEGMA results. I think you can probably just put them all in there at > once though. 
> > ~Daniel > > > > > Daniel Ence > Graduate Student > dence at genetics.utah.edu > Eccles Institute of Human Genetics > University of Utah > 15 North 2030 East, Room 2100 > Salt Lake City, UT 84112-5330 > > On Jul 7, 2014, at 9:26 AM, Nathaniel Jue > wrote: > >> Hi, >> >> I'm trying to run maker on a couple genomes right now and was wondering if >> folks had any thoughts on way to speed it up a bit. I'm running it on a >> 48-processor supercomputer (lots of RAM, usually use it for genome assembly). >> Both these genomes are a little fragmented, so there are lots of contigs, >> which slows down the whole process. I am looking for ways to speed things up >> and was wondering about a couple things: >> >> 1) I'm currently just at the first round of maker predictions using EST and >> protein evidence to build models. Had already done RepeatMasking so thought >> I'd just input subsequent GFF to speed it up. Didn't seem to like the GFF, so >> two questions: i) any thoughts on why that GFF wasn't acceptable? It's the >> one that repeatmasker outputs if you ask it to; and ii) Providing this GFF, >> should generally allow the program to bypass the RepeatMasking step, correct? >> Does it also make it bypass the Repeat ORF searching step? >> >> 2) I plan to run both SNAP and Augustus on these genomes as well. The >> two-step SNAP training from the tutorials seems straightforward, but I was >> wondering about the Augustus step. From what I can tell, simply providing an >> Augustus "trained" species name should turn on Augustus and blast/blat-like >> hints generated within Maker are then used in gene prediction. Any thoughts >> on if it's either more accurate or faster to do the Augustus predictions >> outside of the Maker pipeline and then import them using the pred_gff >> parameter in the maker_opts file? >> >> 3) Finally, I noticed that you had a script for converting cegma gff files to >> zff file for snap training? 
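The advice above (convert both sets to ZFF, then concatenate the ZFF and sequence files) could be sketched roughly as follows. This is only an illustration, not a MAKER or SNAP utility: the function names are hypothetical, and it assumes ZFF text consists of ">sequence_id" header lines followed by feature lines, with each ZFF file listing the same IDs in the same order as its companion FASTA file. Because CEGMA and MAKER models will often sit on the same contigs, the sketch also reports IDs that occur in more than one set, so overlapping models can be checked by hand.

```python
def record_ids(text):
    """Return the record IDs (first word after '>') found in ZFF or FASTA text."""
    ids = []
    for line in text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            ids.append(fields[0] if fields else "")
    return ids


def merge_training_sets(pairs):
    """Concatenate (zff_text, fasta_text) pairs into one ZFF and one FASTA text.

    Returns (merged_zff, merged_fasta, shared_ids). shared_ids lists sequence
    IDs that occur in more than one input set; those contigs may carry
    overlapping gene models and are worth a manual look.
    Raises ValueError when a ZFF text and its FASTA text disagree on IDs or order.
    """
    zff_parts, fasta_parts = [], []
    seen, shared = set(), []
    for zff_text, fasta_text in pairs:
        ids = record_ids(zff_text)
        if ids != record_ids(fasta_text):
            raise ValueError("ZFF and FASTA records differ in ID or order")
        for seq_id in dict.fromkeys(ids):  # unique IDs, original order kept
            if seq_id in seen and seq_id not in shared:
                shared.append(seq_id)
        seen.update(ids)
        # Make sure every part ends with exactly one newline before joining.
        zff_parts.append(zff_text.rstrip("\n") + "\n")
        fasta_parts.append(fasta_text.rstrip("\n") + "\n")
    return "".join(zff_parts), "".join(fasta_parts), shared
```

In practice the two inputs would be read from the ZFF/FASTA files produced for the CEGMA genes and for the MAKER est2genome/protein2genome models, and the merged texts written out for SNAP training. Whether shared contigs cause trouble downstream is not settled in this thread, so the returned list is meant as a prompt to look, not as an error.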

From saad.arif at tuebingen.mpg.de Mon Jul  7 07:08:53 2014
From: saad.arif at tuebingen.mpg.de (Saad Arif)
Date: Mon, 7 Jul 2014 15:08:53 +0200
Subject: [maker-devel] Help with updating an annotation
In-Reply-To:
References: <2176827D-6E54-4951-B01E-CDAC15DB3A2E@tuebingen.mpg.de> <997CF579-F509-4699-A366-30D53AD6281E@genetics.utah.edu> <90F65756-74A0-4D2F-A49F-42C5EDDB25E9@tuebingen.mpg.de>
Message-ID:

Thanks for this. Would the following protocol be appropriate then, given I want to augment and merge an existing annotation with any novel genes:

i) Run the MAKER pipeline iteratively to generate an HMM for SNAP using my new RNAseq data and protein fastas from closely related organisms (with the est2genome and protein2genome options on).

ii) Turn off the est2genome and protein2genome options and run Maker with my RNAseq evidence, protein fastas, SNAP HMM and my current annotation as model_GFF.

Thanks in advance for any input.

best,
Saad

On 20 Jun 2014, at 23:42, Carson Holt wrote:

> "I have to specify an ab initio gene model for any locus that I wish to annotate using evidence alignment (i.e. there must be a preexisting model)?"
>
> Not exactly. You need to supply an HMM for SNAP or a species file for Augustus, etc. MAKER doesn't generate gene predictions, SNAP does. You cannot get updated models unless you've provided a way for those models to be updated. MAKER will provide SNAP/Augustus with hints to make them perform better based on the evidence, but those hints won't even be generated and the programs won't even run unless you provide the HMM. Also, if you provide models in GFF3 format to pred_gff, there is no hint feedback (because there is no program to receive the hints - just an immutable GFF3 file).
>
> If you don't have an HMM for SNAP for your organism, you can generate one using the documentation here (from GMOD 2014 tutorial) --> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Training_ab_initio_Gene_Predictors
>
> --Carson
>
>
> From: Saad Arif
> Date: Wednesday, June 18, 2014 at 10:42 AM
> To: Daniel Ence
> Cc: ""
> Subject: Re: [maker-devel] Help with updating an annotation
>
> Thanks Daniel. I think it's more clear to me now.
>
> So if I understand correctly now: I have to specify an ab initio gene model for any locus that I wish to annotate using evidence alignment (i.e. there must be a preexisting model)? These ab initio gene models can be trained internally in Maker with SNAP using my cufflinks output as EST evidence. Alternatively, I can provide alternative ab initio predictions (for regions not present in my Ensembl ref passed to model_GFF) for regions overlapping my cufflinks output via the pred_GFF option?
>
> Since I'm interested in unannotated regions, I'm also passing in reference proteomes of closely related species as protein homology evidence.
>
> As such I should be able to keep only evidence-supported predictions (for regions not present in my model_GFF and/or better-supported models for present regions) from my pred_GFF and merge them with Ensembl annotations from the model_GFF?
>
> Let me know if I'm still missing something here.
>
> Thanks in advance.
>
> best,
> Saad
> On 18 Jun 2014, at 17:21, Daniel Ence wrote:
>
>> Hi Saad,
>>
>> Maker doesn't view EST or protein evidence as gene models in themselves. There's a good reason for this. Aligners like blast don't guarantee complete gene models, with accurate start and stop codons and splice sites. With its default settings maker won't make a gene model unless there's evidence that overlaps an ab-initio prediction (or something from the pred_gff option).
>>
>> You can use est2genome to promote everything from the est_gff option to a gene model, but this will probably give you many spurious results. What you're saying with est2genome is, "Everything that this tool found is a complete gene model." I don't think that's true even for cufflinks output.
>>
>> One of the gene predictors that can run internally is snap. It's really easy to train; here's a link to a tutorial for training it: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Training_ab_initio_Gene_Predictors
>>
>> Let me know if that helps, or if you have more questions.
>>
>>
>> ~Daniel
>>
>> Daniel Ence
>> Graduate Student
>> dence at genetics.utah.edu
>> Eccles Institute of Human Genetics
>> University of Utah
>> 15 North 2030 East, Room 2100
>> Salt Lake City, UT 84112-5330
>>
>> On Jun 18, 2014, at 5:09 AM, Saad Arif wrote:
>>
>>> Thank you for the response. I still have one question though, with these options:
>>>
>>> est_GFF=cufflinksout.GFF
>>>
>>> model_GFF= ensembl reference.GFF
>>>
>>> What happens to cufflinks assembled transcripts that are not confined to current gene loci (i.e. novel genes in cufflinks output)? Would I have to prepare ab initio gene predictions for each of these predicted 'new' genes?
>>> Is there a simple way to combine adding new genes and improving an existing annotation?
>>>
>>> Any feedback on this would be greatly appreciated.
>>>
>>> saad
>>>
>>> On 13 Jun 2014, at 17:59, Carson Holt wrote:
>>>
>>>> Use the cufflinks features instead of the tophat features (tophat tends
>>>> to be really noisy). Give the existing models to model_gff (they will then
>>>> always be kept unless something better is found). There is no option to
>>>> keep models and then just add isoforms. The model_gff input will either
>>>> be kept as is (unchanged), or replaced with an updated model suggested by
>>>> the evidence (the updated model may contain multiple isoforms though), and
>>>> map_forward=1 can be used to pull names forward from the old model onto
>>>> the new models.
>>>>
>>>> Thanks,
>>>> Carson
>>>>
>>>>
>>>> On 6/13/14, 5:03 AM, "Saad Arif" wrote:
>>>>
>>>>> Dear All,
>>>>>
>>>>> I would like to use the Maker pipeline to expand a current annotation (new
>>>>> isoforms and novel genes with respect to current annotation) and was
>>>>> wondering if anyone had experience with this and/or suggestions for my
>>>>> questions.
>>>>>
>>>>> Briefly:
>>>>>
>>>>> I have tophat splice junctions from RNAseq data or alternatively
>>>>> cufflinks generated transcript models (FASTA format) that I want to use
>>>>> as my new data (est_gff or est).
>>>>>
>>>>> I want to provide the current Ensembl annotation for gene prediction but
>>>>> I want this annotation to remain unchanged. Hence, I'm not sure if I
>>>>> should provide this annotation as pred_gff
>>>>> or model_gff. Can the model_gff be used for gene prediction or is this
>>>>> just a subset of pred_gff that remains unaltered? Can we provide the same
>>>>> annotation for both options (pred_gff and model_gff)?
>>>>>
>>>>> Importantly, my main goal is to use the new RNAseq data to add more
>>>>> isoforms and (any) novel genes to the existing Ensembl annotation. Any
>>>>> thoughts or suggestions on how to go about this would be sincerely
>>>>> appreciated.
>>>>>
>>>>>
>>>>> Thanks in advance,
>>>>> saad

From carsonhh at gmail.com Thu Jul 10 15:38:52 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 10 Jul 2014 15:38:52 -0600
Subject: [maker-devel] Couple quick questions about Maker
In-Reply-To:
References: <00469BE3-C2F8-45BB-8357-EC47CE0D7376@genetics.utah.edu>
Message-ID:

Just rerun in the same directory and it will reuse the old reports, so it won't have to rerun RepeatMasker etc.

--Carson

From: Nathaniel Jue
Date: Thursday, July 10, 2014 at 3:36 PM
To: Carson Holt
Subject: Re: [maker-devel] Couple quick questions about Maker

Is there a way to bypass the repeat prediction after it's done the first time? Seems like when I went to redo the snap training, it decided to redo the repeat masking as well. Does it always do that? Maybe I'm misinterpreting something? If there is a way to give it a maker-generated GFF to bypass that step, could you tell me where to find it?
Thanks, Nate On Tue, Jul 8, 2014 at 12:31 PM, Carson Holt wrote: > Convert them both to ZFF, then concatenate the ZFF and sequence files. > > --Carson > > > From: Nathaniel Jue > Date: Tuesday, July 8, 2014 at 9:56 AM > To: Carson Holt > Cc: Daniel Ence , "" > > > Subject: Re: [maker-devel] Couple quick questions about Maker > > Carson, one more question: Any suggestions on how to combine the cegma and > maker est2genome/protein2genome results? Can I just concatenate and sort the > gff files or are there specific formating issues I need to consider? No > overlapping regions or something like that? > > Thanks, > Nate > > > Nathaniel Jue, Ph.D. > > Department of Molecular and Cell Biology > > University of Connecticut > > Storrs, CT 06269 > > > > iel-jue%2F1%2F531%2F176%2F&sn=> > > > On Mon, Jul 7, 2014 at 12:26 PM, Carson Holt wrote: >> I don't think RepeatMasker produces GFF3. I believe it is GFF2 with the -gff >> option (which is pretty different). Also If you provide GFF# files for >> repeats, you will still need to turn of repeat masking in the control files >> by blanking out the options. Also MAKER uses a step called RepeatRunner >> against an internal transposable element protein databases which is probably >> still running (and is slow because it's a search in translated protein >> space). >> >> For performance, you may want to give a larger max_dna_len for the MAKER run >> given that you have a large RAM machine. Also set all the depth_blast in >> maker_bopts.ctl to 15 or 20. >> >> CEGMA is convenient for training predictors because it finds genes that will >> always be in every eukaryote (I.e. high confidence). You can combine these >> with est2genome/protein2genome results from MAKER if you want. You can then >> use the resulting HMM for a larger MAKER run with experimental evidence, and >> then train again on those results. But beware than there is rarely any >> benefit from training beyond that second round. 
More training actually tends >> to makes things worse (the overtraining paradox). >> >> --Carson >> >> >> >> From: Daniel Ence >> Date: Monday, July 7, 2014 at 10:00 AM >> To: Nathaniel Jue >> Cc: "" >> Subject: Re: [maker-devel] Couple quick questions about Maker >> >> Hi Nathaniel, >> >> 1) We'll need to see the error messages that MAKER was giving to understand >> what might have gone wrong with the Repeat Masker gff3 file. If you could run >> maker on one of your scaffolds with your current settings and send us the >> complete output, we can start to figure out what happened. >> >> 2) MAKER interacts with its gene predictors (augustus, snap, and the other >> ones listed in the control files) in a way that improves their performance >> (with the hints and such). When you supply predictions through the pred_gff >> parameter, MAKER can't give that performance improvement, so there's >> something of a tradeoff there. I think the performance improvement is a key >> part of MAKER's success, so I would definitely recommend running the >> ab-initio tools internally. >> >> MAKER tries to save you time by saving results from run to run and only >> rerunning tools (usually blast tools) that had their parameters changed in >> the control files. Taking advantage of that will probably be the biggest time >> saver for you. Something else that could save you almost as much time would >> be to set a reasonable lower-bound on the size of contigs that maker will try >> to annotate (usually <5kbp or <10kbp depending on your genome). This >> parameter is set with the min_contig parameter. >> >> I'll have to check with my lab mates about the Repeat ORF searching and how >> they use CEGMA results. I think you can probably just put them all in there >> at once though. 
>> >> ~Daniel >> >> >> >> >> Daniel Ence >> Graduate Student >> dence at genetics.utah.edu >> Eccles Institute of Human Genetics >> University of Utah >> 15 North 2030 East, Room 2100 >> Salt Lake City, UT 84112-5330 >> >> On Jul 7, 2014, at 9:26 AM, Nathaniel Jue >> wrote: >> >>> Hi, >>> >>> I'm trying to run maker on a couple genomes right now and was wondering if >>> folks had any thoughts on way to speed it up a bit. I'm running it on a >>> 48-processor supercomputer (lots of RAM, usually use it for genome >>> assembly). Both these genomes are a little fragmented, so there are lots of >>> contigs, which slows down the whole process. I am looking for ways to speed >>> things up and was wondering about a couple things: >>> >>> 1) I'm currently just at the first round of maker predictions using EST and >>> protein evidence to build models. Had already done RepeatMasking so thought >>> I'd just input subsequent GFF to speed it up. Didn't seem to like the GFF, >>> so two questions: i) any thoughts on why that GFF wasn't acceptable? It's >>> the one that repeatmasker outputs if you ask it to; and ii) Providing this >>> GFF, should generally allow the program to bypass the RepeatMasking step, >>> correct? Does it also make it bypass the Repeat ORF searching step? >>> >>> 2) I plan to run both SNAP and Augustus on these genomes as well. The >>> two-step SNAP training from the tutorials seems straightforward, but I was >>> wondering about the Augustus step. From what I can tell, simply providing an >>> Augustus "trained" species name should turn on Augustus and blast/blat-like >>> hints generated within Maker are then used in gene prediction. Any thoughts >>> on if it's either more accurate or faster to do the Augustus predictions >>> outside of the Maker pipeline and then import them using the pred_gff >>> parameter in the maker_opts file? >>> >>> 3) Finally, I noticed that you had a script for converting cegma gff files >>> to zff file for snap training? 
Currently, I am using predicted transcript >>> for this species and protein sequences from related species to training. >>> Does anyone have any insight into using CEGMA results as well? Do you work >>> iteratively with them? For instance, start with the using hints from more >>> distant taxa (i.e. CEGMA) and then work your way closer? Just throw >>> everything in at once and retrain after that? >>> >>> Thanks in advance for any advice and insight. >>> >>> Cheers, >>> Nate >>> >>> >>> Nathaniel Jue, Ph.D. >>> Department of Molecular and Cell Biology >>> University of Connecticut >>> Storrs, CT 06269 >>> >>> >>> >> aniel-jue%2F1%2F531%2F176%2F&sn=> >>> _______________________________________________ >>> maker-devel mailing list >>> maker-devel at box290.bluehost.com >>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org >> >> _______________________________________________ maker-devel mailing list >> maker-devel at box290.bluehost.comhttp://box290.bluehost.com/mailman/listinfo/ma >> ker-devel_yandell-lab.org > -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Thu Jul 10 15:44:48 2014 From: carsonhh at gmail.com (Carson Holt) Date: Thu, 10 Jul 2014 15:44:48 -0600 Subject: [maker-devel] Couple quick questions about Maker In-Reply-To: References: <00469BE3-C2F8-45BB-8357-EC47CE0D7376@genetics.utah.edu> Message-ID: Also you can use repeat_gff in the control files, by I prefer just to rerun in the same directory as the previous job. --Carson From: Carson Holt Date: Thursday, July 10, 2014 at 3:38 PM To: Nathaniel Jue Cc: "" Subject: Re: [maker-devel] Couple quick questions about Maker Just rerun in the same directory and it will reuse the old reports, so it won't have to rerun RepeatMasker etc. 
--Carson From: Nathaniel Jue Date: Thursday, July 10, 2014 at 3:36 PM To: Carson Holt Subject: Re: [maker-devel] Couple quick questions about Maker Is there a way to by-pass the repeat prediction after it's done the first time? Seems like when I went to re-do the snap training, it's decided to re-do the repeatmasking as well. Does it always do that? Maybe I'm misinterpreting something? If there is a way to give it a maker generated gff to by-pass that step could you tell me where to find it? Thanks, Nate On Tue, Jul 8, 2014 at 12:31 PM, Carson Holt wrote: > Convert them both to ZFF, then concatenate the ZFF and sequence files. > > --Carson > > > From: Nathaniel Jue > Date: Tuesday, July 8, 2014 at 9:56 AM > To: Carson Holt > Cc: Daniel Ence , "" > > > Subject: Re: [maker-devel] Couple quick questions about Maker > > Carson, one more question: Any suggestions on how to combine the cegma and > maker est2genome/protein2genome results? Can I just concatenate and sort the > gff files or are there specific formating issues I need to consider? No > overlapping regions or something like that? > > Thanks, > Nate > > > Nathaniel Jue, Ph.D. > > Department of Molecular and Cell Biology > > University of Connecticut > > Storrs, CT 06269 > > > > iel-jue%2F1%2F531%2F176%2F&sn=> > > > On Mon, Jul 7, 2014 at 12:26 PM, Carson Holt wrote: >> I don't think RepeatMasker produces GFF3. I believe it is GFF2 with the -gff >> option (which is pretty different). Also If you provide GFF# files for >> repeats, you will still need to turn of repeat masking in the control files >> by blanking out the options. Also MAKER uses a step called RepeatRunner >> against an internal transposable element protein databases which is probably >> still running (and is slow because it's a search in translated protein >> space). >> >> For performance, you may want to give a larger max_dna_len for the MAKER run >> given that you have a large RAM machine. 
Also set all the depth_blast in >> maker_bopts.ctl to 15 or 20. >> >> CEGMA is convenient for training predictors because it finds genes that will >> always be in every eukaryote (I.e. high confidence). You can combine these >> with est2genome/protein2genome results from MAKER if you want. You can then >> use the resulting HMM for a larger MAKER run with experimental evidence, and >> then train again on those results. But beware than there is rarely any >> benefit from training beyond that second round. More training actually tends >> to makes things worse (the overtraining paradox). >> >> --Carson >> >> >> >> From: Daniel Ence >> Date: Monday, July 7, 2014 at 10:00 AM >> To: Nathaniel Jue >> Cc: "" >> Subject: Re: [maker-devel] Couple quick questions about Maker >> >> Hi Nathaniel, >> >> 1) We'll need to see the error messages that MAKER was giving to understand >> what might have gone wrong with the Repeat Masker gff3 file. If you could run >> maker on one of your scaffolds with your current settings and send us the >> complete output, we can start to figure out what happened. >> >> 2) MAKER interacts with its gene predictors (augustus, snap, and the other >> ones listed in the control files) in a way that improves their performance >> (with the hints and such). When you supply predictions through the pred_gff >> parameter, MAKER can't give that performance improvement, so there's >> something of a tradeoff there. I think the performance improvement is a key >> part of MAKER's success, so I would definitely recommend running the >> ab-initio tools internally. >> >> MAKER tries to save you time by saving results from run to run and only >> rerunning tools (usually blast tools) that had their parameters changed in >> the control files. Taking advantage of that will probably be the biggest time >> saver for you. 
Something else that could save you almost as much time would >> be to set a reasonable lower-bound on the size of contigs that maker will try >> to annotate (usually <5kbp or <10kbp depending on your genome). This >> parameter is set with the min_contig parameter. >> >> I'll have to check with my lab mates about the Repeat ORF searching and how >> they use CEGMA results. I think you can probably just put them all in there >> at once though. >> >> ~Daniel >> >> >> >> >> Daniel Ence >> Graduate Student >> dence at genetics.utah.edu >> Eccles Institute of Human Genetics >> University of Utah >> 15 North 2030 East, Room 2100 >> Salt Lake City, UT 84112-5330 >> >> On Jul 7, 2014, at 9:26 AM, Nathaniel Jue >> wrote: >> >>> Hi, >>> >>> I'm trying to run maker on a couple genomes right now and was wondering if >>> folks had any thoughts on way to speed it up a bit. I'm running it on a >>> 48-processor supercomputer (lots of RAM, usually use it for genome >>> assembly). Both these genomes are a little fragmented, so there are lots of >>> contigs, which slows down the whole process. I am looking for ways to speed >>> things up and was wondering about a couple things: >>> >>> 1) I'm currently just at the first round of maker predictions using EST and >>> protein evidence to build models. Had already done RepeatMasking so thought >>> I'd just input subsequent GFF to speed it up. Didn't seem to like the GFF, >>> so two questions: i) any thoughts on why that GFF wasn't acceptable? It's >>> the one that repeatmasker outputs if you ask it to; and ii) Providing this >>> GFF, should generally allow the program to bypass the RepeatMasking step, >>> correct? Does it also make it bypass the Repeat ORF searching step? >>> >>> 2) I plan to run both SNAP and Augustus on these genomes as well. The >>> two-step SNAP training from the tutorials seems straightforward, but I was >>> wondering about the Augustus step. 
>>> From what I can tell, simply providing an
>>> Augustus "trained" species name should turn on Augustus and blast/blat-like
>>> hints generated within Maker are then used in gene prediction. Any thoughts
>>> on whether it's either more accurate or faster to do the Augustus predictions
>>> outside of the Maker pipeline and then import them using the pred_gff
>>> parameter in the maker_opts file?
>>>
>>> 3) Finally, I noticed that you had a script for converting cegma gff files
>>> to zff files for snap training. Currently, I am using predicted transcripts
>>> for this species and protein sequences from related species for training.
>>> Does anyone have any insight into using CEGMA results as well? Do you work
>>> iteratively with them? For instance, start with using hints from more
>>> distant taxa (i.e. CEGMA) and then work your way closer? Just throw
>>> everything in at once and retrain after that?
>>>
>>> Thanks in advance for any advice and insight.
>>>
>>> Cheers,
>>> Nate
>>>
>>> Nathaniel Jue, Ph.D.
>>> Department of Molecular and Cell Biology
>>> University of Connecticut
>>> Storrs, CT 06269
>>>
>>> _______________________________________________
>>> maker-devel mailing list
>>> maker-devel at box290.bluehost.com
>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From fbarreto at ucsd.edu Thu Jul 10 16:02:57 2014
From: fbarreto at ucsd.edu (Felipe Barreto)
Date: Thu, 10 Jul 2014 15:02:57 -0700
Subject: [maker-devel] Problem installing exonerate during Maker setup
Message-ID:

Hi experts,

I am trying to install Maker on a new machine (running Mac OS 10.7.5), and
have succeeded so far except for the "./Build exonerate" step, which gives me
the following error:

checking for socklen_t... yes
checking for pkg-config... no
ERROR: Could not find pkg-config ... is glib-2 installed ???

Fink for 64-bit is installed, and via 'fink list', I confirmed that
glib2-dev and -shlibs are installed. I uninstalled and re-installed both
fink and glib2 several times, hoping it was a configuration problem, but
still get stuck at this step.

I found a few previous questions about this issue in this forum, but the
solutions Carson provided were directed at OS 10.6 only, apparently, so I
did not try these. I have run into the limit of what I know how to do with
these compilations.

I tried setting up Exonerate directly but it has trouble finding glib as
well.

Any suggestions?

Thank you so much!
--
Felipe Barreto
Post-doctoral Scholar
Scripps Institution of Oceanography
University of California, San Diego
La Jolla, CA 92093

From fbarreto at ucsd.edu Thu Jul 10 17:41:59 2014
From: fbarreto at ucsd.edu (Felipe Barreto)
Date: Thu, 10 Jul 2014 16:41:59 -0700
Subject: [maker-devel] Problem installing exonerate during Maker setup
In-Reply-To:
References:
Message-ID:

OK, before anyone spends too much of their time trying to help me... I think
I was able to solve my issue above.

What I did was to install an additional glib2-related package using fink
install. I installed glibmm2.4-dev, which also installs glibmm2.4-shlib.
These make up a C++ interface for the glib2 library, according to their
description. Once I installed those packages, I re-ran ./Build exonerate and
it seemed to work. I tried an exonerate command in Terminal and it recognized
it OK.

Hopefully what I did won't cause any issues down the line.

Thanks.

--
Felipe Barreto
Post-doctoral Scholar
Scripps Institution of Oceanography
University of California, San Diego
La Jolla, CA 92093

From panos.ioannidis at gmail.com Fri Jul 11 05:56:03 2014
From: panos.ioannidis at gmail.com (Panos Ioannidis)
Date: Fri, 11 Jul 2014 13:56:03 +0200
Subject: [maker-devel] (no subject)
Message-ID:

I got back to my annotations this past week and have a couple of questions!

1) Since my organism isn't closely related to any other that's already
sequenced, I will have to run maker twice (according to the tutorial). So
for the first run I see that some people use only the ESTs and some others
use ESTs and a protein database (CEGMA, Uniref50, Swiss-Prot, etc). I guess
that the ESTs will give better models, but for the cases where genes aren't
covered by an EST, it's okay to have a protein database to detect them as
well. Am I right? What do you think?

2) In case I use both ESTs and a protein database, how should I set the
est2genome and protein2genome parameters in the maker_opts.ctl file?
Should they both be set to "1"?

3) I've been thinking of running the BLAST searches separately and
giving Maker the results directly. I guess that in this case, I'll have to
first convert the BLAST output to a gff3 file and give it to the
protein_gff parameter, right?

Thanks,
Panos

From dence at genetics.utah.edu Fri Jul 11 08:08:43 2014
From: dence at genetics.utah.edu (Daniel Ence)
Date: Fri, 11 Jul 2014 14:08:43 +0000
Subject: [maker-devel] (no subject)
In-Reply-To:
References:
Message-ID:

Hi Panos,

1) You'll only use est2genome and protein2genome for creating models
that will be used for training the ab-initio predictors (like SNAP).
Sometimes that means one run of MAKER for training; sometimes that means
two runs of MAKER. You usually don't gain any accuracy after the second
round of training. It's ok to use both EST and protein data for this
training step.

2) If you're using both ESTs and protein sequence to train your
ab-initio predictors, then both est2genome and protein2genome should be set
to 1.

3) If you want to pass BLAST results to MAKER, you'll need to pass those
results as GFF3. However, MAKER will install and run BLAST for you, and does
a good job of keeping track of all those results and making them accessible
to you in the end, so it's going to be a lot of work to do those BLASTs on
your own outside of MAKER. I strongly suggest that you let MAKER run BLAST
internally.

Daniel Ence
Graduate Student
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330

From panos.ioannidis at gmail.com Mon Jul 14 01:20:50 2014
From: panos.ioannidis at gmail.com (Panos Ioannidis)
Date: Mon, 14 Jul 2014 09:20:50 +0200
Subject: [maker-devel] (no subject)
In-Reply-To:
References:
Message-ID:

Daniel, thanks for the info.

Regarding (3), the only reason I'm thinking of running BLASTs separately is
that I'm currently not able to run Maker on our cluster due to a problem
in the Perl "forks" library. And it looks like there isn't much I can do
about it; I tried Perlbrew but it doesn't work when I try to install
versions <5.18 (the problem in forks occurs on 5.18 and later versions).
Our admin also tried to change the code in the forks.pm file as per
Carson's suggestion in another thread, but that didn't work either... As a
result I'm running Maker on my workstation (really slooow) till a solution
is found, and since BLAST is a time-consuming step I was thinking of running
it separately.

From carsonhh at gmail.com Mon Jul 14 08:46:50 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 14 Jul 2014 08:46:50 -0600
Subject: [maker-devel] (no subject)
In-Reply-To:
References:
Message-ID:

If you do the BLASTs yourself the results could be dramatically worse. The
filtering and polishing done by MAKER is rather significant (direct BLAST is
actually worse with homology searches than many people realize).

With respect to forks.pm, your admin most likely edited the wrong forks.pm.
There may be more than one on your system. If you let maker install some
prerequisites for you (because it requires a specific version of forks.pm),
it may be in .../maker/perl/lib/forks.pm. Otherwise you have to identify
the exact location of the forks.pm being used. Or if he is editing it as
part of the install tarball, his edits may actually be undone during the
installation procedure.

Use this command line to identify the location of the forks.pm module
that would have to be edited -->

    maker --debug 2>&1 | grep "forks.pm"

You can even send me a copy of the file once it has been edited, and I can
tell you if it was done correctly.

--Carson

From carsonhh at gmail.com Mon Jul 14 08:49:33 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 14 Jul 2014 08:49:33 -0600
Subject: [maker-devel] (no subject)
In-Reply-To:
References:
Message-ID:

Also one more question. What is the exact error text you get for the forks
error? Is it a forks.pm error or is it an MPI warn-on-fork error (the two
are actually very different)?

--Carson

From panos.ioannidis at gmail.com Tue Jul 15 00:59:18 2014
From: panos.ioannidis at gmail.com (Panos Ioannidis)
Date: Tue, 15 Jul 2014 08:59:18 +0200
Subject: [maker-devel] (no subject)
In-Reply-To:
References:
Message-ID:

I didn't know there could be more than one forks.pm file! We'll give it
another try later today.

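
A minimal sketch of how candidate copies of forks.pm could be listed, assuming Python is available on the machine. The function name `find_module_copies` and any directories passed to it are hypothetical and not part of MAKER; the `maker --debug 2>&1 | grep "forks.pm"` command given in this thread is still the way to see which copy MAKER actually loads. The sketch only walks directories, so it never loads the module itself.

```python
import os
from typing import Iterable, List


def find_module_copies(search_dirs: Iterable[str],
                       module_file: str = "forks.pm") -> List[str]:
    """Return each distinct file named module_file under search_dirs.

    Directories that do not exist are skipped silently, because os.walk
    yields nothing for them.
    """
    hits: List[str] = []
    seen = set()
    for top in search_dirs:
        for dirpath, _dirnames, filenames in os.walk(top):
            if module_file in filenames:
                # Resolve symlinks so the same file is reported only once.
                path = os.path.realpath(os.path.join(dirpath, module_file))
                if path not in seen:
                    seen.add(path)
                    hits.append(path)
    return hits
```

One way to use it would be to pass the directories printed by `perl -e 'print join("\n", @INC)'` together with the MAKER install directory, then compare the list against the path reported by the `maker --debug` command.
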
As for the error, it's just "Segmentation fault"! And we know this segfault is because of forks.pm, because if you remove the "use forks;" line script execution continues without segfault (till it crashes later for another reason, of course). In fact, even if you create a script with just the line "use forks;" and try to run it, you'll get a segfault. So it looks like it's something pretty general and serious, and I'm really surprised I can't find anything by googling (except your fix!)... On Mon, Jul 14, 2014 at 4:49 PM, Carson Holt wrote: > Also one more question. What is the exact error text you get for the > forks error? Is it a forks.pm error or is it an MPI warn on fork error > (which are actually very different). > > --Carson > > > From: Carson Holt > Date: Monday, July 14, 2014 at 8:46 AM > To: Panos Ioannidis , Daniel Ence < > dence at genetics.utah.edu> > > Cc: maker-devel > Subject: Re: [maker-devel] (no subject) > > If you do the BLAST's yourself the results could be dramatically worse. > The filtering and polishing done by MAKER is rather significant (direct > BLAST is actually worse with homology searches than many people realize). > > With respect to forks.pm, your admin most likely edited the wrong forks.pm. > There may be more than one on your system. If you let maker install some > prerequisites for you (because it requires a specific version of forks.pm), > it may be in .../maker/perl/lib/forks.pm. Otherwise you have to identify > the exact location of the forks.pm being used. Or if he is editing it as > part of the install tarball, his edits may actually be undone during the > installation procedure. > > Use this command line to identify the location of the forks.pm module > that would have to be edited --> > maker --debug 2>&1 | grep "forks.pm" > > You can even send me a copy of the file once it has been edited, and I can > tell you if it was done correctly. 
> > --Carson > > > > > From: Panos Ioannidis > Date: Monday, July 14, 2014 at 1:20 AM > To: Daniel Ence > Cc: maker-devel > Subject: Re: [maker-devel] (no subject) > > Daniel, thanks for the info. > > Regarding (3), the only reason I think of running BLASTs separately is > because I'm currently not able to run Maker on our cluster due to a problem > in the Perl "forks" library. And it looks like there isn't much I can do > about it; I tried Perlbrew but it doesn't work when I try to install > versions <5.18 (the problem in forks occurs on 5.18 and later versions). > Our admin also tried to change the code in the forks.pm file as per > Carson's suggestion in another thread, but that didn't work either... As a > result I'm running Maker on my workstation (really slooow) till a solution > is found and since BLAST is a time-consuming step I was thinking of running > it separately. > > > On Fri, Jul 11, 2014 at 4:08 PM, Daniel Ence > wrote: > >> Hi Panos, >> >> 1) You'll only use est2genome and protein2genome for creating models that >> will be used for training the ab-initio predictors (like SNAP). Sometimes >> that means one run of MAKER for training; sometimes that means two runs of >> MAKER. You usually don't gain any accuracy after the second round of >> training. It's ok to use both EST and protein data for this training step. >> >> 2) If you're using both ESTs and protein sequence to train your ab-initio >> predictors, then both est2genome and protein2genome should be set to 1. >> >> 3) If you want to pass Blast results to MAKER, you'll need to pass those >> results as GFF3, but MAKER will install and run blast for you, and does a >> good job of keeping track of all those results and making them accessible >> to you in the end, so it's going to be a lot of work to do those blasts on >> your own outside of MAKER. I seriously suggest that you use blast internal >> to maker. 
>> >> Daniel Ence >> Graduate Student >> Eccles Institute of Human Genetics >> University of Utah >> 15 North 2030 East, Room 2100 >> Salt Lake City, UT 84112-5330 >> ------------------------------ >> *From:* maker-devel [maker-devel-bounces at yandell-lab.org] on behalf of >> Panos Ioannidis [panos.ioannidis at gmail.com] >> *Sent:* Friday, July 11, 2014 5:56 AM >> *To:* maker-devel >> *Subject:* [maker-devel] (no subject) >> >> I got back to my annotations this past week and have a couple of >> questions! >> >> 1) Since my organism isn't closely related with any other that's already >> sequenced, I will have to run maker twice (according to the tutorial). So >> for the first run I see that some people use only the ESTs and some others >> use ESTs and a protein database (CEGMA, Uniref50, Swiss-Prot, etc). I guess >> that the ESTs will give better models, but for the cases where genes aren't >> covered by an EST, it's okay to have a protein database to detect them as >> well. Am I right? What do you think? >> >> 2) In case I use both ESTs and a protein database how should I set the >> est2genome and protein2genome parameters in the maker_opts.ctl file? >> Should they both equal to "1"? >> >> 3) I've been thinking of running the BLAST searches separately and giving >> Maker directly the results. I guess that in this case, I'll have to first >> convert the BLAST output to a gff3 file and give it to the protein_gff >> parameter, right? >> >> Thanks, >> Panos >> > > _______________________________________________ maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Tue Jul 15 07:58:20 2014 From: carsonhh at gmail.com (Carson Holt) Date: Tue, 15 Jul 2014 07:58:20 -0600 Subject: [maker-devel] (no subject) In-Reply-To: References: Message-ID: If you are getting a segfault. 
It is more likely an MPI error especially if you are using OpenMPI or MVAPICH2. They both use OpenFabrics libraries that have bugs on forks and system calls. If it is OpenMPI, run the following command before launching MAKER --> export LD_PRELOAD=?/openmpi_location/lib/libmpi.so Make sure to set replace openmpi_location with the location of your OpenMPI. Also add the following to your MPI command before running MAKER. --> -mca btl ^openib Example --> mpiexec -mca btl ^openib -n 40 maker If you are using MVAPICH2, then you need to switch to OpenMPI. --Carson From: Panos Ioannidis Date: Tuesday, July 15, 2014 at 12:59 AM To: Carson Holt Cc: Daniel Ence , maker-devel Subject: Re: [maker-devel] (no subject) I didn't know there are more than one forks.pm files! We'll give it another try later today. As for the error, it's just "Segmentation fault"! And we know this segfault is because of forks.pm , because if you remove the "use forks;" line script execution continues without segfault (till it crashes later for another reason, of course). In fact, even if you create a script with just the line "use forks;" and try to run it, you'll get a segfault. So it looks like it's something pretty general and serious, and I'm really surprised I can't find anything by googling (except your fix!)... On Mon, Jul 14, 2014 at 4:49 PM, Carson Holt wrote: > Also one more question. What is the exact error text you get for the forks > error? Is it a forks.pm error or is it an MPI warn on fork > error (which are actually very different). > > --Carson > > > From: Carson Holt > Date: Monday, July 14, 2014 at 8:46 AM > To: Panos Ioannidis , Daniel Ence > > > Cc: maker-devel > Subject: Re: [maker-devel] (no subject) > > If you do the BLAST's yourself the results could be dramatically worse. The > filtering and polishing done by MAKER is rather significant (direct BLAST is > actually worse with homology searches than many people realize). 
> > With respect to forks.pm , your admin most likely edited the > wrong forks.pm . There may be more than one on your system. > If you let maker install some prerequisites for you (because it requires a > specific version of forks.pm ), it may be in > .../maker/perl/lib/forks.pm . Otherwise you have to > identify the exact location of the forks.pm being used. Or > if he is editing it as part of the install tarball, his edits may actually be > undone during the installation procedure. > > Use this command line to identify the location of the forks.pm > module that would have to be edited --> > maker --debug 2>&1 | grep "forks.pm " > > You can even send me a copy of the file once it has been edited, and I can > tell you if it was done correctly. > > --Carson > > > > > From: Panos Ioannidis > Date: Monday, July 14, 2014 at 1:20 AM > To: Daniel Ence > Cc: maker-devel > Subject: Re: [maker-devel] (no subject) > > Daniel, thanks for the info. > > Regarding (3), the only reason I think of running BLASTs separately is because > I'm currently not able to run Maker on our cluster due to a problem in the > Perl "forks" library. And it looks like there isn't much I can do about it; I > tried Perlbrew but it doesn't work when I try to install versions <5.18 (the > problem in forks occurs on 5.18 and later versions). Our admin also tried to > change the code in the forks.pm file as per Carson's > suggestion in another thread, but that didn't work either... As a result I'm > running Maker on my workstation (really slooow) till a solution is found and > since BLAST is a time-consuming step I was thinking of running it separately. > > > On Fri, Jul 11, 2014 at 4:08 PM, Daniel Ence wrote: >> Hi Panos, >> >> 1) You'll only use est2genome and protein2genome for creating models that >> will be used for training the ab-initio predictors (like SNAP). Sometimes >> that means one run of MAKER for training; sometimes that means two runs of >> MAKER. 
>> You usually don't gain any accuracy after the second round of training. It's
>> ok to use both EST and protein data for this training step.
>>
>> 2) If you're using both ESTs and protein sequence to train your ab-initio
>> predictors, then both est2genome and protein2genome should be set to 1.
>>
>> 3) If you want to pass BLAST results to MAKER, you'll need to pass those
>> results as GFF3, but MAKER will install and run BLAST for you, and does a
>> good job of keeping track of all those results and making them accessible to
>> you in the end, so it's going to be a lot of work to do those BLASTs on your
>> own outside of MAKER. I seriously suggest that you use BLAST internal to
>> MAKER.
>>
>> Daniel Ence
>> Graduate Student
>> Eccles Institute of Human Genetics
>> University of Utah
>> 15 North 2030 East, Room 2100
>> Salt Lake City, UT 84112-5330
>>
>> From: maker-devel [maker-devel-bounces at yandell-lab.org] on behalf of Panos
>> Ioannidis [panos.ioannidis at gmail.com]
>> Sent: Friday, July 11, 2014 5:56 AM
>> To: maker-devel
>> Subject: [maker-devel] (no subject)
>>
>> I got back to my annotations this past week and have a couple of questions!
>>
>> 1) Since my organism isn't closely related to any other that's already
>> sequenced, I will have to run maker twice (according to the tutorial). So for
>> the first run I see that some people use only the ESTs and some others use
>> ESTs and a protein database (CEGMA, Uniref50, Swiss-Prot, etc). I guess that
>> the ESTs will give better models, but for the cases where genes aren't
>> covered by an EST, it's okay to have a protein database to detect them as
>> well. Am I right? What do you think?
>>
>> 2) In case I use both ESTs and a protein database, how should I set the
>> est2genome and protein2genome parameters in the maker_opts.ctl file? Should
>> they both be set to "1"?
>>
>> 3) I've been thinking of running the BLAST searches separately and giving
>> Maker the results directly.
>> I guess that in this case, I'll have to first convert the BLAST output to a
>> gff3 file and give it to the protein_gff parameter, right?
>>
>> Thanks,
>> Panos
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From panos.ioannidis at gmail.com Tue Jul 15 08:03:12 2014
From: panos.ioannidis at gmail.com (Panos Ioannidis)
Date: Tue, 15 Jul 2014 16:03:12 +0200
Subject: [maker-devel] (no subject)

Carson, many thanks for the info!

I haven't installed Maker with MPI support. Does this segfault only occur when Maker is installed with MPI support?

From carsonhh at gmail.com Tue Jul 15 08:10:24 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 15 Jul 2014 08:10:24 -0600
Subject: [maker-devel] (no subject)

If you don't have MPI support, it's not an issue, and your segfault is likely something else. The Perl 5.18/forks.pm problem you refer to should not produce a segfault either, so it would not explain your error. The Perl 5.18/forks.pm problem is a different issue, where perl actually tells itself to die because hash reshuffling isn't safe, whereas segfaults are caused by binary corruption or incorrect memory access (very different issues). I'd actually recommend a full perl reinstall if you are getting segfaults, because it suggests a deeper issue.

--Carson

From panos.ioannidis at gmail.com Wed Jul 16 06:26:56 2014
From: panos.ioannidis at gmail.com (Panos Ioannidis)
Date: Wed, 16 Jul 2014 14:26:56 +0200
Subject: [maker-devel] (no subject)

Thanks for the help guys.
All really helpful!

Unfortunately, a full Perl reinstall is not an option at this moment on this machine, since it's used by others in our group and they need it running... I'll try to find another solution with our admin.

From carsonhh at gmail.com Wed Jul 16 08:04:55 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 16 Jul 2014 08:04:55 -0600
Subject: [maker-devel] (no subject)

You don't have to do a system-wide install. It is incredibly easy to have multiple installations of Perl. Perlbrew, for example, makes it easy to install and switch between multiple versions rapidly (and doesn't affect the system install) --> http://perlbrew.pl

You can then test. The perl installation used by different programs is determined by the '#!' header in the executable script and not by the default location of your system's perl (look at the first line in .../maker/bin/maker and you will see what I mean). This value gets set during the initial installation, and whatever perl path you use to run MAKER's Build.PL script will end up being the one used to run MAKER, even if the system perl is different.

--Carson
From nguyenan at mail.nih.gov Wed Jul 16 11:15:10 2014 From: nguyenan at mail.nih.gov (Nguyen, Anh-Dao (NIH/NHGRI) [C]) Date: Wed, 16 Jul 2014 17:15:10 +0000 Subject: [maker-devel] Maker_opts.ctl Message-ID: Hi, I would like to conduct a genome annotation and have the following data: - Two separate RepeatMasker outputs (using -lib and -species options) - ESTs and RACE (fasta) - proteins (fasta) - proteins of related organisms (fasta) - SNAP's .hmm file (ran CEGMA, then used cegma2zff.pl to convert to ZFF format, etc. ) - GeneMark's .hmm file (es.mod file from running gm_es.pl) - FGENESH++ and Augustus gene predictions. I wrote scripts to convert the outputs to .gff3 files. The reason why I ran Augustus gene prediction separately, because the genome has never been trained for Augustus. - Cufflinks and Trinity from RNA-Seq Could you please let me know how can I specify parameters in the maker_opts.ctl file? Or do you have other suggestions to re-do the data listed above? Thanks. Anh-Dao From dence at genetics.utah.edu Wed Jul 16 12:13:46 2014 From: dence at genetics.utah.edu (Daniel Ence) Date: Wed, 16 Jul 2014 12:13:46 -0600 Subject: [maker-devel] Maker_opts.ctl In-Reply-To: References: Message-ID: <5B00AEBB-D242-4C7C-A40F-C1BF3EC48C96@genetics.utah.edu> Hi Anh-Dao, In the maker_opts.ctl file, there are options for est and protein evidence. You'll put all of your fasta est files together in a comma separated list in the "est" option, and all of your fasta protein files in a comma separated list for the "protein" option. You'll specify the SNAP and Genemark files in their respective options in the control file and pass the augustus and fgenesh predictions in the "pred_gff" option. If you have the RepeatMasker output in gff3 format you can give it to maker with the "rm_gff" option. If you've converted the cufflinks output to gff3, you can give it to maker with the "est_gff" option.
I'm pretty sure Trinity only gives fasta output, so you would put that in the "est" option, along with all the other est fasta files. If Augustus isn't trained for your particular organism, then you can use another organism that augustus is already trained for. The list of species that augustus has parameter files for is in the README.txt that came with Augustus. I really recommend that you run Augustus from inside maker, because then you get all the benefits of maker passing ext-based hints to augustus at runtime, which can really improve Augustus' predictive ability. When you ran the augustus gene prediction separately, did you use another organism's parameter file? Thanks, Daniel On Jul 16, 2014, at 11:15 AM, Nguyen, Anh-Dao (NIH/NHGRI) [C] wrote: > Hi, > > I would like to conduct a genome annotation and have the following data: > - Two separate RepeatMasker outputs (using -lib and -species options) > - ESTs and RACE (fasta) > - proteins (fasta) > - proteins of related organisms (fasta) > - SNAP's .hmm file (ran CEGMA, then used cegma2zff.pl to convert to ZFF format, etc. ) > - GeneMark's .hmm file (es.mod file from running gm_es.pl) > - FGENESH++ and Augustus gene predictions. I wrote scripts to convert the outputs to .gff3 files. The reason why I ran Augustus gene prediction separately, because the genome has never been trained for Augustus. > - Cufflinks and Trinity from RNA-Seq > > Could you please let me know how can I specify parameters in the maker_opts.ctl file? > Or do you have other suggestions to re-do the data listed above? > > Thanks.
> Anh-Dao > > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org From nguyenan at mail.nih.gov Wed Jul 16 12:30:10 2014 From: nguyenan at mail.nih.gov (Nguyen, Anh-Dao (NIH/NHGRI) [C]) Date: Wed, 16 Jul 2014 18:30:10 +0000 Subject: [maker-devel] Maker_opts.ctl In-Reply-To: <5B00AEBB-D242-4C7C-A40F-C1BF3EC48C96@genetics.utah.edu> References: <5B00AEBB-D242-4C7C-A40F-C1BF3EC48C96@genetics.utah.edu> Message-ID: Thanks Daniel for your quick response. I did not use the parameter file of other organism when running Augustus. I created the parameter file for the genome following their instructions. There were multiple steps to train and run Augustus (Creating gene structures for training AUGUSTUS with CEGMA => parameter file will be created; Creating Hints for AUGUSTUS from ESTs/cDNA sequences; Incorporating Illumina RNAseq into AUGUSTUS with GSNAP, etc.) As I mentioned the reason why I ran Augustus separately, because Augustus has not trained that genome (no parameter file exists). Otherwise I would run Augustus inside MAKER. You suggested to use rm_gff option to specify RepeatMasker output (sure I will convert them to .gff3 formatted files). Can I submit two RM .gff3 files, separated by comma? Anh-Dao On 7/16/14 2:13 PM, "Daniel Ence" wrote: >Hi Anh-Dao, > >In the maker_opts.ctl file, there are options for est and protein >evidence. You'll put all of your fasta est files together in a comma >separated list in the "est" option, and all of your fasta protein files >in a comma separated list for the "protein" option. > >You'll specify the SNAP and Genemark files in their respective options in >the control file and pass the augustus and fgenesh predictions in the >"pred_gff" option. > >If you have the RepeatMasker output in gff3 format you can give it to >maker with the "rm_gff" option.
> >If you've converted the cufflinks output to gff3, you can give it to >maker with the "est_gff" option. I'm pretty sure Trinity only gives fasta >output, so you would put that in the "est" option, along with all the >other est fasta files. > >If Augustus isn't trained for your particular organism, then you can use >another organism that augustus is already trained for. The list of >species that augustus has parameter files for is in the README.txt that >came with Augustus. I really recommend that you run Augustus from inside >maker, because then you get all the benefits of maker passing ext-based >hints to augustus at runtime, which can really improve Augustus' >predictive ability. > >When you ran the augustus gene prediction separately, did you use another >organism's parameter file? > >Thanks, >Daniel > > >On Jul 16, 2014, at 11:15 AM, Nguyen, Anh-Dao (NIH/NHGRI) [C] > wrote: > >> Hi, >> >> I would like to conduct a genome annotation and have the following data: >> - Two separate RepeatMasker outputs (using -lib and -species options) >> - ESTs and RACE (fasta) >> - proteins (fasta) >> - proteins of related organisms (fasta) >> - SNAP's .hmm file (ran CEGMA, then used cegma2zff.pl to convert to ZFF >>format, etc. ) >> - GeneMark's .hmm file (es.mod file from running gm_es.pl) >> - FGENESH++ and Augustus gene predictions. I wrote scripts to convert >>the outputs to .gff3 files. The reason why I ran Augustus gene >>prediction separately, because the genome has never been trained for >>Augustus. >> - Cufflinks and Trinity from RNA-Seq >> >> Could you please let me know how can I specify parameters in the >>maker_opts.ctl file? >> Or do you have other suggestions to re-do the data listed above? >> >> Thanks.
>> Anh-Dao >> >> _______________________________________________ >> maker-devel mailing list >> maker-devel at box290.bluehost.com >> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > From carsonhh at gmail.com Wed Jul 16 12:36:57 2014 From: carsonhh at gmail.com (Carson Holt) Date: Wed, 16 Jul 2014 12:36:57 -0600 Subject: [maker-devel] Maker_opts.ctl In-Reply-To: References: <5B00AEBB-D242-4C7C-A40F-C1BF3EC48C96@genetics.utah.edu> Message-ID: When you ran Augustus separately, it should have created the parameters needed to run it. Now you should be able to run it inside of MAKER using the species name you just created. I'd also recommend letting MAKER run RepeatMasker for you rather than giving it the results as GFF3. --Carson On 7/16/14, 12:30 PM, "Nguyen, Anh-Dao (NIH/NHGRI) [C]" wrote: >Thanks Daniel for your quick response. > >I did not use the parameter file of other organism when running Augustus. >I created the parameter file for the genome following their instructions. >There were multiple steps to train and run Augustus (Creating gene >structures for training AUGUSTUS with CEGMA => parameter file will be >created; Creating Hints for AUGUSTUS from ESTs/cDNA sequences; >Incorporating Illumina RNAseq into AUGUSTUS with GSNAP, etc.) >As I mentioned the reason why I ran Augustus separately, because Augustus >has not trained that genome (no parameter file exists). Otherwise I would >run Augustus inside MAKER. > >You suggested to use rm_gff option to specify RepeatMasker output (sure I >will convert them to .gff3 formatted files). Can I submit two RM .gff3 >files, separated by comma? > >Anh-Dao > > >On 7/16/14 2:13 PM, "Daniel Ence" wrote: > >>Hi Anh-Dao, >> >>In the maker_opts.ctl file, there are options for est and protein >>evidence. You'll put all of your fasta est files together in a comma >>separated list in the "est" option, and all of your fasta protein files >>in a comma separated list for the "protein" option.
>> >>You'll specify the SNAP and Genemark files in their respective options in >>the control file and pass the augustus and fgenesh predictions in the >>"pred_gff" option. >> >>If you have the RepeatMasker output in gff3 format you can give it to >>maker with the "rm_gff" option. >> >>If you've converted the cufflinks output to gff3, you can give it to >>maker with the "est_gff" option. I'm pretty sure Trinity only gives fasta >>output, so you would put that in the "est" option, along with all the >>other est fasta files. >> >>If Augustus isn't trained for your particular organism, then you can use >>another organism that augustus is already trained for. The list of >>species that augustus has parameter files for is in the README.txt that >>came with Augustus. I really recommend that you run Augustus from inside >>maker, because then you get all the benefits of maker passing ext-based >>hints to augustus at runtime, which can really improve Augustus' >>predictive ability. >> >>When you ran the augustus gene prediction separately, did you use another >>organism's parameter file? >> >>Thanks, >>Daniel >> >> >>On Jul 16, 2014, at 11:15 AM, Nguyen, Anh-Dao (NIH/NHGRI) [C] >> wrote: >> >>> Hi, >>> >>> I would like to conduct a genome annotation and have the following >>>data: >>> - Two separate RepeatMasker outputs (using -lib and -species options) >>> - ESTs and RACE (fasta) >>> - proteins (fasta) >>> - proteins of related organisms (fasta) >>> - SNAP's .hmm file (ran CEGMA, then used cegma2zff.pl to convert to ZFF >>>format, etc. ) >>> - GeneMark's .hmm file (es.mod file from running gm_es.pl) >>> - FGENESH++ and Augustus gene predictions. I wrote scripts to convert >>>the outputs to .gff3 files. The reason why I ran Augustus gene >>>prediction separately, because the genome has never been trained for >>>Augustus. >>> - Cufflinks and Trinity from RNA-Seq >>> >>> Could you please let me know how can I specify parameters in the >>>maker_opts.ctl file?
>>> Or do you have other suggestions to re-do the data listed above? >>> >>> Thanks. >>> Anh-Dao >>> >>> _______________________________________________ >>> maker-devel mailing list >>> maker-devel at box290.bluehost.com >>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org >> > > >_______________________________________________ >maker-devel mailing list >maker-devel at box290.bluehost.com >http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
From carsonhh at gmail.com Wed Jul 16 12:41:47 2014 From: carsonhh at gmail.com (Carson Holt) Date: Wed, 16 Jul 2014 12:41:47 -0600 Subject: [maker-devel] Maker_opts.ctl In-Reply-To: References: <5B00AEBB-D242-4C7C-A40F-C1BF3EC48C96@genetics.utah.edu> Message-ID: If you can provide me the command lines you used to train augustus, I can point you to the proper species parameters to give to MAKER. Normally these are the same as one of the directory names under .../augustus/config/species/. You can also let MAKER run FGENESH for you. Either way you can pass it in as GFF3, but if you let MAKER run it for you then MAKER can "talk" to the predictor by giving it evidence based hints as it is running. This improves the overall performance of the algorithm compared to running it outside of MAKER. Thanks, Carson On 7/16/14, 12:36 PM, "Carson Holt" wrote: >When you ran Augustus separately, it should have created the parameters >needed to run it. Now you should be able to run it inside of MAKER using >the species name you just created.
> >I'd also recommend letting MAKER run RepeatMasker for you rather than >giving it the results as GFF3. > >--Carson > > >On 7/16/14, 12:30 PM, "Nguyen, Anh-Dao (NIH/NHGRI) [C]" > wrote: > >>Thanks Daniel for your quick response. >> >>I did not use the parameter file of other organism when running Augustus. >>I created the parameter file for the genome following their instructions. >>There were multiple steps to train and run Augustus (Creating gene >>structures for training AUGUSTUS with CEGMA => parameter file will be >>created; Creating Hints for AUGUSTUS from ESTs/cDNA sequences; >>Incorporating Illumina RNAseq into AUGUSTUS with GSNAP, etc.) >>As I mentioned the reason why I ran Augustus separately, because Augustus >>has not trained that genome (no parameter file exists). Otherwise I would >>run Augustus inside MAKER. >> >>You suggested to use rm_gff option to specify RepeatMasker output (sure I >>will convert them to .gff3 formatted files). Can I submit two RM .gff3 >>files, separated by comma? >> >>Anh-Dao >> >> >>On 7/16/14 2:13 PM, "Daniel Ence" wrote: >> >>>Hi Anh-Dao, >>> >>>In the maker_opts.ctl file, there are options for est and protein >>>evidence. You'll put all of your fasta est files together in a comma >>>separated list in the "est" option, and all of your fasta protein files >>>in a comma separated list for the "protein" option. >>> >>>You'll specify the SNAP and Genemark files in their respective options >>>in >>>the control file and pass the augustus and fgenesh predictions in the >>>"pred_gff" option. >>> >>>If you have the RepeatMasker output in gff3 format you can give it to >>>maker with the "rm_gff" option. >>> >>>If you've converted the cufflinks output to gff3, you can give it to >>>maker with the "est_gff" option. I'm pretty sure Trinity only gives >>>fasta >>>output, so you would put that in the "est" option, along with all the >>>other est fasta files.
>>> >>>If Augustus isn't trained for your particular organism, then you can use >>>another organism that augustus is already trained for. The list of >>>species that augustus has parameter files for is in the README.txt that >>>came with Augustus. I really recommend that you run Augustus from inside >>>maker, because then you get all the benefits of maker passing ext-based >>>hints to augustus at runtime, which can really improve Augustus' >>>predictive ability. >>> >>>When you ran the augustus gene prediction separately, did you use >>>another >>>organism's parameter file? >>> >>>Thanks, >>>Daniel >>> >>> >>>On Jul 16, 2014, at 11:15 AM, Nguyen, Anh-Dao (NIH/NHGRI) [C] >>> wrote: >>> >>>> Hi, >>>> >>>> I would like to conduct a genome annotation and have the following >>>>data: >>>> - Two separate RepeatMasker outputs (using -lib and -species options) >>>> - ESTs and RACE (fasta) >>>> - proteins (fasta) >>>> - proteins of related organisms (fasta) >>>> - SNAP's .hmm file (ran CEGMA, then used cegma2zff.pl to convert to >>>>ZFF >>>>format, etc. ) >>>> - GeneMark's .hmm file (es.mod file from running gm_es.pl) >>>> - FGENESH++ and Augustus gene predictions. I wrote scripts to convert >>>>the outputs to .gff3 files. The reason why I ran Augustus gene >>>>prediction separately, because the genome has never been trained for >>>>Augustus. >>>> - Cufflinks and Trinity from RNA-Seq >>>> >>>> Could you please let me know how can I specify parameters in the >>>>maker_opts.ctl file? >>>> Or do you have other suggestions to re-do the data listed above? >>>> >>>> Thanks.
>>>> Anh-Dao >>>> >>>> _______________________________________________ >>>> maker-devel mailing list >>>> maker-devel at box290.bluehost.com >>>> >>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org >>> >> >> >>_______________________________________________ >>maker-devel mailing list >>maker-devel at box290.bluehost.com >>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > From dence at genetics.utah.edu Wed Jul 16 12:42:16 2014 From: dence at genetics.utah.edu (Daniel Ence) Date: Wed, 16 Jul 2014 12:42:16 -0600 Subject: [maker-devel] Maker_opts.ctl In-Reply-To: References: <5B00AEBB-D242-4C7C-A40F-C1BF3EC48C96@genetics.utah.edu> Message-ID: Hi Anh-Dao, so as I understand it, the process of training and running augustus will create a set of "param" files that Augustus can use later on. If that's true, then you can just copy those files to the "config/species" folder of your augustus installation and then augustus (when you call it from inside maker) can use those parameters when it runs. Did you end up with a gff3 file or with files like "exon_prob", "utr_probs" from augustus? Or did you have both? I'm pretty sure that you can't use a comma-separated list for the rm_gff. You could concatenate the two files and then pass the one file to maker, but you also might need to have it sorted by genomic location. Carson could confirm that for me. ~Daniel On Jul 16, 2014, at 12:30 PM, Nguyen, Anh-Dao (NIH/NHGRI) [C] wrote: > Thanks Daniel for your quick response. > > I did not use the parameter file of other organism when running Augustus. > I created the parameter file for the genome following their instructions. > There were multiple steps to train and run Augustus (Creating gene > structures for training AUGUSTUS with CEGMA => parameter file will be > created; Creating Hints for AUGUSTUS from ESTs/cDNA sequences; > Incorporating Illumina RNAseq into AUGUSTUS with GSNAP, etc.)
> As I mentioned the reason why I ran Augustus separately, because Augustus > has not trained that genome (no parameter file exists). Otherwise I would > run Augustus inside MAKER. > > You suggested to use rm_gff option to specify RepeatMasker output (sure I > will convert them to .gff3 formatted files). Can I submit two RM .gff3 > files, separated by comma? > > Anh-Dao > > > On 7/16/14 2:13 PM, "Daniel Ence" wrote: > >> Hi Anh-Dao, >> >> In the maker_opts.ctl file, there are options for est and protein >> evidence. You'll put all of your fasta est files together in a comma >> separated list in the "est" option, and all of your fasta protein files >> in a comma separated list for the "protein" option. >> >> You'll specify the SNAP and Genemark files in their respective options in >> the control file and pass the augustus and fgenesh predictions in the >> "pred_gff" option. >> >> If you have the RepeatMasker output in gff3 format you can give it to >> maker with the "rm_gff" option. >> >> If you've converted the cufflinks output to gff3, you can give it to >> maker with the "est_gff" option. I'm pretty sure Trinity only gives fasta >> output, so you would put that in the "est" option, along with all the >> other est fasta files. >> >> If Augustus isn't trained for your particular organism, then you can use >> another organism that augustus is already trained for. The list of >> species that augustus has parameter files for is in the README.txt that >> came with Augustus. I really recommend that you run Augustus from inside >> maker, because then you get all the benefits of maker passing ext-based >> hints to augustus at runtime, which can really improve Augustus' >> predictive ability. >> >> When you ran the augustus gene prediction separately, did you use another >> organism's parameter file?
>> >> Thanks, >> Daniel >> >> >> On Jul 16, 2014, at 11:15 AM, Nguyen, Anh-Dao (NIH/NHGRI) [C] >> wrote: >> >>> Hi, >>> >>> I would like to conduct a genome annotation and have the following data: >>> - Two separate RepeatMasker outputs (using -lib and -species options) >>> - ESTs and RACE (fasta) >>> - proteins (fasta) >>> - proteins of related organisms (fasta) >>> - SNAP's .hmm file (ran CEGMA, then used cegma2zff.pl to convert to ZFF >>> format, etc. ) >>> - GeneMark's .hmm file (es.mod file from running gm_es.pl) >>> - FGENESH++ and Augustus gene predictions. I wrote scripts to convert >>> the outputs to .gff3 files. The reason why I ran Augustus gene >>> prediction separately, because the genome has never been trained for >>> Augustus. >>> - Cufflinks and Trinity from RNA-Seq >>> >>> Could you please let me know how can I specify parameters in the >>> maker_opts.ctl file? >>> Or do you have other suggestions to re-do the data listed above? >>> >>> Thanks. >>> Anh-Dao >>> >>> _______________________________________________ >>> maker-devel mailing list >>> maker-devel at box290.bluehost.com >>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org >> > From carsonhh at gmail.com Wed Jul 16 12:43:33 2014 From: carsonhh at gmail.com (Carson Holt) Date: Wed, 16 Jul 2014 12:43:33 -0600 Subject: [maker-devel] Maker_opts.ctl In-Reply-To: References: <5B00AEBB-D242-4C7C-A40F-C1BF3EC48C96@genetics.utah.edu> Message-ID: You can use comma separated lists. --Carson On 7/16/14, 12:42 PM, "Daniel Ence" wrote: >Hi Anh-Dao, so as I understand it, the process of training and running >augustus will create a set of "param" files that Augustus can use later >on. If that's true, then you can just copy those files to the >"config/species" folder of your augustus installation and then augustus >(when you call it from inside maker) can use those parameters when it >runs. > >Did you end up with a gff3 file or with files like "exon_prob", >"utr_probs" from augustus?
Or did you have both? > >I'm pretty sure that you can't use a comma-separated list for the rm_gff. >You could concatenate the two files and then pass the one file to maker, >but you also might need to have it sorted by genomic location. Carson >could confirm that for me. > >~Daniel > > >On Jul 16, 2014, at 12:30 PM, Nguyen, Anh-Dao (NIH/NHGRI) [C] > wrote: > >> Thanks Daniel for your quick response. >> >> I did not use the parameter file of other organism when running >>Augustus. >> I created the parameter file for the genome following their >>instructions. >> There were multiple steps to train and run Augustus (Creating gene >> structures for training AUGUSTUS with CEGMA => parameter file will be >> created; Creating Hints for AUGUSTUS from ESTs/cDNA sequences; >> Incorporating Illumina RNAseq into AUGUSTUS with GSNAP, etc.) >> As I mentioned the reason why I ran Augustus separately, because >>Augustus >> has not trained that genome (no parameter file exists). Otherwise I >>would >> run Augustus inside MAKER. >> >> You suggested to use rm_gff option to specify RepeatMasker output (sure >>I >> will convert them to .gff3 formatted files). Can I submit two RM .gff3 >> files, separated by comma? >> >> Anh-Dao >> >> >> On 7/16/14 2:13 PM, "Daniel Ence" wrote: >> >>> Hi Anh-Dao, >>> >>> In the maker_opts.ctl file, there are options for est and protein >>> evidence. You'll put all of your fasta est files together in a comma >>> separated list in the "est" option, and all of your fasta protein files >>> in a comma separated list for the "protein" option. >>> >>> You'll specify the SNAP and Genemark files in their respective options >>>in >>> the control file and pass the augustus and fgenesh predictions in the >>> "pred_gff" option. >>> >>> If you have the RepeatMasker output in gff3 format you can give it to >>> maker with the "rm_gff" option. >>> >>> If you've converted the cufflinks output to gff3, you can give it to >>> maker with the "est_gff" option.
I'm pretty sure Trinity only gives >>>fasta >>> output, so you would put that in the "est" option, along with all the >>> other est fasta files. >>> >>> If Augustus isn't trained for your particular organism, then you can >>>use >>> another organism that augustus is already trained for. The list of >>> species that augustus has parameter files for is in the README.txt that >>> came with Augustus. I really recommend that you run Augustus from >>>inside >>> maker, because then you get all the benefits of maker passing ext-based >>> hints to augustus at runtime, which can really improve Augustus' >>> predictive ability. >>> >>> When you ran the augustus gene prediction separately, did you use >>>another >>> organism's parameter file? >>> >>> Thanks, >>> Daniel >>> >>> >>> On Jul 16, 2014, at 11:15 AM, Nguyen, Anh-Dao (NIH/NHGRI) [C] >>> wrote: >>> >>>> Hi, >>>> >>>> I would like to conduct a genome annotation and have the following >>>>data: >>>> - Two separate RepeatMasker outputs (using -lib and -species options) >>>> - ESTs and RACE (fasta) >>>> - proteins (fasta) >>>> - proteins of related organisms (fasta) >>>> - SNAP's .hmm file (ran CEGMA, then used cegma2zff.pl to convert to >>>>ZFF >>>> format, etc. ) >>>> - GeneMark's .hmm file (es.mod file from running gm_es.pl) >>>> - FGENESH++ and Augustus gene predictions. I wrote scripts to convert >>>> the outputs to .gff3 files. The reason why I ran Augustus gene >>>> prediction separately, because the genome has never been trained for >>>> Augustus. >>>> - Cufflinks and Trinity from RNA-Seq >>>> >>>> Could you please let me know how can I specify parameters in the >>>> maker_opts.ctl file? >>>> Or do you have other suggestions to re-do the data listed above? >>>> >>>> Thanks.
From nguyenan at mail.nih.gov Wed Jul 16 13:07:45 2014
From: nguyenan at mail.nih.gov (Nguyen, Anh-Dao (NIH/NHGRI) [C])
Date: Wed, 16 Jul 2014 19:07:45 +0000
Subject: [maker-devel] Maker_opts.ctl
In-Reply-To:
References: <5B00AEBB-D242-4C7C-A40F-C1BF3EC48C96@genetics.utah.edu>
Message-ID:

I will run Augustus and FGENESH++ inside of MAKER using the parameter
files for Augustus.
I could also run RepeatMasker inside of MAKER. However, I ran RM using two
options: -lib (de novo) and -species (known). I got ~ 45% repeats via de
novo and ~ 4% repeats via known options. As I understood, RM inside of
MAKER uses only RepBase repeat library and RepeatRunner protein database.

Anh-Dao

On 7/16/14 2:36 PM, "Carson Holt" wrote:

>When you ran Augustus separately, it should have created the parameters
>needed to run it. Now you should be able to run it inside of MAKER using
>the species name you just created.
>
>I'd also recommend letting MAKER run RepeatMasker for you rather than
>giving it the results as GFF3.
>
>--Carson

From nguyenan at mail.nih.gov Wed Jul 16 13:16:43 2014
From: nguyenan at mail.nih.gov (Nguyen, Anh-Dao (NIH/NHGRI) [C])
Date: Wed, 16 Jul 2014 19:16:43 +0000
Subject: [maker-devel] Maker_opts.ctl
In-Reply-To:
References: <5B00AEBB-D242-4C7C-A40F-C1BF3EC48C96@genetics.utah.edu>
Message-ID:

I forgot to mention that I ran RepeatModeler on the genome first, then
passed its output to RepeatMasker using the -lib option (de novo).
For the -species option, I used metazoa.

Anh-Dao

On 7/16/14 3:07 PM, "Nguyen, Anh-Dao (NIH/NHGRI) [C]" wrote:

>I will run Augustus and FGENESH++ inside of MAKER using the parameter
>files for Augustus.
>I could also run RepeatMasker inside of MAKER. However, I ran RM using two
>options: -lib (de novo) and -species (known). I got ~ 45% repeats via de
>novo and ~ 4% repeats via known options. As I understood, RM inside of
>MAKER uses only RepBase repeat library and RepeatRunner protein database.
>
>Anh-Dao

From carsonhh at gmail.com Wed Jul 16 13:17:31 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 16 Jul 2014 13:17:31 -0600
Subject: [maker-devel] Maker_opts.ctl
In-Reply-To:
References: <5B00AEBB-D242-4C7C-A40F-C1BF3EC48C96@genetics.utah.edu>
Message-ID:

No. You can provide both to MAKER. The options are model_org= and rmlib=.
By letting MAKER handle repeat masking it will differentiate repeat types
and use soft masking for some and hard masking for others. This increases
sensitivity of evidence alignments while still maintaining specificity.

--Carson

On 7/16/14, 1:07 PM, "Nguyen, Anh-Dao (NIH/NHGRI) [C]" wrote:

>I will run Augustus and FGENESH++ inside of MAKER using the parameter
>files for Augustus.
>I could also run RepeatMasker inside of MAKER. However, I ran RM using two
>options: -lib (de novo) and -species (known). I got ~ 45% repeats via de
>novo and ~ 4% repeats via known options. As I understood, RM inside of
>MAKER uses only RepBase repeat library and RepeatRunner protein database.
>
>Anh-Dao

From nguyenan at mail.nih.gov Wed Jul 16 13:28:33 2014
From: nguyenan at mail.nih.gov (Nguyen, Anh-Dao (NIH/NHGRI) [C])
Date: Wed, 16 Jul 2014 19:28:33 +0000
Subject: [maker-devel] Maker_opts.ctl
In-Reply-To:
References: <5B00AEBB-D242-4C7C-A40F-C1BF3EC48C96@genetics.utah.edu>
Message-ID:

By default, model_org=all. Can I use the de novo repeat library predicted
by RepeatModeler for the rmlib option?

Anh-Dao

On 7/16/14 3:17 PM, "Carson Holt" wrote:

>No. You can provide both to MAKER. The options are model_org= and rmlib=.
>By letting MAKER handle repeat masking it will differentiate repeat types
>and use soft masking for some and hard masking for others. This increases
>sensitivity of evidence alignments while still maintaining specificity.
>
>--Carson

From carsonhh at gmail.com Wed Jul 16 13:32:02 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 16 Jul 2014 13:32:02 -0600
Subject: [maker-devel] Maker_opts.ctl
In-Reply-To:
References: <5B00AEBB-D242-4C7C-A40F-C1BF3EC48C96@genetics.utah.edu>
Message-ID:

'all' will use the whole of RepBase, or you can do 'metazoa' like your
previous run. Then provide the RepeatModeler file to rmlib=

--Carson

On 7/16/14, 1:28 PM, "Nguyen, Anh-Dao (NIH/NHGRI) [C]" wrote:

>By default, model_org=all. Can I use the de novo repeat library predicted
>by RepeatModeler for the rmlib option?
>
>Anh-Dao

From nguyenan at mail.nih.gov Thu Jul 17 08:19:34 2014
From: nguyenan at mail.nih.gov (Nguyen, Anh-Dao (NIH/NHGRI) [C])
Date: Thu, 17 Jul 2014 14:19:34 +0000
Subject: [maker-devel] Maker_opts.ctl
In-Reply-To:
References: <5B00AEBB-D242-4C7C-A40F-C1BF3EC48C96@genetics.utah.edu>
Message-ID:

I am not sure which fgenesh executable file I should use.

fgenesh= #location of fgenesh executable

When I run FGENESH++, I need to run the run_pipe.pl script. It also
requires specifying a list of other executable programs (such as ppd,
ppdn+, etc.).

Anh-Dao

On 7/16/14 3:32 PM, "Carson Holt" wrote:

>'all' will use the whole of RepBase, or you can do 'metazoa' like your
>previous run. Then provide the RepeatModeler file to rmlib=
>
>--Carson

>>>>>>>>o >>>>>>>>r >>>>>>>>g >>>>>>> >>>>>> >>>>>> >>>>>>_______________________________________________ >>>>>>maker-devel mailing list >>>>>>maker-devel at box290.bluehost.com >>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.o >>>>>>r >>>>>>g >>>>> >>>>> >>>> >>> >>> >> > > From carsonhh at gmail.com Fri Jul 18 11:04:09 2014 From: carsonhh at gmail.com (Carson Holt) Date: Fri, 18 Jul 2014 11:04:09 -0600 Subject: [maker-devel] Maker_opts.ctl In-Reply-To: References: <5B00AEBB-D242-4C7C-A40F-C1BF3EC48C96@genetics.utah.edu> Message-ID: It should just be 'fgenesh'. If it's not there you can still just give the GFF3. --Carson On 7/17/14, 8:19 AM, "Nguyen, Anh-Dao (NIH/NHGRI) [C]" wrote: >I am not sure which fgenesh executable file should I use. > >fgenesh= #location of fgenesh executable > >When I run FGENESH++, I need to run the run_pipe.pl script. Sure you need >to specify a list of other executable programs (such as ppd, ppdn+, etc) > >Anh-Dao > > >On 7/16/14 3:32 PM, "Carson Holt" wrote: > >>'all' will use the whole of RepBase, or you can do 'metazoa' like your >>previous run. Then provide the RepeatModeler file to rmlib= >> >>--Carson >> >> >> >>On 7/16/14, 1:28 PM, "Nguyen, Anh-Dao (NIH/NHGRI) [C]" >> wrote: >> >>>By default, model_org=all. Can I use the de novo repeat library >>>predicted >>>by RepeatModeler for the rmlib option? >>> >>>Anh-Dao >>> >>> >>> >>>On 7/16/14 3:17 PM, "Carson Holt" wrote: >>> >>>>No. You can provide both to MAKER. The options are model_org= and >>>>rmlib=. >>>> By letting MAKER handle repeat masking it will differentiate repeat >>>>types >>>>and use soft masking for some and hard masking for others. This >>>>increases >>>>sensitivity of evidence alignments while still maintaining specificity. >>>> >>>>--Carson >>>> >>>> >>>> >>>>On 7/16/14, 1:07 PM, "Nguyen, Anh-Dao (NIH/NHGRI) [C]" >>>> wrote: >>>> >>>>>I will run Augustus and FGENESH++ inside of MAKER using the parameter >>>>>files for Augustus. 
>>>>>I could also run RepeatMasker inside of MAKER. However, I ran RM using >>>>>two >>>>>options: -lib (de novo) and -species (known). I got ~ 45% repeats via >>>>>de >>>>>novo and ~ 4% repeats via known options. As I understood, RM inside of >>>>>MAKER uses only RepBase repeat library and RepeatRunner protein >>>>>database. >>>>> >>>>>Anh-Dao >>>>> >>>>> >>>>>On 7/16/14 2:36 PM, "Carson Holt" wrote: >>>>> >>>>>>When you ran Augustus separately, it should have created the >>>>>>parameters >>>>>>needed to run it. Now you should be able to run it inside of MAKER >>>>>>using >>>>>>the species name you just created. >>>>>> >>>>>>I'd also recommend letting MAKER run RepeatMasker for you rather than >>>>>>giving it the results as GFF3. >>>>>> >>>>>>--Carson >>>>>> >>>>>> >>>>>>On 7/16/14, 12:30 PM, "Nguyen, Anh-Dao (NIH/NHGRI) [C]" >>>>>> wrote: >>>>>> >>>>>>>Thanks Daniel for your quick response. >>>>>>> >>>>>>>I did not use the parameter file of other organism when running >>>>>>>Augustus. >>>>>>>I created the parameter file for the genome following their >>>>>>>instructions. >>>>>>>There were multiple steps to train and run Augustus (Creating gene >>>>>>>structures for training AUGUSTUS with CEGMA => parameter file will >>>>>>>be >>>>>>>created; Creating Hints for AUGUSTUS from ESTs/cDNA sequences; >>>>>>>Incorporating Illumina RNAseq into AUGUSTUS with GSNAP, etc.) >>>>>>>As I mentioned the reason why I ran Augustus separately, because >>>>>>>Augustus >>>>>>>has not trained that genome (no parameter file exists). Otherwise I >>>>>>>would >>>>>>>run Augustus inside MAKER. >>>>>>> >>>>>>>You suggested to use rm_gff option to specify RepeatMasker output >>>>>>>(sure >>>>>>>I >>>>>>>will convert them to .gff3 formatted files). Can I submit two RM >>>>>>>.gff3 >>>>>>>files, separated by comma? 
>>>>>>>
>>>>>>>Anh-Dao
>>>>>>>
>>>>>>>
>>>>>>>On 7/16/14 2:13 PM, "Daniel Ence" wrote:
>>>>>>>
>>>>>>>>Hi Anh-Dao,
>>>>>>>>
>>>>>>>>In the maker_opts.ctl file, there are options for est and protein
>>>>>>>>evidence. You'll put all of your fasta est files together in a
>>>>>>>>comma-separated list in the "est" option, and all of your fasta
>>>>>>>>protein files in a comma-separated list for the "protein" option.
>>>>>>>>
>>>>>>>>You'll specify the SNAP and Genemark files in their respective
>>>>>>>>options in the control file and pass the augustus and fgenesh
>>>>>>>>predictions in the "pred_gff" option.
>>>>>>>>
>>>>>>>>If you have the RepeatMasker output in gff3 format you can give it
>>>>>>>>to maker with the "rm_gff" option.
>>>>>>>>
>>>>>>>>If you've converted the cufflinks output to gff3, you can give it
>>>>>>>>to maker with the "est_gff" option. I'm pretty sure Trinity only
>>>>>>>>gives fasta output, so you would put that in the "est" option,
>>>>>>>>along with all the other est fasta files.
>>>>>>>>
>>>>>>>>If Augustus isn't trained for your particular organism, then you
>>>>>>>>can use another organism that augustus is already trained for. The
>>>>>>>>list of species that augustus has parameter files for is in the
>>>>>>>>README.txt that came with Augustus. I really recommend that you run
>>>>>>>>Augustus from inside maker, because then you get all the benefits
>>>>>>>>of maker passing ext-based hints to augustus at runtime, which can
>>>>>>>>really improve Augustus' predictive ability.
>>>>>>>>
>>>>>>>>When you ran the augustus gene prediction separately, did you use
>>>>>>>>another organism's parameter file?
>>>>>>>> >>>>>>>>Thanks, >>>>>>>>Daniel >>>>>>>> >>>>>>>> >>>>>>>>On Jul 16, 2014, at 11:15 AM, Nguyen, Anh-Dao (NIH/NHGRI) [C] >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Hi, >>>>>>>>> >>>>>>>>> I would like to conduct a genome annotation and have the >>>>>>>>>following >>>>>>>>>data: >>>>>>>>> - Two separate RepeatMasker outputs (using -lib and -species >>>>>>>>>options) >>>>>>>>> - ESTs and RACE (fasta) >>>>>>>>> - proteins (fasta) >>>>>>>>> - proteins of related organisms (fasta) >>>>>>>>> - SNAP's .hmm file (ran CEGMA, then used cegma2zff.pl to convert >>>>>>>>>to >>>>>>>>>ZFF >>>>>>>>>format, etc. ) >>>>>>>>> - GeneMark's .hmm file (es.mod file from running gm_es.pl) >>>>>>>>> - FGENESH++ and Augustus gene predictions. I wrote scripts to >>>>>>>>>convert >>>>>>>>>the outputs to .gff3 files. The reason why I ran Augustus gene >>>>>>>>>prediction separately, because the genome has never been trained >>>>>>>>>for >>>>>>>>>Augustus. >>>>>>>>> - Cufflinks and Trinity from RNA-Seq >>>>>>>>> >>>>>>>>> Could you please let me know how can I specify parameters in the >>>>>>>>>maker_opts.ctl file? >>>>>>>>> Or do you have other suggestions to re-do the data listed above? >>>>>>>>> >>>>>>>>> Thanks. >>>>>>>>> Anh-Dao >>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>>>> maker-devel mailing list >>>>>>>>> maker-devel at box290.bluehost.com >>>>>>>>> >>>>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-la >>>>>>>>>b >>>>>>>>>. >>>>>>>>>o >>>>>>>>>r >>>>>>>>>g >>>>>>>> >>>>>>> >>>>>>> >>>>>>>_______________________________________________ >>>>>>>maker-devel mailing list >>>>>>>maker-devel at box290.bluehost.com >>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab. 
>>>>>>>o
>>>>>>>r
>>>>>>>g
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>
>

From jp.oeyen at uni-bonn.de Mon Jul 28 06:22:25 2014
From: jp.oeyen at uni-bonn.de (Jan Philip Oeyen)
Date: Mon, 28 Jul 2014 14:22:25 +0200
Subject: [maker-devel] Forks.pm error when running maker with dsindex
Message-ID:

Hi all,

we are currently seeing some unexpected errors when running maker on a genome which is split into several parts. Our cluster admin reported the following error message:

Argument "ALRM" isn't numeric in exit at /share/scientific_bin/perlmodules/lib/site_perl/5.14.2/x86_64-linux-thread-multi/forks.pm line 2188.
SIGTERM received
SIGTERM received
SIGTERM received

We were using maker with the '-g' option on a single genome which is split into 20 parts, where 19 parts are equally large and the last contains about 20 sequences more. After that we ran Maker using dsindex to clean up the output.

We are currently using maker v2.31 on 4 threads and forks v0.34.

If any further info is needed to clarify the problem, please let me know and I will provide as much as possible.

Thank you for your help!

Best regards,
Jan Philip Oeyen
ZFMK // ZMB // University of Bonn

From mphoeppner at gmail.com Wed Jul 30 04:44:36 2014
From: mphoeppner at gmail.com (Marc Höppner)
Date: Wed, 30 Jul 2014 12:44:36 +0200
Subject: [maker-devel] Maker GFF output with features of 0 length
Message-ID: <5C45F418-018B-4ACC-B682-E5659DB7F102@gmail.com>

Hi,

I've - more by accident - found that many of the gene builds I have generated with Maker (2.31.3) contain features with identical start and stop positions. For example:

scaffold_2927 maker CDS 13013 13013 . + 1 ID=maker-scaffold_2927-augustus-gene-0.8-mRNA-1:cds;Parent=maker-scaffold_2927-augustus-gene-0.8-mRNA-1

This occurs seemingly randomly for all sorts of feature types, and I have only seen this when running Maker on full assemblies.
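A minimal sketch, in Python, of how lines like the one above could be listed in a merged GFF3 file. It assumes a standard nine-column, tab-delimited GFF3; the function name find_single_base_features and the example records are illustrative only and are not part of MAKER. Note that GFF3 coordinates are 1-based and inclusive, so a feature whose start equals its end spans a single base rather than zero bases.

```python
import io


def find_single_base_features(handle):
    """Yield (line_number, seqid, type, position) for features with start == end.

    Comment lines, blank lines and anything that is not a nine-column
    feature line (for example an embedded FASTA section) are skipped.
    """
    for line_number, line in enumerate(handle, start=1):
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue
        seqid, _source, feature_type, start, end = fields[:5]
        if start.isdigit() and end.isdigit() and int(start) == int(end):
            yield line_number, seqid, feature_type, int(start)


# Hypothetical two-feature example; real input would be an open GFF3 file.
example = io.StringIO(
    "##gff-version 3\n"
    "scaffold_2927\tmaker\tCDS\t13013\t13013\t.\t+\t1\tID=cds1;Parent=mRNA1\n"
    "scaffold_2927\tmaker\texon\t12000\t12500\t.\t+\t.\tID=exon1;Parent=mRNA1\n"
)
hits = list(find_single_base_features(example))
for hit in hits:
    print(hit)
```

Counting the hits per feature type, or per scaffold, might help to tell whether the affected lines cluster in particular chunks of the run.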
Before I start turning over every stone, any ideas about possible explanations for this phenomenon? Is this likely some MPI-related communication issue, or an NFS problem with syncing data? Maker runs fine on our system, but that doesn't mean that there aren't any cryptic issues that only rear their head on these occasions...

Regarding the frequency: out of 450,000 GFF lines, 270 were affected in the case that I looked into the most. So it is pretty rare, but still...

I am currently using Maker with openmpi-1.7.4, and the file system is mounted over NFS4 and IPoIB. I have now switched to Maker 2.31.6, but have no strong reason to suspect that this will make a difference.

Regards,

Marc

From carsonhh at gmail.com Thu Jul 3 08:12:07 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 03 Jul 2014 08:12:07 -0600
Subject: [maker-devel] Some questions regarding ab-initio training
In-Reply-To:
References: <1CD4559D-7A9D-4F8C-92F4-F5228F4E23B8@bils.se> <7FFAF2D2-3D32-40CF-8120-E6F858F74F1C@bils.se>
Message-ID:

The hints used by MAKER are CDSpart, exonpart, intronpart, and intron. You can play around with the extrinsic evidence configuration file if you want, but it's really not well documented, so I won't be able to provide much support.

Thanks,
Carson

On 7/1/14, 6:31 AM, "Marc Höppner" wrote:

>Hi,
>
>sorry for resurrecting this topic. The issue was about the use of
>ab-initio predictions and artefacts in the final maker gene builds.
>
>I think one potential issue that hasn't been discussed here concerns
>Maker's use of the extrinsic config file when running Augustus. This file
>controls the "weights" of different types of hints when running Augustus.
>I don't think it is made clear anywhere which extrinsic config file Maker
>reads (from the logs it seems to be extrinsic.MPE.cfg).
Nor is it >suggested that it would be useful to manipulate this file to improve >augustus performance (and in extension Makers performance). Finally, I am >not entirely sure which sorts of hints Maker creates for Augustus and to >which hint categories these would belong to (i.e. it makes no sense to >tweak the intronpart malus factor if Maker does not create such hints). >Perhaps it would be good to elaborate on that in the Maker documentation, >since it seems to be quite relevant for obtaining better results. Or does >such an explanation already exist somewhere? > > >/Marc > >Marc P. Hoeppner, PhD >Team Leader >BILS Genome Annotation Platform >Department for Medical Biochemistry and Microbiology >Uppsala University, Sweden >marc.hoeppner at bils.se > >On 05 Jun 2014, at 20:28, Carson Holt wrote: > >> One thing you might want to try is adding another predictor like SNAP >> together with Augustus and then process the MAKER results using EVM. We >> actually have a collaboration with the EVM group to produce a MAKER-EVM >> pipeline (MAKER 3.0). EVM will produce consensus models using the >> predictions and the evidence in the MAKER GFF3 which are generally >>better >> than just SNAP and Augustus with hints, so it might be able to remove >>some >> of the artifacts you are worried about. >> >> --Carson >> >> >> >> On 6/5/14, 12:24 PM, "Carson Holt" wrote: >> >>> Like I said. The predictors do the best they can, so there is probably >>> something about the regions to make the CDS, reading frame, or >>>start/stop >>> work that requires exons to be dropped or added. In several ant >>>genomes >>> we saw something like this caused by incorrect homopolymers in the >>> assembly which force the predictor to slightly alter the intron/exon >>> structure because otherwise the reading frame made no sense (the EST >>> alignments were used to confirmed that the assembly homopolymers were >>> incorrect - lots of bad single base pair deletions). 
>>> >>> The way hints work is as follows. At the simplest level ab initio >>> predictors are calculating the probability of being in different states >>> (intergenic, intron, exon, etc.). The hints increase the probability >>>of >>> being in the intron state where MAKER gives an intron hint or being in >>>an >>> exon/CDS state when MAKER gives an exon/CDS hint. So this bends the >>> likelihood of the ab intio gene predictor to call something similar in >>> structure to the evidence overlapping it. That being said, if there is >>> strong enough signal from something else in the sequence or my hints >>>won't >>> work with the splice sites in the region or the reading frame breaks, >>>then >>> no amount of hints can force augustus to make that model. >>> >>> --Carson >>> >>> >>> >>> On 6/5/14, 2:15 AM, "Marc H?ppner" wrote: >>> >>>> Hi, >>>> >>>> thanks for the feedback. I spent some more time on this and am still >>>> somewhat unsatisfied with the whole thing? >>>> >>>> A few comments: >>>> >>>> I quite frequently see augustus and in extension Maker including exons >>>> that are not supported by EST/Protein evidence and are not critical >>>>for >>>> the gene model (i.e. I can take them out and still get a proper CDS). >>>> Maybe I don?t know enough about how Maker creates hints and more >>>> importantly what role these hints play for augustus, but I cannot >>>>really >>>> see a great effect (any, really) on the gene models even if both ESTs >>>>and >>>> proteins contradict an augustus gene model and the surplus exon is >>>> non-essential. >>>> >>>> (all evidence is provided as fasta files, protein2genome and >>>>est2genome >>>> are set to 0) >>>> >>>> As for the repeat library, I suppose this is a critical point. I am >>>>using >>>> repeats from a closely related species via Repeatmasker, modelled and >>>> filtered repeats from RepeatModeler and repeats derived from a >>>> high-coverage 454 data set. Not sure what else I can do to improve >>>>that. 
>>>> >>>> As for evidence, I am using the curated reference proteome from a >>>>related >>>> species (<5 Mio years), unprot_swissprot and high-quality ESTs + 454 >>>> reads. I don?t think it gets a whole lot better, in terms of what data >>>> can be used. >>>> >>>> So in summary, I just don?t get where I want to using Augustus and >>>>Maker >>>> - specifically, the gene models are full of weird, unsupported >>>>artefacts >>>> despite manually curating > 850 models for training. I suppose I was >>>> hoping for some secret trick to improve on this - but I guess there is >>>> none? Actually, if I only do a pure evidence build (seeing that my >>>>input >>>> data is very high quality), it looks better - which sort of goes >>>>against >>>> the premise of Maker :/ >>>> >>>> Regards, >>>> >>>> Marc >>>> >>>> >>>> >>>> >>>> Marc P. Hoeppner, PhD >>>> Team Leader >>>> Department for Medical Biochemistry and Microbiology >>>> Uppsala University, Sweden >>>> marc.hoeppner at bils.se >>>> >>>> On 27 May 2014, at 17:25, Carson Holt wrote: >>>> >>>>> Extra exons can be required for predictors to make sense of a region >>>>> (they >>>>> do the best they can). This can be due to imperfect assemblies or >>>>> repeats. For plants the repeat database is the the one thing that >>>>>will >>>>> most affect the annotation quality. You may need to spend some time >>>>> building the best repeat library you can. The repeat library is the >>>>> next >>>>> most important thing next to training the predictor, because they >>>>> confuse >>>>> the predictor (sometimes a lot) causing it to behave oddly in those >>>>> regions (because repeats do encode real protein and protein >>>>>fragments). >>>>> Also when running now with MAKER make sure to include the entire >>>>> proteome >>>>> of a related species and not just UniProt, and you will get better >>>>> performance. 
Now that you have Augustus trained, using it inside of >>>>> MAKER >>>>> with an improved repeat library and additional protein evidence >>>>>should >>>>> give it the feedback that will allow it to perform better than it >>>>>would >>>>> with just naked ab initio prediction. >>>>> >>>>> Thanks, >>>>> Carson >>>>> >>>>> >>>>> On 5/27/14, 2:12 AM, "Marc H?ppner" wrote: >>>>> >>>>>> Hi, >>>>>> >>>>>> I wanted to get some feedback regarding the training of ab-initio >>>>>>gene >>>>>> finders - it?s not strictly Maker related, but I suppose there are >>>>>> many >>>>>> people on this list that have encountered and solved this issue in >>>>>>one >>>>>> way or another. >>>>>> >>>>>> Specifically, I am trying to train Augustus (and possibly SNAP) for >>>>>>a >>>>>> plant genome. This has always been a very frustrating process for >>>>>>me, >>>>>> but >>>>>> while I have a better idea now how to do it, I still don?t get the >>>>>> sort >>>>>> of accuracy that I am hoping for. A quick run-through of my process; >>>>>> >>>>>> Evidence build with maker on level 1 and 2 proteins from Uniprot + >>>>>> Sanger-sequenced EST data >>>>>> >>>>>> Filtered for Models with an AED <= 0.3 >>>>>> >>>>>> Loaded that into WebApollo, together with an existing reference >>>>>> annotation and the evidence tracks >>>>>> >>>>>> Manually curated/selected 750 gene models using the following rules: >>>>>> - Must have start/stop codon >>>>>> - Most have as many exons as possible >>>>>> - Must agree with evidence >>>>>> - Must be >= 2kb part from other gene models (provided as flanking >>>>>> regions for augustus to train intergenic sequence) >>>>>> >>>>>> From these models, I created a GBK file, split it into 650 (train) >>>>>> and >>>>>> 100 (test) models and created a new profile using the documented >>>>>> procedure. 
>>>>>> >>>>>> But: >>>>>> >>>>>> While the naked ab-init models created through maker get a lot of >>>>>> genes >>>>>> ?sort of right?, I still see too many issues to be really satisfied. >>>>>> Problems include: >>>>>> >>>>>> - random exon calls which are not supported by any line of evidence >>>>>> (~1 >>>>>> per gene model, I would guess) >>>>>> - poor congruency with some gene models (especially ones not used >>>>>>for >>>>>> training/testing) >>>>>> >>>>>> Is there any best-practice guide on how to improve this? The >>>>>>Augustus >>>>>> website is unfortunately quite poor on detail? My impression so far >>>>>>is >>>>>> that ramping up the number of training models isn?t really doing too >>>>>> much >>>>>> beyond a certain point (tried 400, 500 and 750). >>>>>> >>>>>> Regards, >>>>>> >>>>>> Marc >>>>>> >>>>>> >>>>>> Marc P. Hoeppner, PhD >>>>>> Team Leader >>>>>> BILS Genome Annotation Platform >>>>>> Department for Medical Biochemistry and Microbiology >>>>>> Uppsala University, Sweden >>>>>> marc.hoeppner at bils.se >>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> maker-devel mailing list >>>>>> maker-devel at box290.bluehost.com >>>>>> >>>>>> >>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.o >>>>>>rg >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> maker-devel mailing list >>>>> maker-devel at box290.bluehost.com >>>>> >>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.or >>>>>g >>>> >>> >>> >> >> >> >> _______________________________________________ >> maker-devel mailing list >> maker-devel at box290.bluehost.com >> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > From marc.hoeppner at bils.se Tue Jul 1 06:31:33 2014 From: marc.hoeppner at bils.se (=?windows-1252?Q?Marc_H=F6ppner?=) Date: Tue, 1 Jul 2014 14:31:33 +0200 Subject: [maker-devel] Some questions regarding ab-initio training In-Reply-To: References: 
<1CD4559D-7A9D-4F8C-92F4-F5228F4E23B8@bils.se> <7FFAF2D2-3D32-40CF-8120-E6F858F74F1C@bils.se>
Message-ID:

Hi,

sorry for resurrecting this topic. The issue was about the use of ab-initio predictions and artefacts in the final maker gene builds.

I think one potential issue that hasn't been discussed here concerns Maker's use of the extrinsic config file when running Augustus. This file controls the "weights" of different types of hints when running Augustus. I don't think it is made clear anywhere which extrinsic config file Maker reads (from the logs it seems to be extrinsic.MPE.cfg). Nor is it suggested that it would be useful to manipulate this file to improve augustus performance (and by extension Maker's performance). Finally, I am not entirely sure which sorts of hints Maker creates for Augustus and which hint categories these would belong to (i.e. it makes no sense to tweak the intronpart malus factor if Maker does not create such hints). Perhaps it would be good to elaborate on that in the Maker documentation, since it seems to be quite relevant for obtaining better results. Or does such an explanation already exist somewhere?

/Marc

Marc P. Hoeppner, PhD
Team Leader
BILS Genome Annotation Platform
Department for Medical Biochemistry and Microbiology
Uppsala University, Sweden
marc.hoeppner at bils.se

On 05 Jun 2014, at 20:28, Carson Holt wrote:

> One thing you might want to try is adding another predictor like SNAP
> together with Augustus and then process the MAKER results using EVM. We
> actually have a collaboration with the EVM group to produce a MAKER-EVM
> pipeline (MAKER 3.0). EVM will produce consensus models using the
> predictions and the evidence in the MAKER GFF3 which are generally better
> than just SNAP and Augustus with hints, so it might be able to remove some
> of the artifacts you are worried about.
>
> --Carson
>
>
>
> On 6/5/14, 12:24 PM, "Carson Holt" wrote:
>
>> Like I said.
The predictors do the best they can, so there is probably >> something about the regions to make the CDS, reading frame, or start/stop >> work that requires exons to be dropped or added. In several ant genomes >> we saw something like this caused by incorrect homopolymers in the >> assembly which force the predictor to slightly alter the intron/exon >> structure because otherwise the reading frame made no sense (the EST >> alignments were used to confirmed that the assembly homopolymers were >> incorrect - lots of bad single base pair deletions). >> >> The way hints work is as follows. At the simplest level ab initio >> predictors are calculating the probability of being in different states >> (intergenic, intron, exon, etc.). The hints increase the probability of >> being in the intron state where MAKER gives an intron hint or being in an >> exon/CDS state when MAKER gives an exon/CDS hint. So this bends the >> likelihood of the ab intio gene predictor to call something similar in >> structure to the evidence overlapping it. That being said, if there is >> strong enough signal from something else in the sequence or my hints won't >> work with the splice sites in the region or the reading frame breaks, then >> no amount of hints can force augustus to make that model. >> >> --Carson >> >> >> >> On 6/5/14, 2:15 AM, "Marc H?ppner" wrote: >> >>> Hi, >>> >>> thanks for the feedback. I spent some more time on this and am still >>> somewhat unsatisfied with the whole thing? >>> >>> A few comments: >>> >>> I quite frequently see augustus and in extension Maker including exons >>> that are not supported by EST/Protein evidence and are not critical for >>> the gene model (i.e. I can take them out and still get a proper CDS). 
>>> Maybe I don?t know enough about how Maker creates hints and more >>> importantly what role these hints play for augustus, but I cannot really >>> see a great effect (any, really) on the gene models even if both ESTs and >>> proteins contradict an augustus gene model and the surplus exon is >>> non-essential. >>> >>> (all evidence is provided as fasta files, protein2genome and est2genome >>> are set to 0) >>> >>> As for the repeat library, I suppose this is a critical point. I am using >>> repeats from a closely related species via Repeatmasker, modelled and >>> filtered repeats from RepeatModeler and repeats derived from a >>> high-coverage 454 data set. Not sure what else I can do to improve that. >>> >>> As for evidence, I am using the curated reference proteome from a related >>> species (<5 Mio years), unprot_swissprot and high-quality ESTs + 454 >>> reads. I don?t think it gets a whole lot better, in terms of what data >>> can be used. >>> >>> So in summary, I just don?t get where I want to using Augustus and Maker >>> - specifically, the gene models are full of weird, unsupported artefacts >>> despite manually curating > 850 models for training. I suppose I was >>> hoping for some secret trick to improve on this - but I guess there is >>> none? Actually, if I only do a pure evidence build (seeing that my input >>> data is very high quality), it looks better - which sort of goes against >>> the premise of Maker :/ >>> >>> Regards, >>> >>> Marc >>> >>> >>> >>> >>> Marc P. Hoeppner, PhD >>> Team Leader >>> Department for Medical Biochemistry and Microbiology >>> Uppsala University, Sweden >>> marc.hoeppner at bils.se >>> >>> On 27 May 2014, at 17:25, Carson Holt wrote: >>> >>>> Extra exons can be required for predictors to make sense of a region >>>> (they >>>> do the best they can). This can be due to imperfect assemblies or >>>> repeats. For plants the repeat database is the the one thing that will >>>> most affect the annotation quality. 
You may need to spend some time >>>> building the best repeat library you can. The repeat library is the >>>> next >>>> most important thing next to training the predictor, because they >>>> confuse >>>> the predictor (sometimes a lot) causing it to behave oddly in those >>>> regions (because repeats do encode real protein and protein fragments). >>>> Also when running now with MAKER make sure to include the entire >>>> proteome >>>> of a related species and not just UniProt, and you will get better >>>> performance. Now that you have Augustus trained, using it inside of >>>> MAKER >>>> with an improved repeat library and additional protein evidence should >>>> give it the feedback that will allow it to perform better than it would >>>> with just naked ab initio prediction. >>>> >>>> Thanks, >>>> Carson >>>> >>>> >>>> On 5/27/14, 2:12 AM, "Marc H?ppner" wrote: >>>> >>>>> Hi, >>>>> >>>>> I wanted to get some feedback regarding the training of ab-initio gene >>>>> finders - it?s not strictly Maker related, but I suppose there are >>>>> many >>>>> people on this list that have encountered and solved this issue in one >>>>> way or another. >>>>> >>>>> Specifically, I am trying to train Augustus (and possibly SNAP) for a >>>>> plant genome. This has always been a very frustrating process for me, >>>>> but >>>>> while I have a better idea now how to do it, I still don?t get the >>>>> sort >>>>> of accuracy that I am hoping for. 
A quick run-through of my process; >>>>> >>>>> Evidence build with maker on level 1 and 2 proteins from Uniprot + >>>>> Sanger-sequenced EST data >>>>> >>>>> Filtered for Models with an AED <= 0.3 >>>>> >>>>> Loaded that into WebApollo, together with an existing reference >>>>> annotation and the evidence tracks >>>>> >>>>> Manually curated/selected 750 gene models using the following rules: >>>>> - Must have start/stop codon >>>>> - Most have as many exons as possible >>>>> - Must agree with evidence >>>>> - Must be >= 2kb part from other gene models (provided as flanking >>>>> regions for augustus to train intergenic sequence) >>>>> >>>>> From these models, I created a GBK file, split it into 650 (train) >>>>> and >>>>> 100 (test) models and created a new profile using the documented >>>>> procedure. >>>>> >>>>> But: >>>>> >>>>> While the naked ab-init models created through maker get a lot of >>>>> genes >>>>> ?sort of right?, I still see too many issues to be really satisfied. >>>>> Problems include: >>>>> >>>>> - random exon calls which are not supported by any line of evidence >>>>> (~1 >>>>> per gene model, I would guess) >>>>> - poor congruency with some gene models (especially ones not used for >>>>> training/testing) >>>>> >>>>> Is there any best-practice guide on how to improve this? The Augustus >>>>> website is unfortunately quite poor on detail? My impression so far is >>>>> that ramping up the number of training models isn?t really doing too >>>>> much >>>>> beyond a certain point (tried 400, 500 and 750). >>>>> >>>>> Regards, >>>>> >>>>> Marc >>>>> >>>>> >>>>> Marc P. 
Hoeppner, PhD >>>>> Team Leader >>>>> BILS Genome Annotation Platform >>>>> Department for Medical Biochemistry and Microbiology >>>>> Uppsala University, Sweden >>>>> marc.hoeppner at bils.se >>>>> >>>>> >>>>> _______________________________________________ >>>>> maker-devel mailing list >>>>> maker-devel at box290.bluehost.com >>>>> >>>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org >>>> >>>> >>>> >>>> _______________________________________________ >>>> maker-devel mailing list >>>> maker-devel at box290.bluehost.com >>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org >>> >> >> > > > > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org From rajesh.bommareddy at tu-harburg.de Thu Jul 3 08:45:59 2014 From: rajesh.bommareddy at tu-harburg.de (Rajesh Reddy Bommareddy) Date: Thu, 03 Jul 2014 16:45:59 +0200 Subject: [maker-devel] Maker output Message-ID: <53B56CA7.80108@tu-harburg.de> Dear Maker group I have run the example files provided with maker. But i am unable to understand the output. Where can i find the information about exons, CDS, protein sequence of the predicted CDS or mRNA and the predicted protein name for each contig? Thanks and Regards -- M.Sc. Rajesh Reddy Bommareddy Institute of Bioprocess and Biosystems Engineering Hamburg University of Technology(TUHH) Denikestrasse 15, 20171 Hamburg GERMANY. 
e.mail: rajesh.bommareddy at tu-harburg.de
Phone:+4940428784011
Mobile:+4917663673522

From carsonhh at gmail.com Thu Jul 3 08:51:57 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 03 Jul 2014 08:51:57 -0600
Subject: [maker-devel] Maker output
In-Reply-To: <53B56CA7.80108@tu-harburg.de>
References: <53B56CA7.80108@tu-harburg.de>
Message-ID:

See the MAKER 2014 GMOD tutorial --> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014

Also watch accompanying video --> http://youtu.be/uA96tSSaqLk

Results will be in GFF3 and FASTA format. The GFF3 file contains the location of structure relative to the assembly (exon/CDS/UTR). The FASTA file contains the sequence (transcript/protein). There will be separate files for each contig. Use gff3_merge and fasta_merge to generate merged genome wide GFF3 and FASTA files.

An explanation of GFF3 format is here --> http://www.sequenceontology.org/gff3.shtml

Thanks,
Carson

On 7/3/14, 8:45 AM, "Rajesh Reddy Bommareddy" wrote:

>Dear Maker group
>
>I have run the example files provided with maker. But i am unable to
>understand the output. Where can i find the information about exons,
>CDS, protein sequence of the predicted CDS or mRNA and the predicted
>protein name for each contig?
>
>
>Thanks and Regards
>--
>M.Sc. Rajesh Reddy Bommareddy
>Institute of Bioprocess and Biosystems Engineering
>Hamburg University of Technology(TUHH)
>Denikestrasse 15, 20171 Hamburg
>GERMANY.
>e.mail: rajesh.bommareddy at tu-harburg.de
>Phone:+4940428784011
>Mobile:+4917663673522
>
>_______________________________________________
>maker-devel mailing list
>maker-devel at box290.bluehost.com
>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From dence at genetics.utah.edu Mon Jul 7 08:24:33 2014
From: dence at genetics.utah.edu (Daniel Ence)
Date: Mon, 7 Jul 2014 14:24:33 +0000
Subject: [maker-devel] Help with updating an annotation
In-Reply-To:
References: <2176827D-6E54-4951-B01E-CDAC15DB3A2E@tuebingen.mpg.de> <997CF579-F509-4699-A366-30D53AD6281E@genetics.utah.edu> <90F65756-74A0-4D2F-A49F-42C5EDDB25E9@tuebingen.mpg.de>
Message-ID: <8219A0C0-DBB0-4417-8B4F-39D6D7F93B93@genetics.utah.edu>

Hi Saad,

I think that's correct. As a sub-step for each of the steps you listed, I would also choose one or two large scaffolds out of your assembly to use as a test set, and use that test set to make sure that you are getting output like you'd expect before running MAKER on the whole genome.

Let me know if there's anything else we can do to help.

~Daniel

Daniel Ence
Graduate Student
dence at genetics.utah.edu
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330

On Jul 7, 2014, at 7:08 AM, Saad Arif wrote:

Thanks for this. Would the following protocol be appropriate then, given I want to augment and merge an existing annotation with any novel genes:

i) Run the MAKER pipeline iteratively to generate an HMM for SNAP using my new RNAseq data and protein FASTAs from closely related organisms (with the est2genome and protein2genome options on).

ii) Turn off the est2genome and protein2genome options and run MAKER with my RNAseq evidence, protein FASTAs, SNAP HMM and my current annotation as model_gff.

Thanks in advance for any input.
best,
Saad

On 20 Jun 2014, at 23:42, Carson Holt wrote:

"I have to specify an ab initio gene model for any locus that I wish to annotate using evidence alignment (i.e. there must be a preexisting model)?"

Not exactly. You need to supply an HMM for SNAP or a species file for Augustus, etc. MAKER doesn't generate gene predictions, SNAP does. You cannot get updated models unless you've provided a way for those models to be updated. MAKER will provide SNAP/Augustus with hints to make them perform better based on the evidence, but those hints won't even be generated and the programs won't even run unless you provide the HMM. Also, if you provide models in GFF3 format to pred_gff, there is no hint feedback (because there is no program to receive the hints, just an immutable GFF3 file).

If you don't have an HMM for SNAP for your organism, you can generate one using the documentation here (from the GMOD 2014 tutorial) -->
http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Training_ab_initio_Gene_Predictors

--Carson

From: Saad Arif
Date: Wednesday, June 18, 2014 at 10:42 AM
To: Daniel Ence
Cc:
Subject: Re: [maker-devel] Help with updating an annotation

Thanks Daniel. I think it's clearer to me now. So if I understand correctly:

I have to specify an ab initio gene model for any locus that I wish to annotate using evidence alignment (i.e. there must be a preexisting model)? These ab initio gene models can be trained internally in MAKER with SNAP using my cufflinks output as EST evidence. Alternatively, I can provide alternative ab initio predictions (for regions not present in my Ensembl reference passed to model_gff) for regions overlapping my cufflinks output via the pred_gff option? Since I'm interested in unannotated regions, I'm also passing in reference proteomes of closely related species as protein homology evidence.
As such I should be able to keep only evidence-supported predictions (for regions not present in my model_gff and/or better supported models for present regions) from my pred_gff and merge them with Ensembl annotations from the model_gff?

Let me know if I'm still missing something here. Thanks in advance.

best,
Saad

On 18 Jun 2014, at 17:21, Daniel Ence wrote:

Hi Saad,

MAKER doesn't view EST or protein evidence as gene models in themselves. There's a good reason for this. Aligners like blast don't guarantee complete gene models, with accurate start and stop codons and splice sites. With its default settings MAKER won't make a gene model unless there's evidence that overlaps an ab-initio prediction (or something from the pred_gff option). You can use est2genome to promote everything from the est_gff option to a gene model, but this will probably give you many spurious results. What you're saying with est2genome is, "Everything that this tool found is a complete gene model." I don't think that's true even for cufflinks output.

One of the gene predictors that can run internally is SNAP. It's really easy to train; here's a link to a tutorial for training it:
http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Training_ab_initio_Gene_Predictors

Let me know if that helps, or if you have more questions.

~Daniel

Daniel Ence
Graduate Student
dence at genetics.utah.edu
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330

On Jun 18, 2014, at 5:09 AM, Saad Arif wrote:

Thank you for the response. I still have one question though, with these options:

est_GFF=cufflinksout.GFF
model_GFF= ensembl reference.GFF

What happens to cufflinks assembled transcripts that are not confined to current gene loci (i.e. novel genes in cufflinks output)? Would I have to prepare ab initio gene predictions for each of these predicted 'new' genes?
Is there a simple way to combine adding new genes and improving an existing annotation? Any feedback on this would be greatly appreciated.

saad

On 13 Jun 2014, at 17:59, Carson Holt wrote:

Use the cufflinks instead of the tophat features (tophat tends to be really noisy). Give the existing models to model_gff (they will then always be kept unless something better is found). There is no option to keep models and then just add isoforms. The model_gff input will either be kept as is (unchanged), or replaced with an updated model suggested by the evidence (the updated model may contain multiple isoforms though), and map_forward=1 can be used to pull names forward from the old model onto the new models.

Thanks,
Carson

On 6/13/14, 5:03 AM, "Saad Arif" wrote:

Dear All,

I would like to use the MAKER pipeline to expand a current annotation (new isoforms and novel genes with respect to the current annotation) and was wondering if anyone had experience with this and/or suggestions regarding my questions.

Briefly: I have tophat splice junctions from RNAseq data or, alternatively, cufflinks generated transcript models (FASTA format) that I want to use as my new data (est_gff or est). I want to provide the current Ensembl annotation for gene prediction but I want this annotation to remain unchanged. Hence, I'm not sure if I should provide this annotation as pred_gff or model_gff. Can the model_gff be used for gene prediction or is this just a subset of pred_gff that remains unaltered? Can we provide the same annotation for both options (pred_gff and model_gff)?

Importantly, my main goal is to use the new RNAseq data to add more isoforms and (any) novel genes to the existing Ensembl annotation. Any thoughts or suggestions on how to go about this would be sincerely appreciated.
Thanks in advance, saad _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.com http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.com http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.com http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From n.jue at uconn.edu Mon Jul 7 09:26:05 2014 From: n.jue at uconn.edu (Nathaniel Jue) Date: Mon, 7 Jul 2014 11:26:05 -0400 Subject: [maker-devel] Couple quick questions about Maker Message-ID: Hi, I'm trying to run maker on a couple genomes right now and was wondering if folks had any thoughts on way to speed it up a bit. I'm running it on a 48-processor supercomputer (lots of RAM, usually use it for genome assembly). Both these genomes are a little fragmented, so there are lots of contigs, which slows down the whole process. I am looking for ways to speed things up and was wondering about a couple things: 1) I'm currently just at the first round of maker predictions using EST and protein evidence to build models. Had already done RepeatMasking so thought I'd just input subsequent GFF to speed it up. Didn't seem to like the GFF, so two questions: i) any thoughts on why that GFF wasn't acceptable? It's the one that repeatmasker outputs if you ask it to; and ii) Providing this GFF, should generally allow the program to bypass the RepeatMasking step, correct? Does it also make it bypass the Repeat ORF searching step? 2) I plan to run both SNAP and Augustus on these genomes as well. The two-step SNAP training from the tutorials seems straightforward, but I was wondering about the Augustus step. 
From what I can tell, simply providing an Augustus "trained" species name should turn on Augustus and blast/blat-like hints generated within Maker are then used in gene prediction. Any thoughts on if it's either more accurate or faster to do the Augustus predictions outside of the Maker pipeline and then import them using the pred_gff parameter in the maker_opts file? 3) Finally, I noticed that you had a script for converting cegma gff files to zff file for snap training? Currently, I am using predicted transcript for this species and protein sequences from related species to training. Does anyone have any insight into using CEGMA results as well? Do you work iteratively with them? For instance, start with the using hints from more distant taxa (i.e. CEGMA) and then work your way closer? Just throw everything in at once and retrain after that? Thanks in advance for any advice and insight. Cheers, Nate *Nathaniel Jue, Ph.D.* Department of Molecular and Cell Biology University of Connecticut Storrs, CT 06269 [image: LinkedIn] -------------- next part -------------- An HTML attachment was scrubbed... URL: From dence at genetics.utah.edu Mon Jul 7 10:00:45 2014 From: dence at genetics.utah.edu (Daniel Ence) Date: Mon, 7 Jul 2014 16:00:45 +0000 Subject: [maker-devel] Couple quick questions about Maker In-Reply-To: References: Message-ID: <00469BE3-C2F8-45BB-8357-EC47CE0D7376@genetics.utah.edu> Hi Nathaniel, 1) We'll need to see the error messages that MAKER was giving to understand what might have gone wrong with the Repeat Masker gff3 file. If you could run maker on one of your scaffolds with your current settings and send us the complete output, we can start to figure out what happened. 2) MAKER interacts with its gene predictors (augustus, snap, and the other ones listed in the control files) in a way that improves their performance (with the hints and such). 
When you supply predictions through the pred_gff parameter, MAKER can't give that performance improvement, so there's something of a tradeoff there. I think the performance improvement is a key part of MAKER's success, so I would definitely recommend running the ab-initio tools internally. MAKER tries to save you time by saving results from run to run and only rerunning tools (usually blast tools) that had their parameters changed in the control files. Taking advantage of that will probably be the biggest time saver for you. Something else that could save you almost as much time would be to set a reasonable lower-bound on the size of contigs that maker will try to annotate (usually <5kbp or <10kbp depending on your genome). This parameter is set with the min_contig parameter. I'll have to check with my lab mates about the Repeat ORF searching and how they use CEGMA results. I think you can probably just put them all in there at once though. ~Daniel Daniel Ence Graduate Student dence at genetics.utah.edu Eccles Institute of Human Genetics University of Utah 15 North 2030 East, Room 2100 Salt Lake City, UT 84112-5330 On Jul 7, 2014, at 9:26 AM, Nathaniel Jue > wrote: Hi, I'm trying to run maker on a couple genomes right now and was wondering if folks had any thoughts on way to speed it up a bit. I'm running it on a 48-processor supercomputer (lots of RAM, usually use it for genome assembly). Both these genomes are a little fragmented, so there are lots of contigs, which slows down the whole process. I am looking for ways to speed things up and was wondering about a couple things: 1) I'm currently just at the first round of maker predictions using EST and protein evidence to build models. Had already done RepeatMasking so thought I'd just input subsequent GFF to speed it up. Didn't seem to like the GFF, so two questions: i) any thoughts on why that GFF wasn't acceptable? 
It's the one that repeatmasker outputs if you ask it to; and ii) Providing this GFF, should generally allow the program to bypass the RepeatMasking step, correct? Does it also make it bypass the Repeat ORF searching step?

2) I plan to run both SNAP and Augustus on these genomes as well. The two-step SNAP training from the tutorials seems straightforward, but I was wondering about the Augustus step. From what I can tell, simply providing an Augustus "trained" species name should turn on Augustus and blast/blat-like hints generated within Maker are then used in gene prediction. Any thoughts on if it's either more accurate or faster to do the Augustus predictions outside of the Maker pipeline and then import them using the pred_gff parameter in the maker_opts file?

3) Finally, I noticed that you had a script for converting cegma gff files to zff file for snap training? Currently, I am using predicted transcript for this species and protein sequences from related species to training. Does anyone have any insight into using CEGMA results as well? Do you work iteratively with them? For instance, start with the using hints from more distant taxa (i.e. CEGMA) and then work your way closer? Just throw everything in at once and retrain after that?

Thanks in advance for any advice and insight.

Cheers,
Nate

Nathaniel Jue, Ph.D.
Department of Molecular and Cell Biology
University of Connecticut
Storrs, CT 06269

_______________________________________________
maker-devel mailing list
maker-devel at box290.bluehost.com
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
-------------- next part --------------
An HTML attachment was scrubbed...
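A note on the min_contig suggestion in the reply above: before settling on a cutoff it may help to see how many contigs, and what share of the assembly, a given value would skip. The following is a minimal sketch, assuming the assembly is a plain multi-FASTA file. The function names contig_lengths and summarize_cutoff are hypothetical helpers for illustration and are not part of MAKER.

```python
from typing import Dict, Iterable, Tuple


def contig_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Return {contig_id: length} from the lines of a multi-FASTA file."""
    lengths: Dict[str, int] = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: the contig ID is the first whitespace-delimited token.
            parts = line[1:].split()
            current = parts[0] if parts else ""
            lengths[current] = 0
        elif current is not None:
            # Sequence line: add its residues to the current contig.
            lengths[current] += len(line)
    return lengths


def summarize_cutoff(lengths: Dict[str, int], min_contig: int) -> Tuple[int, int, float]:
    """Return (contigs kept, contigs skipped, fraction of bases kept).

    Assumes contigs shorter than min_contig are the ones skipped.
    """
    kept = [n for n in lengths.values() if n >= min_contig]
    total = sum(lengths.values())
    fraction = sum(kept) / total if total else 0.0
    return len(kept), len(lengths) - len(kept), fraction
```

To try it on a real assembly, open the FASTA file, pass the file handle to contig_lengths, and call summarize_cutoff with a few candidate values (for example 5000 and 10000) to compare how much sequence each would leave for annotation.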
From carsonhh at gmail.com Mon Jul 7 10:26:43 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 07 Jul 2014 10:26:43 -0600
Subject: [maker-devel] Couple quick questions about Maker
In-Reply-To: <00469BE3-C2F8-45BB-8357-EC47CE0D7376@genetics.utah.edu>
References: <00469BE3-C2F8-45BB-8357-EC47CE0D7376@genetics.utah.edu>
Message-ID:

I don't think RepeatMasker produces GFF3. I believe it is GFF2 with the -gff option (which is pretty different). Also, if you provide GFF3 files for repeats, you will still need to turn off repeat masking in the control files by blanking out the options. Also, MAKER uses a step called RepeatRunner against an internal transposable element protein database, which is probably still running (and is slow because it's a search in translated protein space).

For performance, you may want to give a larger max_dna_len for the MAKER run given that you have a large RAM machine. Also set all the depth_blast options in maker_bopts.ctl to 15 or 20.

CEGMA is convenient for training predictors because it finds genes that will always be in every eukaryote (i.e. high confidence). You can combine these with est2genome/protein2genome results from MAKER if you want. You can then use the resulting HMM for a larger MAKER run with experimental evidence, and then train again on those results. But beware that there is rarely any benefit from training beyond that second round. More training actually tends to make things worse (the overtraining paradox).

--Carson

From: Daniel Ence
Date: Monday, July 7, 2014 at 10:00 AM
To: Nathaniel Jue
Cc:
Subject: Re: [maker-devel] Couple quick questions about Maker

Hi Nathaniel,

1) We'll need to see the error messages that MAKER was giving to understand what might have gone wrong with the Repeat Masker gff3 file. If you could run maker on one of your scaffolds with your current settings and send us the complete output, we can start to figure out what happened.
2) MAKER interacts with its gene predictors (augustus, snap, and the other ones listed in the control files) in a way that improves their performance (with the hints and such). When you supply predictions through the pred_gff parameter, MAKER can't give that performance improvement, so there's something of a tradeoff there. I think the performance improvement is a key part of MAKER's success, so I would definitely recommend running the ab-initio tools internally. MAKER tries to save you time by saving results from run to run and only rerunning tools (usually blast tools) that had their parameters changed in the control files. Taking advantage of that will probably be the biggest time saver for you. Something else that could save you almost as much time would be to set a reasonable lower-bound on the size of contigs that maker will try to annotate (usually <5kbp or <10kbp depending on your genome). This parameter is set with the min_contig parameter. I'll have to check with my lab mates about the Repeat ORF searching and how they use CEGMA results. I think you can probably just put them all in there at once though. ~Daniel Daniel Ence Graduate Student dence at genetics.utah.edu Eccles Institute of Human Genetics University of Utah 15 North 2030 East, Room 2100 Salt Lake City, UT 84112-5330 On Jul 7, 2014, at 9:26 AM, Nathaniel Jue wrote: > Hi, > > I'm trying to run maker on a couple genomes right now and was wondering if > folks had any thoughts on way to speed it up a bit. I'm running it on a > 48-processor supercomputer (lots of RAM, usually use it for genome assembly). > Both these genomes are a little fragmented, so there are lots of contigs, > which slows down the whole process. I am looking for ways to speed things up > and was wondering about a couple things: > > 1) I'm currently just at the first round of maker predictions using EST and > protein evidence to build models. 
Had already done RepeatMasking so thought > I'd just input subsequent GFF to speed it up. Didn't seem to like the GFF, so > two questions: i) any thoughts on why that GFF wasn't acceptable? It's the one > that repeatmasker outputs if you ask it to; and ii) Providing this GFF, should > generally allow the program to bypass the RepeatMasking step, correct? Does it > also make it bypass the Repeat ORF searching step? > > 2) I plan to run both SNAP and Augustus on these genomes as well. The two-step > SNAP training from the tutorials seems straightforward, but I was wondering > about the Augustus step. From what I can tell, simply providing an Augustus > "trained" species name should turn on Augustus and blast/blat-like hints > generated within Maker are then used in gene prediction. Any thoughts on if > it's either more accurate or faster to do the Augustus predictions outside of > the Maker pipeline and then import them using the pred_gff parameter in the > maker_opts file? > > 3) Finally, I noticed that you had a script for converting cegma gff files to > zff file for snap training? Currently, I am using predicted transcript for > this species and protein sequences from related species to training. Does > anyone have any insight into using CEGMA results as well? Do you work > iteratively with them? For instance, start with the using hints from more > distant taxa (i.e. CEGMA) and then work your way closer? Just throw everything > in at once and retrain after that? > > Thanks in advance for any advice and insight. > > Cheers, > Nate > > > Nathaniel Jue, Ph.D. 
> Department of Molecular and Cell Biology
> University of Connecticut
> Storrs, CT 06269
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

_______________________________________________
maker-devel mailing list
maker-devel at box290.bluehost.com
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
-------------- next part --------------
An HTML attachment was scrubbed...

From n.jue at uconn.edu Mon Jul 7 11:21:50 2014
From: n.jue at uconn.edu (Nathaniel Jue)
Date: Mon, 7 Jul 2014 13:21:50 -0400
Subject: [maker-devel] Couple quick questions about Maker
In-Reply-To:
References: <00469BE3-C2F8-45BB-8357-EC47CE0D7376@genetics.utah.edu>
Message-ID:

Thanks for the input, guys. I'm guessing the error was probably from not turning off the repeat prediction or from the GFF file. I have a script for converting repeatmasker output to gff so maybe I'll just try that if I want to follow up on it. Thanks for the tips on parameter adjustments and thoughts on running the program as well.

Just a few more quick follow-up questions for you: is there a preferred method for stopping a job so that it will be able to restart while maximizing the benefits of the run.logs, etc.? Or just Ctrl-C it and start over? It seems that if I adjust those parameter values it may restart from the very beginning, as changing the opts file sometimes does that. Is that to be expected? If so, should I just bite the bullet and restart from the beginning, or is it best to finish a run and then change options?

Thanks,
Nate

*Nathaniel Jue, Ph.D.*
Department of Molecular and Cell Biology
University of Connecticut
Storrs, CT 06269

On Mon, Jul 7, 2014 at 12:26 PM, Carson Holt wrote:

> I don't think RepeatMasker produces GFF3. I believe it is GFF2 with the
> -gff option (which is pretty different).
Also If you provide GFF# files for > repeats, you will still need to turn of repeat masking in the control files > by blanking out the options. Also MAKER uses a step called RepeatRunner > against an internal transposable element protein databases which is > probably still running (and is slow because it's a search in translated > protein space). > > For performance, you may want to give a larger max_dna_len for the MAKER > run given that you have a large RAM machine. Also set all the depth_blast > in maker_bopts.ctl to 15 or 20. > > CEGMA is convenient for training predictors because it finds genes that > will always be in every eukaryote (I.e. high confidence). You can combine > these with est2genome/protein2genome results from MAKER if you want. You > can then use the resulting HMM for a larger MAKER run with experimental > evidence, and then train again on those results. But beware than there is > rarely any benefit from training beyond that second round. More training > actually tends to makes things worse (the overtraining paradox). > > --Carson > > > > From: Daniel Ence > Date: Monday, July 7, 2014 at 10:00 AM > To: Nathaniel Jue > Cc: "" > Subject: Re: [maker-devel] Couple quick questions about Maker > > Hi Nathaniel, > > 1) We'll need to see the error messages that MAKER was giving to > understand what might have gone wrong with the Repeat Masker gff3 file. If > you could run maker on one of your scaffolds with your current settings and > send us the complete output, we can start to figure out what happened. > > 2) MAKER interacts with its gene predictors (augustus, snap, and the other > ones listed in the control files) in a way that improves their performance > (with the hints and such). When you supply predictions through the pred_gff > parameter, MAKER can't give that performance improvement, so there's > something of a tradeoff there. 
I think the performance improvement is a key > part of MAKER's success, so I would definitely recommend running the > ab-initio tools internally. > > MAKER tries to save you time by saving results from run to run and only > rerunning tools (usually blast tools) that had their parameters changed in > the control files. Taking advantage of that will probably be the biggest > time saver for you. Something else that could save you almost as much time > would be to set a reasonable lower-bound on the size of contigs that maker > will try to annotate (usually <5kbp or <10kbp depending on your genome). > This parameter is set with the min_contig parameter. > > I'll have to check with my lab mates about the Repeat ORF searching and > how they use CEGMA results. I think you can probably just put them all in > there at once though. > > ~Daniel > > > > > Daniel Ence > Graduate Student > dence at genetics.utah.edu > Eccles Institute of Human Genetics > University of Utah > 15 North 2030 East, Room 2100 > Salt Lake City, UT 84112-5330 > > On Jul 7, 2014, at 9:26 AM, Nathaniel Jue > wrote: > > Hi, > > I'm trying to run maker on a couple genomes right now and was wondering if > folks had any thoughts on way to speed it up a bit. I'm running it on a > 48-processor supercomputer (lots of RAM, usually use it for genome > assembly). Both these genomes are a little fragmented, so there are lots of > contigs, which slows down the whole process. I am looking for ways to speed > things up and was wondering about a couple things: > > 1) I'm currently just at the first round of maker predictions using EST > and protein evidence to build models. Had already done RepeatMasking so > thought I'd just input subsequent GFF to speed it up. Didn't seem to like > the GFF, so two questions: i) any thoughts on why that GFF wasn't > acceptable? 
It's the one that repeatmasker outputs if you ask it to; and > ii) Providing this GFF, should generally allow the program to bypass the > RepeatMasking step, correct? Does it also make it bypass the Repeat ORF > searching step? > > 2) I plan to run both SNAP and Augustus on these genomes as well. The > two-step SNAP training from the tutorials seems straightforward, but I was > wondering about the Augustus step. From what I can tell, simply providing > an Augustus "trained" species name should turn on Augustus and > blast/blat-like hints generated within Maker are then used in gene > prediction. Any thoughts on if it's either more accurate or faster to do > the Augustus predictions outside of the Maker pipeline and then import them > using the pred_gff parameter in the maker_opts file? > > 3) Finally, I noticed that you had a script for converting cegma gff files > to zff file for snap training? Currently, I am using predicted transcript > for this species and protein sequences from related species to training. > Does anyone have any insight into using CEGMA results as well? Do you work > iteratively with them? For instance, start with the using hints from more > distant taxa (i.e. CEGMA) and then work your way closer? Just throw > everything in at once and retrain after that? > > Thanks in advance for any advice and insight. > > Cheers, > Nate > > > *Nathaniel Jue, Ph.D.* > Department of Molecular and Cell Biology > University of Connecticut > Storrs, CT 06269 > > [image: LinkedIn] > > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > > _______________________________________________ maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > -------------- next part -------------- An HTML attachment was scrubbed... 
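The control-file changes suggested earlier in the thread (blanking the repeat masking options when repeats are supplied as GFF3, raising max_dna_len, setting the depth_blast options) can be scripted rather than edited by hand. What follows is a minimal sketch, assuming the usual `key=value #comment` layout of maker_opts.ctl and maker_bopts.ctl. The helper set_ctl_options is hypothetical, and the repeat option names used in the test values (model_org, rm_gff) are not spelled out in the thread, so check them against the control files your MAKER version generates.

```python
import re
from typing import Dict


def set_ctl_options(ctl_text: str, updates: Dict[str, str]) -> str:
    """Return control-file text with the given key=value settings replaced.

    Assumes each option sits on its own line as `key=value #comment`.
    Keys that do not occur in the text are ignored rather than appended.
    """
    out = []
    for line in ctl_text.splitlines():
        match = re.match(r"^(\w+)=([^#]*)(#.*)?$", line)
        if match and match.group(1) in updates:
            key = match.group(1)
            comment = match.group(3) or ""
            # Keep the trailing comment so the file stays self-documenting;
            # an empty new value "blanks out" the option.
            line = f"{key}={updates[key]} {comment}".rstrip()
        out.append(line)
    return "\n".join(out)
```

Passing an empty string as the new value blanks an option, which is how the thread describes turning repeat masking off. Because changing repeat masking options makes MAKER start over, it is probably worth applying all such edits before the first full run.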
From carsonhh at gmail.com Mon Jul 7 11:26:34 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 07 Jul 2014 11:26:34 -0600
Subject: [maker-devel] Couple quick questions about Maker
In-Reply-To:
References: <00469BE3-C2F8-45BB-8357-EC47CE0D7376@genetics.utah.edu>
Message-ID:

Just ^C.

If you change options, then it will restart at a point determined by what will be affected by the change. Since repeat masking affects everything downstream, everything will start from zero. If it was a step like changing the HMM or altering depth_blastn, then it would be less disruptive and MAKER could reuse all existing raw reports. Unfortunately it's not that way for altering repeat masking options.

--Carson

From: Nathaniel Jue
Date: Monday, July 7, 2014 at 11:21 AM
To: Carson Holt
Cc: Daniel Ence
Subject: Re: [maker-devel] Couple quick questions about Maker

Thanks for the input, guys. I'm guessing the error was probably from not turning off the repeat prediction or from the GFF file. I have a script for converting repeatmasker output to gff so maybe I'll just try that if I want to follow up on it. Thanks for the tips on parameter adjustments and thoughts on running the program as well.

Just a few more quick follow-up questions for you: is there a preferred method for stopping a job so that it will be able to restart while maximizing the benefits of the run.logs, etc.? Or just Ctrl-C it and start over? It seems that if I adjust those parameter values it may restart from the very beginning, as changing the opts file sometimes does that. Is that to be expected? If so, should I just bite the bullet and restart from the beginning, or is it best to finish a run and then change options?

Thanks,
Nate

Nathaniel Jue, Ph.D.
Department of Molecular and Cell Biology
University of Connecticut
Storrs, CT 06269

On Mon, Jul 7, 2014 at 12:26 PM, Carson Holt wrote:

> I don't think RepeatMasker produces GFF3. I believe it is GFF2 with the -gff
> option (which is pretty different).
Also If you provide GFF# files for > repeats, you will still need to turn of repeat masking in the control files by > blanking out the options. Also MAKER uses a step called RepeatRunner against > an internal transposable element protein databases which is probably still > running (and is slow because it's a search in translated protein space). > > For performance, you may want to give a larger max_dna_len for the MAKER run > given that you have a large RAM machine. Also set all the depth_blast in > maker_bopts.ctl to 15 or 20. > > CEGMA is convenient for training predictors because it finds genes that will > always be in every eukaryote (I.e. high confidence). You can combine these > with est2genome/protein2genome results from MAKER if you want. You can then > use the resulting HMM for a larger MAKER run with experimental evidence, and > then train again on those results. But beware than there is rarely any > benefit from training beyond that second round. More training actually tends > to makes things worse (the overtraining paradox). > > --Carson > > > > From: Daniel Ence > Date: Monday, July 7, 2014 at 10:00 AM > To: Nathaniel Jue > Cc: "" > Subject: Re: [maker-devel] Couple quick questions about Maker > > Hi Nathaniel, > > 1) We'll need to see the error messages that MAKER was giving to understand > what might have gone wrong with the Repeat Masker gff3 file. If you could run > maker on one of your scaffolds with your current settings and send us the > complete output, we can start to figure out what happened. > > 2) MAKER interacts with its gene predictors (augustus, snap, and the other > ones listed in the control files) in a way that improves their performance > (with the hints and such). When you supply predictions through the pred_gff > parameter, MAKER can't give that performance improvement, so there's something > of a tradeoff there. 
I think the performance improvement is a key part of > MAKER's success, so I would definitely recommend running the ab-initio tools > internally. > > MAKER tries to save you time by saving results from run to run and only > rerunning tools (usually blast tools) that had their parameters changed in the > control files. Taking advantage of that will probably be the biggest time > saver for you. Something else that could save you almost as much time would be > to set a reasonable lower-bound on the size of contigs that maker will try to > annotate (usually <5kbp or <10kbp depending on your genome). This parameter is > set with the min_contig parameter. > > I'll have to check with my lab mates about the Repeat ORF searching and how > they use CEGMA results. I think you can probably just put them all in there at > once though. > > ~Daniel > > > > > Daniel Ence > Graduate Student > dence at genetics.utah.edu > Eccles Institute of Human Genetics > University of Utah > 15 North 2030 East, Room 2100 > Salt Lake City, UT 84112-5330 > > On Jul 7, 2014, at 9:26 AM, Nathaniel Jue > wrote: > >> Hi, >> >> I'm trying to run maker on a couple genomes right now and was wondering if >> folks had any thoughts on way to speed it up a bit. I'm running it on a >> 48-processor supercomputer (lots of RAM, usually use it for genome assembly). >> Both these genomes are a little fragmented, so there are lots of contigs, >> which slows down the whole process. I am looking for ways to speed things up >> and was wondering about a couple things: >> >> 1) I'm currently just at the first round of maker predictions using EST and >> protein evidence to build models. Had already done RepeatMasking so thought >> I'd just input subsequent GFF to speed it up. Didn't seem to like the GFF, so >> two questions: i) any thoughts on why that GFF wasn't acceptable? 
It's the >> one that repeatmasker outputs if you ask it to; and ii) Providing this GFF, >> should generally allow the program to bypass the RepeatMasking step, correct? >> Does it also make it bypass the Repeat ORF searching step? >> >> 2) I plan to run both SNAP and Augustus on these genomes as well. The >> two-step SNAP training from the tutorials seems straightforward, but I was >> wondering about the Augustus step. From what I can tell, simply providing an >> Augustus "trained" species name should turn on Augustus and blast/blat-like >> hints generated within Maker are then used in gene prediction. Any thoughts >> on if it's either more accurate or faster to do the Augustus predictions >> outside of the Maker pipeline and then import them using the pred_gff >> parameter in the maker_opts file? >> >> 3) Finally, I noticed that you had a script for converting cegma gff files to >> zff file for snap training? Currently, I am using predicted transcript for >> this species and protein sequences from related species to training. Does >> anyone have any insight into using CEGMA results as well? Do you work >> iteratively with them? For instance, start with the using hints from more >> distant taxa (i.e. CEGMA) and then work your way closer? Just throw >> everything in at once and retrain after that? >> >> Thanks in advance for any advice and insight. >> >> Cheers, >> Nate >> >> >> Nathaniel Jue, Ph.D. 
>> Department of Molecular and Cell Biology >> University of Connecticut >> Storrs, CT 06269 >> >> _______________________________________________ >> maker-devel mailing list >> maker-devel at box290.bluehost.com >> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
From n.jue at uconn.edu Tue Jul 8 09:56:37 2014 From: n.jue at uconn.edu (Nathaniel Jue) Date: Tue, 8 Jul 2014 11:56:37 -0400 Subject: [maker-devel] Couple quick questions about Maker In-Reply-To: References: <00469BE3-C2F8-45BB-8357-EC47CE0D7376@genetics.utah.edu> Message-ID: Carson, one more question: Any suggestions on how to combine the cegma and maker est2genome/protein2genome results? Can I just concatenate and sort the gff files or are there specific formatting issues I need to consider? No overlapping regions or something like that? Thanks, Nate *Nathaniel Jue, Ph.D.* Department of Molecular and Cell Biology University of Connecticut Storrs, CT 06269
From carsonhh at gmail.com Tue Jul 8 10:31:40 2014 From: carsonhh at gmail.com (Carson Holt) Date: Tue, 08 Jul 2014 10:31:40 -0600 Subject: [maker-devel] Couple quick questions about Maker In-Reply-To: References: <00469BE3-C2F8-45BB-8357-EC47CE0D7376@genetics.utah.edu> Message-ID: Convert them both to ZFF, then concatenate the ZFF and sequence files.
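The concatenation step above can be sketched in Python. This is only an illustration and not part of MAKER or SNAP: the file names in the example call are hypothetical stand-ins for the ZFF annotation and sequence files produced from each source, and the check that both combined files list the same sequence names in the same order is an assumption about what downstream training expects.

```python
from pathlib import Path


def concatenate(parts, out_path):
    """Write the files in `parts` to `out_path`, one after another."""
    with open(out_path, "w") as out:
        for part in parts:
            text = Path(part).read_text()
            # Make sure each piece ends with a newline so records do not fuse.
            if text and not text.endswith("\n"):
                text += "\n"
            out.write(text)


def deflines(path):
    """Return the sequence names from the '>' header lines of a ZFF or FASTA file."""
    names = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                names.append(fields[0] if fields else "")
    return names


def combine_training_sets(ann_files, dna_files, out_ann, out_dna):
    """Concatenate ZFF and sequence files in matching order and compare headers."""
    concatenate(ann_files, out_ann)
    concatenate(dna_files, out_dna)
    # Assumed sanity check: annotation and sequence files should stay in step.
    if deflines(out_ann) != deflines(out_dna):
        raise ValueError("ZFF and sequence files list different sequences")


# Example call (hypothetical file names):
# combine_training_sets(["cegma.ann", "maker.ann"], ["cegma.dna", "maker.dna"],
#                       "combined.ann", "combined.dna")
```

The files must be passed in the same order for annotations and sequences; the header comparison is there to catch a mismatch early rather than to validate the ZFF content itself.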
--Carson
From saad.arif at tuebingen.mpg.de Mon Jul 7 07:08:53 2014 From: saad.arif at tuebingen.mpg.de (Saad Arif) Date: Mon, 7 Jul 2014 15:08:53 +0200 Subject: [maker-devel] Help with updating an annotation In-Reply-To: References: <2176827D-6E54-4951-B01E-CDAC15DB3A2E@tuebingen.mpg.de> <997CF579-F509-4699-A366-30D53AD6281E@genetics.utah.edu> <90F65756-74A0-4D2F-A49F-42C5EDDB25E9@tuebingen.mpg.de> Message-ID: Thanks for this. Would the following protocol be appropriate then, given i want to augment and merge an existing annotation with any novel genes: i) Run MAKER pipeline iteratively to generate an HMM for SNAP using my new RNAseq data and protein fastas from closely related organisms (with esttogenome and proteintogenome options on). ii) Turn off esttoGenome and proteintoGenome options and run Maker with my RNAseq evidence, protein fastas, SNAP HMM and my current annotation as model_GFF.
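Step ii) of the protocol above comes down to a handful of edits to maker_opts.ctl. A minimal Python sketch of making them programmatically follows. It assumes the options written above as esttogenome and proteintogenome are est2genome and protein2genome, that the SNAP HMM is set through snaphmm, and that control-file lines have the form key=value #comment. The file names are hypothetical, and the sketch only mirrors the settings described in the question; it does not endorse them.

```python
import re

# Settings for the second round as described in the message above.
# Option names are assumed to match maker_opts.ctl; file names are hypothetical.
ROUND_TWO = {
    "est2genome": 0,
    "protein2genome": 0,
    "snaphmm": "round1.hmm",
    "model_gff": "ensembl_reference.gff3",
}


def set_ctl_options(ctl_text, updates):
    """Return ctl_text with the value of each key in `updates` replaced.

    Trailing '#' comments on a rewritten line are kept. A KeyError is raised
    if a requested option does not appear in the text.
    """
    remaining = dict(updates)
    out = []
    for line in ctl_text.splitlines():
        match = re.match(r"^(\w+)=([^#]*)(#.*)?$", line)
        if match and match.group(1) in remaining:
            key = match.group(1)
            comment = match.group(3) or ""
            line = f"{key}={remaining.pop(key)} {comment}".rstrip()
        out.append(line)
    if remaining:
        raise KeyError(f"options not found: {sorted(remaining)}")
    return "\n".join(out) + "\n"


# Example use (reads and rewrites a control file in the current directory):
# from pathlib import Path
# path = Path("maker_opts.ctl")
# path.write_text(set_ctl_options(path.read_text(), ROUND_TWO))
```

Raising on a missing key is a deliberate choice here, so that a misspelled option name fails loudly instead of being silently ignored.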
Thanks in advance for any input. best, Saad On 20 Jun 2014, at 23:42, Carson Holt wrote: > "I have to specify an ab initio gene model for any locus that I wish to annotate using evidence alignment (i.e. there must be a preexisting model)?" > > Not exactly. You need to supply an HMM for SNAP or a species file for Augustus, etc. MAKER doesn't generate gene predictions, SNAP does. You cannot get updated models unless you've provided a way for those models to be updated. MAKER will provide SNAP/Augustus with hints to make them perform better based on the evidence, but those hints won't even be generated and the programs won't even run unless you provide the HMM. Also if you provide models in gff3 format to pred_gff, there is no hint feedback (because there is no program to receive the hints - just an immutable GFF3 file). > > If you don't have an HMM for SNAP for your organism, you can generate one using the documentation here (from GMOD 2014 tutorial) --> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Training_ab_initio_Gene_Predictors > > --Carson > > > > From: Saad Arif > Date: Wednesday, June 18, 2014 at 10:42 AM > To: Daniel Ence > Cc: "" > Subject: Re: [maker-devel] Help with updating an annotation > > Thanks Daniel. I think it's more clear to me now. > > So if I understand correctly now: I have to specify an ab initio gene model for any locus that I wish to annotate using evidence alignment (i.e. there must be a preexisting model)? These ab initio gene models can be trained internally in Maker with SNAP using my cufflinks output as EST evidence. Alternatively, I can provide alternative ab initio predictions (for regions not present in my ensembl ref passed to model_GFF) for regions overlapping my cufflinks output via the pred_GFF option? > > Since I'm interested in unannotated regions, I'm also passing in reference proteomes of closely related species as protein homology evidence.
> > As such, I should be able to keep only evidence-supported predictions (for regions not present in my model_GFF and/or better supported models for present regions) from my pred_GFF and merge them with Ensembl annotations from the model_GFF? > > Let me know if I'm still missing something here. > > Thanks in advance. > > best, > Saad > On 18 Jun 2014, at 17:21, Daniel Ence wrote: > >> Hi Saad, >> >> Maker doesn't view EST or protein evidence as gene models in themselves. There's a good reason for this. Aligners like blast don't guarantee complete gene models, with accurate start and stop codons and splice sites. With its default settings maker won't make a gene model unless there's evidence that overlaps an ab-initio prediction (or something from the pred_gff option). >> >> You can use est2genome to promote everything from the est_gff option to a gene model, but this will probably give you many spurious results. What you're saying with est2genome is, "Everything that this tool found is a complete gene model." I don't think that's true even for cufflinks output. >> >> One of the gene predictors that can run internally is snap. It's really easy to train; here's a link to a tutorial for training it: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Training_ab_initio_Gene_Predictors >> >> Let me know if that helps, or if you have more questions. >> >> >> ~Daniel >> >> Daniel Ence >> Graduate Student >> dence at genetics.utah.edu >> Eccles Institute of Human Genetics >> University of Utah >> 15 North 2030 East, Room 2100 >> Salt Lake City, UT 84112-5330 >> >> On Jun 18, 2014, at 5:09 AM, Saad Arif >> wrote: >> >>> Thank you for the response. I still have one question though, with these options: >>> >>> est_GFF=cufflinksout.GFF >>> >>> model_GFF= ensembl reference.GFF >>> >>> What happens to cufflinks assembled transcripts that are not confined to current gene loci (i.e. novel genes in cufflinks output)?
Would I have to prepare ab initio gene predictions for each of these predicted 'new' genes? >>> Is there a simple way to combine adding (new genes) and improving of an existing annotation? >>> >>> Any feedback on this would be greatly appreciated. >>> >>> saad >>> >>> On 13 Jun 2014, at 17:59, Carson Holt wrote: >>> >>>> Use the cufflinks features instead of the tophat features (tophat tends to be >>>> really noisy). Give the existing models to model_gff (they will then >>>> always be kept unless something better is found). There is no option to >>>> keep models and then just add isoforms. The model_gff input will either >>>> be kept as is (unchanged), or replaced with an updated model suggested by >>>> the evidence (the updated model may contain multiple isoforms though), and >>>> map_forward=1 can be used to pull names forward from the old model onto >>>> the new models. >>>> >>>> Thanks, >>>> Carson >>>> >>>> >>>> On 6/13/14, 5:03 AM, "Saad Arif" wrote: >>>> >>>>> Dear All, >>>>> >>>>> I would like to use Maker pipeline to expand a current annotation (new >>>>> isoforms and novel genes with respect to current annotation) and was >>>>> wondering if anyone had experience with this and/or suggestions to my >>>>> questions. >>>>> >>>>> Briefly: >>>>> >>>>> I have tophat splice junctions from RNAseq data or alternatively >>>>> cufflinks generated transcript models (FASTA format) that I want to use >>>>> as my new data (est_gff or est). >>>>> >>>>> I want to provide the current Ensembl annotation for gene prediction but >>>>> I want this annotation to remain unchanged. Hence, I'm not sure if I >>>>> should provide this annotation as pred_gff >>>>> or model_gff. Can the model_gff be used for gene prediction or is this >>>>> just a subset of pred_gff that remains unaltered? Can we provide the same >>>>> annotation for both options (pred_gff and model_gff)?
>>>>> >>>>> >>>>> >>>>> Importantly, my main goal is to use the new RNAseq data to add more >>>>> isoforms and (any) novel genes to the existing Ensembl annotation. Any >>>>> thoughts or suggestions on how to go about this would be sincerely >>>>> appreciated. >>>>> >>>>> >>>>> Thanks in advance, >>>>> saad >>>>> >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> maker-devel mailing list >>>>> maker-devel at box290.bluehost.com >>>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org >>>> >>>> >>> >>> >>> _______________________________________________ >>> maker-devel mailing list >>> maker-devel at box290.bluehost.com >>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org >> > > _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.com http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Thu Jul 10 15:38:52 2014 From: carsonhh at gmail.com (Carson Holt) Date: Thu, 10 Jul 2014 15:38:52 -0600 Subject: [maker-devel] Couple quick questions about Maker In-Reply-To: References: <00469BE3-C2F8-45BB-8357-EC47CE0D7376@genetics.utah.edu> Message-ID: Just rerun in the same directory and it will reuse the old reports, so it won't have to rerun RepeatMasker etc. --Carson From: Nathaniel Jue Date: Thursday, July 10, 2014 at 3:36 PM To: Carson Holt Subject: Re: [maker-devel] Couple quick questions about Maker Is there a way to by-pass the repeat prediction after it's done the first time? Seems like when I went to re-do the snap training, it's decided to re-do the repeatmasking as well. Does it always do that? Maybe I'm misinterpreting something? If there is a way to give it a maker generated gff to by-pass that step could you tell me where to find it? 
Thanks, Nate
From carsonhh at gmail.com Thu Jul 10 15:44:48 2014 From: carsonhh at gmail.com (Carson Holt) Date: Thu, 10 Jul 2014 15:44:48 -0600 Subject: [maker-devel] Couple quick questions about Maker In-Reply-To: References: <00469BE3-C2F8-45BB-8357-EC47CE0D7376@genetics.utah.edu> Message-ID: Also you can use repeat_gff in the control files, but I prefer just to rerun in the same directory as the previous job. --Carson From: Carson Holt Date: Thursday, July 10, 2014 at 3:38 PM To: Nathaniel Jue Cc: "" Subject: Re: [maker-devel] Couple quick questions about Maker Just rerun in the same directory and it will reuse the old reports, so it won't have to rerun RepeatMasker etc.
--Carson From: Nathaniel Jue Date: Thursday, July 10, 2014 at 3:36 PM To: Carson Holt Subject: Re: [maker-devel] Couple quick questions about Maker Is there a way to by-pass the repeat prediction after it's done the first time? Seems like when I went to re-do the snap training, it's decided to re-do the repeatmasking as well. Does it always do that? Maybe I'm misinterpreting something? If there is a way to give it a maker generated gff to by-pass that step could you tell me where to find it? Thanks, Nate On Tue, Jul 8, 2014 at 12:31 PM, Carson Holt wrote: > Convert them both to ZFF, then concatenate the ZFF and sequence files. > > --Carson > > > From: Nathaniel Jue > Date: Tuesday, July 8, 2014 at 9:56 AM > To: Carson Holt > Cc: Daniel Ence , "" > > > Subject: Re: [maker-devel] Couple quick questions about Maker > > Carson, one more question: Any suggestions on how to combine the cegma and > maker est2genome/protein2genome results? Can I just concatenate and sort the > gff files or are there specific formatting issues I need to consider? No > overlapping regions or something like that? > > Thanks, > Nate > > > Nathaniel Jue, Ph.D. > > Department of Molecular and Cell Biology > > University of Connecticut > > Storrs, CT 06269 > > > On Mon, Jul 7, 2014 at 12:26 PM, Carson Holt wrote: >> I don't think RepeatMasker produces GFF3. I believe it is GFF2 with the -gff >> option (which is pretty different). Also, if you provide GFF3 files for >> repeats, you will still need to turn off repeat masking in the control files >> by blanking out the options. Also MAKER uses a step called RepeatRunner >> against an internal transposable element protein database, which is probably >> still running (and is slow because it's a search in translated protein >> space). >> >> For performance, you may want to give a larger max_dna_len for the MAKER run >> given that you have a large RAM machine.
>> Also set all the depth_blast in maker_bopts.ctl to 15 or 20.
>>
>> CEGMA is convenient for training predictors because it finds genes that will
>> always be in every eukaryote (i.e. high confidence). You can combine these
>> with est2genome/protein2genome results from MAKER if you want. You can then
>> use the resulting HMM for a larger MAKER run with experimental evidence, and
>> then train again on those results. But beware that there is rarely any
>> benefit from training beyond that second round. More training actually tends
>> to make things worse (the overtraining paradox).
>>
>> --Carson
>>
>> From: Daniel Ence
>> Date: Monday, July 7, 2014 at 10:00 AM
>> To: Nathaniel Jue
>> Cc: ""
>> Subject: Re: [maker-devel] Couple quick questions about Maker
>>
>> Hi Nathaniel,
>>
>> 1) We'll need to see the error messages that MAKER was giving to understand
>> what might have gone wrong with the RepeatMasker GFF3 file. If you could run
>> MAKER on one of your scaffolds with your current settings and send us the
>> complete output, we can start to figure out what happened.
>>
>> 2) MAKER interacts with its gene predictors (Augustus, SNAP, and the other
>> ones listed in the control files) in a way that improves their performance
>> (with the hints and such). When you supply predictions through the pred_gff
>> parameter, MAKER can't give that performance improvement, so there's
>> something of a tradeoff there. I think the performance improvement is a key
>> part of MAKER's success, so I would definitely recommend running the
>> ab-initio tools internally.
>>
>> MAKER tries to save you time by saving results from run to run and only
>> rerunning tools (usually BLAST tools) that had their parameters changed in
>> the control files. Taking advantage of that will probably be the biggest time
>> saver for you.
Something else that could save you almost as much time would >> be to set a reasonable lower-bound on the size of contigs that maker will try >> to annotate (usually <5kbp or <10kbp depending on your genome). This >> parameter is set with the min_contig parameter. >> >> I'll have to check with my lab mates about the Repeat ORF searching and how >> they use CEGMA results. I think you can probably just put them all in there >> at once though. >> >> ~Daniel >> >> >> >> >> Daniel Ence >> Graduate Student >> dence at genetics.utah.edu >> Eccles Institute of Human Genetics >> University of Utah >> 15 North 2030 East, Room 2100 >> Salt Lake City, UT 84112-5330 >> >> On Jul 7, 2014, at 9:26 AM, Nathaniel Jue >> wrote: >> >>> Hi, >>> >>> I'm trying to run maker on a couple genomes right now and was wondering if >>> folks had any thoughts on way to speed it up a bit. I'm running it on a >>> 48-processor supercomputer (lots of RAM, usually use it for genome >>> assembly). Both these genomes are a little fragmented, so there are lots of >>> contigs, which slows down the whole process. I am looking for ways to speed >>> things up and was wondering about a couple things: >>> >>> 1) I'm currently just at the first round of maker predictions using EST and >>> protein evidence to build models. Had already done RepeatMasking so thought >>> I'd just input subsequent GFF to speed it up. Didn't seem to like the GFF, >>> so two questions: i) any thoughts on why that GFF wasn't acceptable? It's >>> the one that repeatmasker outputs if you ask it to; and ii) Providing this >>> GFF, should generally allow the program to bypass the RepeatMasking step, >>> correct? Does it also make it bypass the Repeat ORF searching step? >>> >>> 2) I plan to run both SNAP and Augustus on these genomes as well. The >>> two-step SNAP training from the tutorials seems straightforward, but I was >>> wondering about the Augustus step. 
From what I can tell, simply providing an >>> Augustus "trained" species name should turn on Augustus and blast/blat-like >>> hints generated within Maker are then used in gene prediction. Any thoughts >>> on if it's either more accurate or faster to do the Augustus predictions >>> outside of the Maker pipeline and then import them using the pred_gff >>> parameter in the maker_opts file? >>> >>> 3) Finally, I noticed that you had a script for converting cegma gff files >>> to zff file for snap training? Currently, I am using predicted transcript >>> for this species and protein sequences from related species to training. >>> Does anyone have any insight into using CEGMA results as well? Do you work >>> iteratively with them? For instance, start with the using hints from more >>> distant taxa (i.e. CEGMA) and then work your way closer? Just throw >>> everything in at once and retrain after that? >>> >>> Thanks in advance for any advice and insight. >>> >>> Cheers, >>> Nate >>> >>> >>> Nathaniel Jue, Ph.D. >>> Department of Molecular and Cell Biology >>> University of Connecticut >>> Storrs, CT 06269 >>> >>> >>> >> aniel-jue%2F1%2F531%2F176%2F&sn=> >>> _______________________________________________ >>> maker-devel mailing list >>> maker-devel at box290.bluehost.com >>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org >> >> _______________________________________________ maker-devel mailing list >> maker-devel at box290.bluehost.comhttp://box290.bluehost.com/mailman/listinfo/ma >> ker-devel_yandell-lab.org > -------------- next part -------------- An HTML attachment was scrubbed... 
URL:

From fbarreto at ucsd.edu Thu Jul 10 16:02:57 2014
From: fbarreto at ucsd.edu (Felipe Barreto)
Date: Thu, 10 Jul 2014 15:02:57 -0700
Subject: [maker-devel] Problem installing exonerate during Maker setup
Message-ID:

Hi experts,

I am trying to install Maker on a new machine (running Mac OS 10.7.5), and have succeeded so far except for the "./Build exonerate" step, which gives me the following error:

    checking for socklen_t... yes
    checking for pkg-config... no
    ERROR: Could not find pkg-config ... is glib-2 installed ???

Fink for 64-bit is installed, and via 'fink list' I confirmed that glib2-dev and -shlibs are installed. I uninstalled and re-installed both fink and glib2 several times, hoping it was a configuration problem, but I still get stuck at this step.

I found a few previous questions about this issue in this forum, but the solutions Carson provided were apparently directed at OS 10.6 only, so I did not try these. I have run into the limit of what I know how to do with these compilations.

I tried setting up Exonerate directly but it has trouble finding glib as well.

Any suggestions?

Thank you so much!
--
Felipe Barreto
Post-doctoral Scholar
Scripps Institution of Oceanography
University of California, San Diego
La Jolla, CA 92093
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From fbarreto at ucsd.edu Thu Jul 10 17:41:59 2014
From: fbarreto at ucsd.edu (Felipe Barreto)
Date: Thu, 10 Jul 2014 16:41:59 -0700
Subject: [maker-devel] Problem installing exonerate during Maker setup
In-Reply-To:
References:
Message-ID:

OK, before anyone spends too much of their time trying to help me... I think I was able to solve my issue above.

What I did was to install an additional glib2-related package using fink install. I installed glibmm2.4-dev, which also installs glibmm2.4-shlib. These make up a C++ interface for the glib2 library, according to their description.
Once I installed those packages, I re-ran ./Build exonerate and it seemed to work. I tried an exonerate command in Terminal and it was recognized OK. Hopefully what I did won't cause any issues down the line.

Thanks.

On Thu, Jul 10, 2014 at 3:02 PM, Felipe Barreto wrote:
> Hi experts,
>
> I am trying to install Maker on a new machine (running Mac OS 10.7.5), and
> have succeeded so far except for the "./Build exonerate" step, which gives me
> the following error:
>
> checking for socklen_t... yes
> checking for pkg-config... no
> ERROR: Could not find pkg-config ... is glib-2 installed ???
>
> Fink for 64-bit is installed, and via 'fink list' I confirmed that
> glib2-dev and -shlibs are installed. I uninstalled and re-installed both
> fink and glib2 several times, hoping it was a configuration problem, but
> I still get stuck at this step.
>
> I found a few previous questions about this issue in this forum, but the
> solutions Carson provided were apparently directed at OS 10.6 only, so I
> did not try these. I have run into the limit of what I know how to do with
> these compilations.
>
> I tried setting up Exonerate directly but it has trouble finding glib as
> well.
>
> Any suggestions?
>
> Thank you so much!
> --
> Felipe Barreto
> Post-doctoral Scholar
> Scripps Institution of Oceanography
> University of California, San Diego
> La Jolla, CA 92093

--
Felipe Barreto
Post-doctoral Scholar
Scripps Institution of Oceanography
University of California, San Diego
La Jolla, CA 92093
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From panos.ioannidis at gmail.com Fri Jul 11 05:56:03 2014
From: panos.ioannidis at gmail.com (Panos Ioannidis)
Date: Fri, 11 Jul 2014 13:56:03 +0200
Subject: [maker-devel] (no subject)
Message-ID:

I got back to my annotations this past week and have a couple of questions!
1) Since my organism isn't closely related with any other that's already sequenced, I will have to run maker twice (according to the tutorial). So for the first run I see that some people use only the ESTs and some others use ESTs and a protein database (CEGMA, Uniref50, Swiss-Prot, etc). I guess that the ESTs will give better models, but for the cases where genes aren't covered by an EST, it's okay to have a protein database to detect them as well. Am I right? What do you think? 2) In case I use both ESTs and a protein database how should I set the est2genome and protein2genome parameters in the maker_opts.ctl file? Should they both equal to "1"? 3) I've been thinking of running the BLAST searches separately and giving Maker directly the results. I guess that in this case, I'll have to first convert the BLAST output to a gff3 file and give it to the protein_gff parameter, right? Thanks, Panos -------------- next part -------------- An HTML attachment was scrubbed... URL: From dence at genetics.utah.edu Fri Jul 11 08:08:43 2014 From: dence at genetics.utah.edu (Daniel Ence) Date: Fri, 11 Jul 2014 14:08:43 +0000 Subject: [maker-devel] (no subject) In-Reply-To: References: Message-ID: Hi Panos, 1) You'll only use est2genome and protein2genome for creating models that will be used for training the ab-initio predictors (like SNAP). Sometimes that means one run of MAKER for training; sometimes that means two runs of MAKER. You usually don't gain any accuracy after the second round of training. It's ok to use both EST and protein data for this training step. 2) If you're using both ESTs and protein sequence to train your ab-initio predictors, then both est2genome and protein2genome should be set to 1. 
3) If you want to pass Blast results to MAKER, you'll need to pass those results as GFF3, but MAKER will install and run blast for you, and does a good job of keeping track of all those results and making them accessible to you in the end, so it's going to be a lot of work to do those blasts on your own outside of MAKER. I seriously suggest that you use blast internal to maker. Daniel Ence Graduate Student Eccles Institute of Human Genetics University of Utah 15 North 2030 East, Room 2100 Salt Lake City, UT 84112-5330 ________________________________ From: maker-devel [maker-devel-bounces at yandell-lab.org] on behalf of Panos Ioannidis [panos.ioannidis at gmail.com] Sent: Friday, July 11, 2014 5:56 AM To: maker-devel Subject: [maker-devel] (no subject) I got back to my annotations this past week and have a couple of questions! 1) Since my organism isn't closely related with any other that's already sequenced, I will have to run maker twice (according to the tutorial). So for the first run I see that some people use only the ESTs and some others use ESTs and a protein database (CEGMA, Uniref50, Swiss-Prot, etc). I guess that the ESTs will give better models, but for the cases where genes aren't covered by an EST, it's okay to have a protein database to detect them as well. Am I right? What do you think? 2) In case I use both ESTs and a protein database how should I set the est2genome and protein2genome parameters in the maker_opts.ctl file? Should they both equal to "1"? 3) I've been thinking of running the BLAST searches separately and giving Maker directly the results. I guess that in this case, I'll have to first convert the BLAST output to a gff3 file and give it to the protein_gff parameter, right? Thanks, Panos -------------- next part -------------- An HTML attachment was scrubbed... 
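
[Editorial sketch] Answer (2) in the message above amounts to switching two options in maker_opts.ctl from 0 to 1. A minimal Python sketch of that edit follows. The helper name set_ctl_options is hypothetical, and the sample text only imitates the "key=value #comment" layout of a MAKER control file; it is not copied from any particular MAKER release, so check the option names against your own maker_opts.ctl.

```python
import re

def set_ctl_options(ctl_text, updates):
    """Return ctl_text with each `key=value` line named in `updates` rewritten.

    Trailing `#comment` text on a line is preserved. Keys that are not
    found raise KeyError so that typos are not silently ignored.
    """
    remaining = dict(updates)
    out = []
    for line in ctl_text.splitlines():
        m = re.match(r"^(\w+)=([^#]*)(#.*)?$", line)
        if m and m.group(1) in remaining:
            key = m.group(1)
            comment = m.group(3) or ""
            value = str(remaining.pop(key))
            line = f"{key}={value} {comment}".rstrip()
        out.append(line)
    if remaining:
        raise KeyError(f"options not found: {sorted(remaining)}")
    return "\n".join(out) + "\n"

# Illustrative stand-in for part of maker_opts.ctl (assumed layout).
sample = (
    "#-----Gene Prediction\n"
    "est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no\n"
    "protein2genome=0 #infer predictions from protein homology, 1 = yes, 0 = no\n"
)
updated = set_ctl_options(sample, {"est2genome": 1, "protein2genome": 1})
print(updated)
```

Editing the file by hand works just as well; the point is only that both options are set to 1 for the training run, as described above.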
URL: From panos.ioannidis at gmail.com Mon Jul 14 01:20:50 2014 From: panos.ioannidis at gmail.com (Panos Ioannidis) Date: Mon, 14 Jul 2014 09:20:50 +0200 Subject: [maker-devel] (no subject) In-Reply-To: References: Message-ID: Daniel, thanks for the info. Regarding (3), the only reason I think of running BLASTs separately is because I'm currently not able to run Maker on our cluster due to a problem in the Perl "forks" library. And it looks like there isn't much I can do about it; I tried Perlbrew but it doesn't work when I try to install versions <5.18 (the problem in forks occurs on 5.18 and later versions). Our admin also tried to change the code in the forks.pm file as per Carson's suggestion in another thread, but that didn't work either... As a result I'm running Maker on my workstation (really slooow) till a solution is found and since BLAST is a time-consuming step I was thinking of running it separately. On Fri, Jul 11, 2014 at 4:08 PM, Daniel Ence wrote: > Hi Panos, > > 1) You'll only use est2genome and protein2genome for creating models > that will be used for training the ab-initio predictors (like SNAP). > Sometimes that means one run of MAKER for training; sometimes that means > two runs of MAKER. You usually don't gain any accuracy after the second > round of training. It's ok to use both EST and protein data for this > training step. > > 2) If you're using both ESTs and protein sequence to train your > ab-initio predictors, then both est2genome and protein2genome should be set > to 1. > > 3) If you want to pass Blast results to MAKER, you'll need to pass those > results as GFF3, but MAKER will install and run blast for you, and does a > good job of keeping track of all those results and making them accessible > to you in the end, so it's going to be a lot of work to do those blasts on > your own outside of MAKER. I seriously suggest that you use blast internal > to maker. 
> > Daniel Ence > Graduate Student > Eccles Institute of Human Genetics > University of Utah > 15 North 2030 East, Room 2100 > Salt Lake City, UT 84112-5330 > ------------------------------ > *From:* maker-devel [maker-devel-bounces at yandell-lab.org] on behalf of > Panos Ioannidis [panos.ioannidis at gmail.com] > *Sent:* Friday, July 11, 2014 5:56 AM > *To:* maker-devel > *Subject:* [maker-devel] (no subject) > > I got back to my annotations this past week and have a couple of > questions! > > 1) Since my organism isn't closely related with any other that's already > sequenced, I will have to run maker twice (according to the tutorial). So > for the first run I see that some people use only the ESTs and some others > use ESTs and a protein database (CEGMA, Uniref50, Swiss-Prot, etc). I guess > that the ESTs will give better models, but for the cases where genes aren't > covered by an EST, it's okay to have a protein database to detect them as > well. Am I right? What do you think? > > 2) In case I use both ESTs and a protein database how should I set the > est2genome and protein2genome parameters in the maker_opts.ctl file? > Should they both equal to "1"? > > 3) I've been thinking of running the BLAST searches separately and > giving Maker directly the results. I guess that in this case, I'll have to > first convert the BLAST output to a gff3 file and give it to the > protein_gff parameter, right? > > Thanks, > Panos > -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Mon Jul 14 08:46:50 2014 From: carsonhh at gmail.com (Carson Holt) Date: Mon, 14 Jul 2014 08:46:50 -0600 Subject: [maker-devel] (no subject) In-Reply-To: References: Message-ID: If you do the BLAST's yourself the results could be dramatically worse. The filtering and polishing done by MAKER is rather significant (direct BLAST is actually worse with homology searches than many people realize). 
With respect to forks.pm, your admin most likely edited the wrong forks.pm. There may be more than one on your system. If you let maker install some prerequisites for you (because it requires a specific version of forks.pm), it may be in .../maker/perl/lib/forks.pm. Otherwise you have to identify the exact location of the forks.pm being used. Or if he is editing it as part of the install tarball, his edits may actually be undone during the installation procedure. Use this command line to identify the location of the forks.pm module that would have to be edited --> maker --debug 2>&1 | grep "forks.pm" You can even send me a copy of the file once it has been edited, and I can tell you if it was done correctly. --Carson From: Panos Ioannidis Date: Monday, July 14, 2014 at 1:20 AM To: Daniel Ence Cc: maker-devel Subject: Re: [maker-devel] (no subject) Daniel, thanks for the info. Regarding (3), the only reason I think of running BLASTs separately is because I'm currently not able to run Maker on our cluster due to a problem in the Perl "forks" library. And it looks like there isn't much I can do about it; I tried Perlbrew but it doesn't work when I try to install versions <5.18 (the problem in forks occurs on 5.18 and later versions). Our admin also tried to change the code in the forks.pm file as per Carson's suggestion in another thread, but that didn't work either... As a result I'm running Maker on my workstation (really slooow) till a solution is found and since BLAST is a time-consuming step I was thinking of running it separately. On Fri, Jul 11, 2014 at 4:08 PM, Daniel Ence wrote: > Hi Panos, > > 1) You'll only use est2genome and protein2genome for creating models that will > be used for training the ab-initio predictors (like SNAP). Sometimes that > means one run of MAKER for training; sometimes that means two runs of MAKER. > You usually don't gain any accuracy after the second round of training. 
It's > ok to use both EST and protein data for this training step. > > 2) If you're using both ESTs and protein sequence to train your ab-initio > predictors, then both est2genome and protein2genome should be set to 1. > > 3) If you want to pass Blast results to MAKER, you'll need to pass those > results as GFF3, but MAKER will install and run blast for you, and does a good > job of keeping track of all those results and making them accessible to you in > the end, so it's going to be a lot of work to do those blasts on your own > outside of MAKER. I seriously suggest that you use blast internal to maker. > > Daniel Ence > Graduate Student > Eccles Institute of Human Genetics > University of Utah > 15 North 2030 East, Room 2100 > Salt Lake City, UT 84112-5330 > > From: maker-devel [maker-devel-bounces at yandell-lab.org] on behalf of Panos > Ioannidis [panos.ioannidis at gmail.com] > Sent: Friday, July 11, 2014 5:56 AM > To: maker-devel > Subject: [maker-devel] (no subject) > > I got back to my annotations this past week and have a couple of questions! > > 1) Since my organism isn't closely related with any other that's already > sequenced, I will have to run maker twice (according to the tutorial). So for > the first run I see that some people use only the ESTs and some others use > ESTs and a protein database (CEGMA, Uniref50, Swiss-Prot, etc). I guess that > the ESTs will give better models, but for the cases where genes aren't covered > by an EST, it's okay to have a protein database to detect them as well. Am I > right? What do you think? > > 2) In case I use both ESTs and a protein database how should I set the > est2genome and protein2genome parameters in the maker_opts.ctl file? Should > they both equal to "1"? > > 3) I've been thinking of running the BLAST searches separately and giving > Maker directly the results. I guess that in this case, I'll have to first > convert the BLAST output to a gff3 file and give it to the protein_gff > parameter, right? 
> > Thanks, > Panos _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.com http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Mon Jul 14 08:49:33 2014 From: carsonhh at gmail.com (Carson Holt) Date: Mon, 14 Jul 2014 08:49:33 -0600 Subject: [maker-devel] (no subject) In-Reply-To: References: Message-ID: Also one more question. What is the exact error text you get for the forks error? Is it a forks.pm error or is it an MPI warn on fork error (which are actually very different). --Carson From: Carson Holt Date: Monday, July 14, 2014 at 8:46 AM To: Panos Ioannidis , Daniel Ence Cc: maker-devel Subject: Re: [maker-devel] (no subject) If you do the BLAST's yourself the results could be dramatically worse. The filtering and polishing done by MAKER is rather significant (direct BLAST is actually worse with homology searches than many people realize). With respect to forks.pm, your admin most likely edited the wrong forks.pm. There may be more than one on your system. If you let maker install some prerequisites for you (because it requires a specific version of forks.pm), it may be in .../maker/perl/lib/forks.pm. Otherwise you have to identify the exact location of the forks.pm being used. Or if he is editing it as part of the install tarball, his edits may actually be undone during the installation procedure. Use this command line to identify the location of the forks.pm module that would have to be edited --> maker --debug 2>&1 | grep "forks.pm" You can even send me a copy of the file once it has been edited, and I can tell you if it was done correctly. --Carson From: Panos Ioannidis Date: Monday, July 14, 2014 at 1:20 AM To: Daniel Ence Cc: maker-devel Subject: Re: [maker-devel] (no subject) Daniel, thanks for the info. 
Regarding (3), the only reason I think of running BLASTs separately is because I'm currently not able to run Maker on our cluster due to a problem in the Perl "forks" library. And it looks like there isn't much I can do about it; I tried Perlbrew but it doesn't work when I try to install versions <5.18 (the problem in forks occurs on 5.18 and later versions). Our admin also tried to change the code in the forks.pm file as per Carson's suggestion in another thread, but that didn't work either... As a result I'm running Maker on my workstation (really slooow) till a solution is found and since BLAST is a time-consuming step I was thinking of running it separately. On Fri, Jul 11, 2014 at 4:08 PM, Daniel Ence wrote: > Hi Panos, > > 1) You'll only use est2genome and protein2genome for creating models that will > be used for training the ab-initio predictors (like SNAP). Sometimes that > means one run of MAKER for training; sometimes that means two runs of MAKER. > You usually don't gain any accuracy after the second round of training. It's > ok to use both EST and protein data for this training step. > > 2) If you're using both ESTs and protein sequence to train your ab-initio > predictors, then both est2genome and protein2genome should be set to 1. > > 3) If you want to pass Blast results to MAKER, you'll need to pass those > results as GFF3, but MAKER will install and run blast for you, and does a good > job of keeping track of all those results and making them accessible to you in > the end, so it's going to be a lot of work to do those blasts on your own > outside of MAKER. I seriously suggest that you use blast internal to maker. 
> > Daniel Ence > Graduate Student > Eccles Institute of Human Genetics > University of Utah > 15 North 2030 East, Room 2100 > Salt Lake City, UT 84112-5330 > > From: maker-devel [maker-devel-bounces at yandell-lab.org] on behalf of Panos > Ioannidis [panos.ioannidis at gmail.com] > Sent: Friday, July 11, 2014 5:56 AM > To: maker-devel > Subject: [maker-devel] (no subject) > > I got back to my annotations this past week and have a couple of questions! > > 1) Since my organism isn't closely related with any other that's already > sequenced, I will have to run maker twice (according to the tutorial). So for > the first run I see that some people use only the ESTs and some others use > ESTs and a protein database (CEGMA, Uniref50, Swiss-Prot, etc). I guess that > the ESTs will give better models, but for the cases where genes aren't covered > by an EST, it's okay to have a protein database to detect them as well. Am I > right? What do you think? > > 2) In case I use both ESTs and a protein database how should I set the > est2genome and protein2genome parameters in the maker_opts.ctl file? Should > they both equal to "1"? > > 3) I've been thinking of running the BLAST searches separately and giving > Maker directly the results. I guess that in this case, I'll have to first > convert the BLAST output to a gff3 file and give it to the protein_gff > parameter, right? > > Thanks, > Panos _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.comhttp://box290.bluehost.com/mailman/listinfo/m aker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From panos.ioannidis at gmail.com Tue Jul 15 00:59:18 2014 From: panos.ioannidis at gmail.com (Panos Ioannidis) Date: Tue, 15 Jul 2014 08:59:18 +0200 Subject: [maker-devel] (no subject) In-Reply-To: References: Message-ID: I didn't know there are more than one forks.pm files! We'll give it another try later today. 
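
[Editorial sketch] Since more than one copy of forks.pm may exist on a system, it can help to list every candidate before editing any of them. The Python sketch below only enumerates files by name under directories you supply; the function name find_module_copies and the example paths are hypothetical. It does not tell you which copy MAKER actually loads; for that, the `maker --debug 2>&1 | grep "forks.pm"` command quoted in this thread is the one to use.

```python
import os

def find_module_copies(search_dirs, filename="forks.pm"):
    """Return a sorted list of every path named `filename` under search_dirs."""
    hits = []
    for root in search_dirs:
        # os.walk silently yields nothing for directories that do not exist.
        for dirpath, _dirnames, filenames in os.walk(root):
            if filename in filenames:
                hits.append(os.path.join(dirpath, filename))
    return sorted(hits)

if __name__ == "__main__":
    # Hypothetical locations; substitute your Perl library directories and
    # the perl/lib directory of your own MAKER installation.
    candidates = ["/path/to/maker/perl/lib", "/path/to/perl5/lib"]
    for path in find_module_copies(candidates):
        print(path)
```

If the listing shows several copies, the one reported by the MAKER debug output is the one whose edit matters.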
As for the error, it's just "Segmentation fault"! And we know this segfault is because of forks.pm, because if you remove the "use forks;" line script execution continues without segfault (till it crashes later for another reason, of course). In fact, even if you create a script with just the line "use forks;" and try to run it, you'll get a segfault. So it looks like it's something pretty general and serious, and I'm really surprised I can't find anything by googling (except your fix!)... On Mon, Jul 14, 2014 at 4:49 PM, Carson Holt wrote: > Also one more question. What is the exact error text you get for the > forks error? Is it a forks.pm error or is it an MPI warn on fork error > (which are actually very different). > > --Carson > > > From: Carson Holt > Date: Monday, July 14, 2014 at 8:46 AM > To: Panos Ioannidis , Daniel Ence < > dence at genetics.utah.edu> > > Cc: maker-devel > Subject: Re: [maker-devel] (no subject) > > If you do the BLAST's yourself the results could be dramatically worse. > The filtering and polishing done by MAKER is rather significant (direct > BLAST is actually worse with homology searches than many people realize). > > With respect to forks.pm, your admin most likely edited the wrong forks.pm. > There may be more than one on your system. If you let maker install some > prerequisites for you (because it requires a specific version of forks.pm), > it may be in .../maker/perl/lib/forks.pm. Otherwise you have to identify > the exact location of the forks.pm being used. Or if he is editing it as > part of the install tarball, his edits may actually be undone during the > installation procedure. > > Use this command line to identify the location of the forks.pm module > that would have to be edited --> > maker --debug 2>&1 | grep "forks.pm" > > You can even send me a copy of the file once it has been edited, and I can > tell you if it was done correctly. 
> > --Carson > > > > > From: Panos Ioannidis > Date: Monday, July 14, 2014 at 1:20 AM > To: Daniel Ence > Cc: maker-devel > Subject: Re: [maker-devel] (no subject) > > Daniel, thanks for the info. > > Regarding (3), the only reason I think of running BLASTs separately is > because I'm currently not able to run Maker on our cluster due to a problem > in the Perl "forks" library. And it looks like there isn't much I can do > about it; I tried Perlbrew but it doesn't work when I try to install > versions <5.18 (the problem in forks occurs on 5.18 and later versions). > Our admin also tried to change the code in the forks.pm file as per > Carson's suggestion in another thread, but that didn't work either... As a > result I'm running Maker on my workstation (really slooow) till a solution > is found and since BLAST is a time-consuming step I was thinking of running > it separately. > > > On Fri, Jul 11, 2014 at 4:08 PM, Daniel Ence > wrote: > >> Hi Panos, >> >> 1) You'll only use est2genome and protein2genome for creating models that >> will be used for training the ab-initio predictors (like SNAP). Sometimes >> that means one run of MAKER for training; sometimes that means two runs of >> MAKER. You usually don't gain any accuracy after the second round of >> training. It's ok to use both EST and protein data for this training step. >> >> 2) If you're using both ESTs and protein sequence to train your ab-initio >> predictors, then both est2genome and protein2genome should be set to 1. >> >> 3) If you want to pass Blast results to MAKER, you'll need to pass those >> results as GFF3, but MAKER will install and run blast for you, and does a >> good job of keeping track of all those results and making them accessible >> to you in the end, so it's going to be a lot of work to do those blasts on >> your own outside of MAKER. I seriously suggest that you use blast internal >> to maker. 
>> >> Daniel Ence >> Graduate Student >> Eccles Institute of Human Genetics >> University of Utah >> 15 North 2030 East, Room 2100 >> Salt Lake City, UT 84112-5330 >> ------------------------------ >> *From:* maker-devel [maker-devel-bounces at yandell-lab.org] on behalf of >> Panos Ioannidis [panos.ioannidis at gmail.com] >> *Sent:* Friday, July 11, 2014 5:56 AM >> *To:* maker-devel >> *Subject:* [maker-devel] (no subject) >> >> I got back to my annotations this past week and have a couple of >> questions! >> >> 1) Since my organism isn't closely related with any other that's already >> sequenced, I will have to run maker twice (according to the tutorial). So >> for the first run I see that some people use only the ESTs and some others >> use ESTs and a protein database (CEGMA, Uniref50, Swiss-Prot, etc). I guess >> that the ESTs will give better models, but for the cases where genes aren't >> covered by an EST, it's okay to have a protein database to detect them as >> well. Am I right? What do you think? >> >> 2) In case I use both ESTs and a protein database how should I set the >> est2genome and protein2genome parameters in the maker_opts.ctl file? >> Should they both equal to "1"? >> >> 3) I've been thinking of running the BLAST searches separately and giving >> Maker directly the results. I guess that in this case, I'll have to first >> convert the BLAST output to a gff3 file and give it to the protein_gff >> parameter, right? >> >> Thanks, >> Panos >> > > _______________________________________________ maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Tue Jul 15 07:58:20 2014 From: carsonhh at gmail.com (Carson Holt) Date: Tue, 15 Jul 2014 07:58:20 -0600 Subject: [maker-devel] (no subject) In-Reply-To: References: Message-ID: If you are getting a segfault. 
It is more likely an MPI error, especially if you are using OpenMPI or MVAPICH2. They both use OpenFabrics libraries that have bugs on forks and system calls.

If it is OpenMPI, run the following command before launching MAKER -->
export LD_PRELOAD=.../openmpi_location/lib/libmpi.so

Make sure to replace openmpi_location with the location of your OpenMPI.

Also add the following to your MPI command before running MAKER. -->
-mca btl ^openib

Example -->
mpiexec -mca btl ^openib -n 40 maker

If you are using MVAPICH2, then you need to switch to OpenMPI.

--Carson

From: Panos Ioannidis
Date: Tuesday, July 15, 2014 at 12:59 AM
To: Carson Holt
Cc: Daniel Ence, maker-devel
Subject: Re: [maker-devel] (no subject)

I didn't know there are more than one forks.pm files! We'll give it another try later today.

As for the error, it's just "Segmentation fault"! And we know this segfault is because of forks.pm, because if you remove the "use forks;" line script execution continues without segfault (till it crashes later for another reason, of course). In fact, even if you create a script with just the line "use forks;" and try to run it, you'll get a segfault. So it looks like it's something pretty general and serious, and I'm really surprised I can't find anything by googling (except your fix!)...

On Mon, Jul 14, 2014 at 4:49 PM, Carson Holt wrote:
> Also one more question. What is the exact error text you get for the forks
> error? Is it a forks.pm error or is it an MPI warn on fork
> error (which are actually very different).
>
> --Carson
>
> From: Carson Holt
> Date: Monday, July 14, 2014 at 8:46 AM
> To: Panos Ioannidis, Daniel Ence
> Cc: maker-devel
> Subject: Re: [maker-devel] (no subject)
>
> If you do the BLAST's yourself the results could be dramatically worse. The
> filtering and polishing done by MAKER is rather significant (direct BLAST is
> actually worse with homology searches than many people realize).
I guess that in this case, I'll have to first >> convert the BLAST output to a gff3 file and give it to the protein_gff >> parameter, right? >> >> Thanks, >> Panos > > _______________________________________________ maker-devel mailing list > maker-devel at box290.bluehost.comhttp://box290.bluehost.com/mailman/listinfo/mak > er-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From panos.ioannidis at gmail.com Tue Jul 15 08:03:12 2014 From: panos.ioannidis at gmail.com (Panos Ioannidis) Date: Tue, 15 Jul 2014 16:03:12 +0200 Subject: [maker-devel] (no subject) In-Reply-To: References: Message-ID: Carson, many thanks for the info! I haven't installed Maker with MPI support. Is this segfault only occurring when you install it with MPI support? On Tue, Jul 15, 2014 at 3:58 PM, Carson Holt wrote: > If you are getting a segfault. It is more likely an MPI error especially > if you are using OpenMPI or MVAPICH2. They both use OpenFabrics libraries > that have bugs on forks and system calls. > > If it is OpenMPI, run the following command before launching MAKER --> > export LD_PRELOAD=?/openmpi_location/lib/libmpi.so > > Make sure to set replace openmpi_location with the location of your > OpenMPI. > > Also add the following to your MPI command before running MAKER. > --> -mca btl ^openib > Example --> mpiexec -mca btl ^openib -n 40 maker > > > If you are using MVAPICH2, then you need to switch to OpenMPI. > > --Carson > > > > > From: Panos Ioannidis > Date: Tuesday, July 15, 2014 at 12:59 AM > To: Carson Holt > Cc: Daniel Ence , maker-devel < > maker-devel at yandell-lab.org> > > Subject: Re: [maker-devel] (no subject) > > I didn't know there are more than one forks.pm files! We'll give it > another try later today. > > As for the error, it's just "Segmentation fault"! 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From carsonhh at gmail.com Tue Jul 15 08:10:24 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 15 Jul 2014 08:10:24 -0600
Subject: [maker-devel] (no subject)
In-Reply-To: 
References: 
Message-ID: 

If you don't have MPI support, it's not an issue, and your segfault is likely
something else. Your reference to Perl 5.18 and forks.pm should not be a
segfault error either, and would not explain your error. The Perl
5.18/forks.pm problem is a different issue, where Perl actually tells itself
to die because hash reshuffling isn't safe, whereas segfaults are caused by
binary corruption or incorrect memory access (very different issues). I'd
actually recommend a full Perl reinstall if you are getting segfaults,
because it suggests a deeper issue.

--Carson
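One way to tell the two failure modes in this thread apart (Perl dying on its own versus a real segfault) is to look at the exit status of the one-line "use forks;" test mentioned earlier in the thread. The sketch below assumes the usual shell convention that a status of 128 + N means the process was killed by signal N (SIGSEGV is 11 on Linux, giving 139). The function name classify_status is made up for this example.

```shell
#!/bin/sh
# Classify a shell exit status. By shell convention, 128 + N means
# "killed by signal N"; the upper bound of 192 assumes at most 64 signals.
classify_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "ok"
    elif [ "$status" -gt 128 ] && [ "$status" -le 192 ]; then
        echo "killed by signal $((status - 128))"
    else
        echo "exited with error $status"
    fi
}

# Minimal reproduction from the thread: a script that only loads forks.pm.
if command -v perl >/dev/null 2>&1; then
    perl -e 'use forks;' >/dev/null 2>&1
    rc=$?
    echo "use forks: $(classify_status "$rc")"
fi
```

A "killed by signal 11" result would point at a genuine segfault, while an ordinary non-zero exit would suggest Perl stopped by itself, for example because the module is missing or refused to load.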
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From panos.ioannidis at gmail.com Wed Jul 16 06:26:56 2014
From: panos.ioannidis at gmail.com (Panos Ioannidis)
Date: Wed, 16 Jul 2014 14:26:56 +0200
Subject: [maker-devel] (no subject)
In-Reply-To: 
References: 
Message-ID: 

Thanks for the help, guys. All really helpful!

Unfortunately, a full Perl reinstall is not an option at the moment on this
machine, since it's used by others in our group and they need it running...
I'll try to find another solution with our admin.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From carsonhh at gmail.com Wed Jul 16 08:04:55 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 16 Jul 2014 08:04:55 -0600
Subject: [maker-devel] (no subject)
In-Reply-To: 
References: 
Message-ID: 

You don't have to do a system-wide install. It is incredibly easy to have
multiple installations of Perl. Perlbrew, for example, makes it easy to
install and switch between multiple versions rapidly (and doesn't affect the
system install) --> http://perlbrew.pl

You can then test. The Perl installation used by different programs is
determined by the '#!' header in the executable script and not by the
default location of your system's perl (look at the first line in
.../maker/bin/maker and you will see what I mean). This value gets set
during the initial installation, and whatever perl path you use to run
MAKER's Build.PL script will end up being the one used to run MAKER, even if
the system perl is different.

--Carson
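The '#!' check described in this message can be sketched as a small helper. The function name shebang_interpreter and the variable MAKER_BIN are made up for this example. The Perl version in the comments is only an illustration, and the MAKER path is left for you to fill in because the thread elides it. The commented perlbrew and Build.PL steps are an assumption about a typical workflow, not a tested recipe.

```shell
#!/bin/sh
# Print the interpreter path from the '#!' line of a script (empty if none).
# For a line like "#!/usr/bin/env perl" this prints "/usr/bin/env".
shebang_interpreter() {
    head -n 1 "$1" | sed -n 's/^#![[:space:]]*\([^[:space:]]*\).*/\1/p'
}

# Assumed workflow (commented out; the version number is only an example):
#   perlbrew install perl-5.16.3
#   perlbrew use perl-5.16.3
#   perl Build.PL        # run from MAKER's source directory with the new perl

# MAKER_BIN is a placeholder; point it at your own .../maker/bin/maker.
MAKER_BIN="${MAKER_BIN:-}"
if [ -n "$MAKER_BIN" ] && [ -r "$MAKER_BIN" ]; then
    echo "MAKER runs under: $(shebang_interpreter "$MAKER_BIN")"
fi
```

If the printed path is not the Perl you intended, that would suggest Build.PL was run with a different perl than the one you meant to use.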
Your reference to perl 5.18 and forks.pm > should not be a segfault error either, and would not > represent your error. The Perl 5.18/forks.pm is a different > issue where perl actually tells itself to die because hash reshuffling isn't > safe whereas segfaults are causes by binary corruption or incorrect memory > access issues (very different issues). I'd actually recommend a full perl > reinstall if you are getting segfaults, because it suggests a deeper issue. > > --Carson > > > From: Panos Ioannidis > Date: Tuesday, July 15, 2014 at 8:03 AM > > To: Carson Holt > Cc: Daniel Ence , maker-devel > > Subject: Re: [maker-devel] (no subject) > > Carson, many thanks for the info! > > I haven't installed Maker with MPI support. Is this segfault only occurring > when you install it with MPI support? > > > On Tue, Jul 15, 2014 at 3:58 PM, Carson Holt wrote: >> If you are getting a segfault. It is more likely an MPI error especially if >> you are using OpenMPI or MVAPICH2. They both use OpenFabrics libraries that >> have bugs on forks and system calls. >> >> If it is OpenMPI, run the following command before launching MAKER --> >> export LD_PRELOAD=?/openmpi_location/lib/libmpi.so >> >> Make sure to set replace openmpi_location with the location of your OpenMPI. >> >> Also add the following to your MPI command before running MAKER. >> --> -mca btl ^openib >> Example --> mpiexec -mca btl ^openib -n 40 maker >> >> >> If you are using MVAPICH2, then you need to switch to OpenMPI. >> >> --Carson >> >> >> >> >> From: Panos Ioannidis >> Date: Tuesday, July 15, 2014 at 12:59 AM >> To: Carson Holt >> Cc: Daniel Ence , maker-devel >> >> >> Subject: Re: [maker-devel] (no subject) >> >> I didn't know there are more than one forks.pm files! >> We'll give it another try later today. >> >> As for the error, it's just "Segmentation fault"! 
And we know this segfault >> is because of forks.pm , because if you remove the "use >> forks;" line script execution continues without segfault (till it crashes >> later for another reason, of course). In fact, even if you create a script >> with just the line "use forks;" and try to run it, you'll get a segfault. So >> it looks like it's something pretty general and serious, and I'm really >> surprised I can't find anything by googling (except your fix!)... >> >> >> On Mon, Jul 14, 2014 at 4:49 PM, Carson Holt wrote: >>> Also one more question. What is the exact error text you get for the forks >>> error? Is it a forks.pm error or is it an MPI warn on >>> fork error (which are actually very different). >>> >>> --Carson >>> >>> >>> From: Carson Holt >>> Date: Monday, July 14, 2014 at 8:46 AM >>> To: Panos Ioannidis , Daniel Ence >>> >>> >>> Cc: maker-devel >>> Subject: Re: [maker-devel] (no subject) >>> >>> If you do the BLAST's yourself the results could be dramatically worse. The >>> filtering and polishing done by MAKER is rather significant (direct BLAST is >>> actually worse with homology searches than many people realize). >>> >>> With respect to forks.pm , your admin most likely edited >>> the wrong forks.pm . There may be more than one on your >>> system. If you let maker install some prerequisites for you (because it >>> requires a specific version of forks.pm ), it may be in >>> .../maker/perl/lib/forks.pm . Otherwise you have to >>> identify the exact location of the forks.pm being used. >>> Or if he is editing it as part of the install tarball, his edits may >>> actually be undone during the installation procedure. >>> >>> Use this command line to identify the location of the forks.pm >>> module that would have to be edited --> >>> maker --debug 2>&1 | grep "forks.pm " >>> >>> You can even send me a copy of the file once it has been edited, and I can >>> tell you if it was done correctly. 
>>> >>> --Carson >>> >>> >>> >>> >>> From: Panos Ioannidis >>> Date: Monday, July 14, 2014 at 1:20 AM >>> To: Daniel Ence >>> Cc: maker-devel >>> Subject: Re: [maker-devel] (no subject) >>> >>> Daniel, thanks for the info. >>> >>> Regarding (3), the only reason I think of running BLASTs separately is >>> because I'm currently not able to run Maker on our cluster due to a problem >>> in the Perl "forks" library. And it looks like there isn't much I can do >>> about it; I tried Perlbrew but it doesn't work when I try to install >>> versions <5.18 (the problem in forks occurs on 5.18 and later versions). Our >>> admin also tried to change the code in the forks.pm file >>> as per Carson's suggestion in another thread, but that didn't work either... >>> As a result I'm running Maker on my workstation (really slooow) till a >>> solution is found and since BLAST is a time-consuming step I was thinking of >>> running it separately. >>> >>> >>> On Fri, Jul 11, 2014 at 4:08 PM, Daniel Ence >>> wrote: >>>> Hi Panos, >>>> >>>> 1) You'll only use est2genome and protein2genome for creating models that >>>> will be used for training the ab-initio predictors (like SNAP). Sometimes >>>> that means one run of MAKER for training; sometimes that means two runs of >>>> MAKER. You usually don't gain any accuracy after the second round of >>>> training. It's ok to use both EST and protein data for this training step. >>>> >>>> 2) If you're using both ESTs and protein sequence to train your ab-initio >>>> predictors, then both est2genome and protein2genome should be set to 1. >>>> >>>> 3) If you want to pass Blast results to MAKER, you'll need to pass those >>>> results as GFF3, but MAKER will install and run blast for you, and does a >>>> good job of keeping track of all those results and making them accessible >>>> to you in the end, so it's going to be a lot of work to do those blasts on >>>> your own outside of MAKER. I seriously suggest that you use blast internal >>>> to maker. 
>>>> >>>> Daniel Ence >>>> Graduate Student >>>> Eccles Institute of Human Genetics >>>> University of Utah >>>> 15 North 2030 East, Room 2100 >>>> Salt Lake City, UT 84112-5330 >>>> >>>> From: maker-devel [maker-devel-bounces at yandell-lab.org] on behalf of Panos >>>> Ioannidis [panos.ioannidis at gmail.com] >>>> Sent: Friday, July 11, 2014 5:56 AM >>>> To: maker-devel >>>> Subject: [maker-devel] (no subject) >>>> >>>> I got back to my annotations this past week and have a couple of questions! >>>> >>>> 1) Since my organism isn't closely related with any other that's already >>>> sequenced, I will have to run maker twice (according to the tutorial). So >>>> for the first run I see that some people use only the ESTs and some others >>>> use ESTs and a protein database (CEGMA, Uniref50, Swiss-Prot, etc). I guess >>>> that the ESTs will give better models, but for the cases where genes aren't >>>> covered by an EST, it's okay to have a protein database to detect them as >>>> well. Am I right? What do you think? >>>> >>>> 2) In case I use both ESTs and a protein database how should I set the >>>> est2genome and protein2genome parameters in the maker_opts.ctl file? Should >>>> they both equal to "1"? >>>> >>>> 3) I've been thinking of running the BLAST searches separately and giving >>>> Maker directly the results. I guess that in this case, I'll have to first >>>> convert the BLAST output to a gff3 file and give it to the protein_gff >>>> parameter, right? >>>> >>>> Thanks, >>>> Panos >>> >>> _______________________________________________ maker-devel mailing list >>> maker-devel at box290.bluehost.comhttp://box290.bluehost.com/mailman/listinfo/m >>> aker-devel_yandell-lab.org >> > -------------- next part -------------- An HTML attachment was scrubbed... 
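For readers following the thread above, the OpenMPI workaround Carson describes (preload libmpi.so and exclude the openib transport) can be sketched as a small wrapper script. This is only an illustration under stated assumptions: the OPENMPI_PREFIX, NPROCS, and RUN variables and the preload_path and build_maker_mpi_cmd functions are hypothetical names that do not appear in the thread, and /opt/openmpi is a placeholder for wherever OpenMPI lives on your system. By default the script only prints what it would run, so the lines can be checked before anything is launched.

```shell
#!/bin/sh
# Sketch of the OpenMPI workaround from this thread (not an official MAKER script).
# OPENMPI_PREFIX is a placeholder; point it at your own OpenMPI install.
OPENMPI_PREFIX="${OPENMPI_PREFIX:-/opt/openmpi}"
NPROCS="${NPROCS:-40}"

# Path of the library to preload, given an OpenMPI install prefix.
preload_path() {
    printf '%s/lib/libmpi.so\n' "$1"
}

# Launch line with the openib transport excluded, given a process count.
build_maker_mpi_cmd() {
    printf 'mpiexec -mca btl ^openib -n %s maker\n' "$1"
}

# Show what would be run.
echo "export LD_PRELOAD=$(preload_path "$OPENMPI_PREFIX")"
build_maker_mpi_cmd "$NPROCS"

# Only launch when explicitly requested, e.g. RUN=1 sh run_maker_mpi.sh
if [ "${RUN:-0}" = "1" ]; then
    LD_PRELOAD=$(preload_path "$OPENMPI_PREFIX")
    export LD_PRELOAD
    mpiexec -mca btl '^openib' -n "$NPROCS" maker
fi
```

The caret is quoted in the real invocation because some shells treat an unquoted ^ specially; the thread itself shows it unquoted. Note that this workaround targets MPI-related segfaults only and, per the thread, is unrelated to the Perl 5.18 forks.pm issue.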
From nguyenan at mail.nih.gov Wed Jul 16 11:15:10 2014 From: nguyenan at mail.nih.gov (Nguyen, Anh-Dao (NIH/NHGRI) [C]) Date: Wed, 16 Jul 2014 17:15:10 +0000 Subject: [maker-devel] Maker_opts.ctl Message-ID: Hi, I would like to conduct a genome annotation and have the following data: - Two separate RepeatMasker outputs (using -lib and -species options) - ESTs and RACE (fasta) - proteins (fasta) - proteins of related organisms (fasta) - SNAP's .hmm file (ran CEGMA, then used cegma2zff.pl to convert to ZFF format, etc.) - GeneMark's .hmm file (es.mod file from running gm_es.pl) - FGENESH++ and Augustus gene predictions. I wrote scripts to convert the outputs to .gff3 files. I ran the Augustus gene prediction separately because Augustus has never been trained for this genome. - Cufflinks and Trinity from RNA-Seq Could you please let me know how I can specify parameters in the maker_opts.ctl file? Or do you have other suggestions to re-do the data listed above? Thanks. Anh-Dao From dence at genetics.utah.edu Wed Jul 16 12:13:46 2014 From: dence at genetics.utah.edu (Daniel Ence) Date: Wed, 16 Jul 2014 12:13:46 -0600 Subject: [maker-devel] Maker_opts.ctl In-Reply-To: References: Message-ID: <5B00AEBB-D242-4C7C-A40F-C1BF3EC48C96@genetics.utah.edu> Hi Anh-Dao, In the maker_opts.ctl file, there are options for est and protein evidence. You'll put all of your fasta est files together in a comma-separated list in the "est" option, and all of your fasta protein files in a comma-separated list for the "protein" option. You'll specify the SNAP and Genemark files in their respective options in the control file and pass the augustus and fgenesh predictions in the "pred_gff" option. If you have the RepeatMasker output in gff3 format you can give it to maker with the "rm_gff" option. If you've converted the cufflinks output to gff3, you can give it to maker with the "est_gff" option.
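As a rough illustration of the options mentioned so far, the snippet below fills them into a control file with a small helper. Everything outside the option names is an assumption: the file names (ests.fasta, cufflinks.gff3, and so on), the set_opt helper, and the stand-in control file written to a temporary directory are hypothetical. A real maker_opts.ctl is generated by MAKER and contains many more options, so the "key=value #comment" layout assumed here should be checked against your own copy.

```shell
#!/bin/sh
# Sketch only: edit "key=value #comment" lines in a MAKER control file.
# set_opt is a hypothetical helper, not part of MAKER.
set_opt() {
    # $1 = control file, $2 = option name, $3 = new value
    tmp="$1.tmp"
    sed "s|^$2=[^#]*|$2=$3 |" "$1" > "$tmp" && mv "$tmp" "$1"
}

# Stand-in control file so the sketch is self-contained;
# in practice you would edit the maker_opts.ctl that MAKER generates.
workdir=$(mktemp -d)
ctl="$workdir/maker_opts.ctl"
cat > "$ctl" <<'EOF'
est= #EST/cDNA sequence files in fasta format
est_gff= #aligned ESTs or transcripts in GFF3 format
protein= #protein sequence files in fasta format
rm_gff= #pre-identified repeat elements in GFF3 format
pred_gff= #ab-initio predictions in GFF3 format
EOF

# Placeholder file names; lists are comma-separated with no spaces.
set_opt "$ctl" est "ests.fasta,race.fasta,trinity.fasta"
set_opt "$ctl" est_gff "cufflinks.gff3"
set_opt "$ctl" protein "proteins.fasta,related_proteins.fasta"
set_opt "$ctl" rm_gff "rm_lib.gff3,rm_species.gff3"
set_opt "$ctl" pred_gff "fgenesh.gff3,augustus.gff3"

cat "$ctl"
```

The helper replaces only the value part of a line and leaves the trailing comment in place. Whether rm_gff accepts a comma-separated list is questioned and then confirmed by Carson later in this thread; the SNAP and GeneMark model files go in their own options, which are not named in the thread and are left out here.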
I'm pretty sure Trinity only gives fasta output, so you would put that in the "est" option, along with all the other est fasta files. If Augustus isn't trained for your particular organism, then you can use another organism that augustus is already trained for. The list of species that augustus has parameter files for is in the README.txt that came with Augustus. I really recommend that you run Augustus from inside maker, because then you get all the benefits of maker passing ext-based hints to augustus at runtime, which can really improve Augustus' predictive ability. When you ran the augustus gene prediction separately, did you use another organism's parameter file? Thanks, Daniel On Jul 16, 2014, at 11:15 AM, Nguyen, Anh-Dao (NIH/NHGRI) [C] wrote: > Hi, > > I would like to conduct a genome annotation and have the following data: > - Two separate RepeatMasker outputs (using -lib and -species options) > - ESTs and RACE (fasta) > - proteins (fasta) > - proteins of related organisms (fasta) > - SNAP's .hmm file (ran CEGMA, then used cegma2zff.pl to convert to ZFF format, etc. ) > - GeneMark's .hmm file (es.mod file from running gm_es.pl) > - FGENESH++ and Augustus gene predictions. I wrote scripts to convert the outputs to .gff3 files. The reason why I ran Augustus gene prediction separately, because the genome has never been trained for Augustus. > - Cufflinks and Trinity from RNA-Seq > > Could you please let me know how can I specify parameters in the maker_opts.ctl file? > Or do you have other suggestions to re-do the data listed above? > > Thanks.
> Anh-Dao > > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org From nguyenan at mail.nih.gov Wed Jul 16 12:30:10 2014 From: nguyenan at mail.nih.gov (Nguyen, Anh-Dao (NIH/NHGRI) [C]) Date: Wed, 16 Jul 2014 18:30:10 +0000 Subject: [maker-devel] Maker_opts.ctl In-Reply-To: <5B00AEBB-D242-4C7C-A40F-C1BF3EC48C96@genetics.utah.edu> References: <5B00AEBB-D242-4C7C-A40F-C1BF3EC48C96@genetics.utah.edu> Message-ID: Thanks Daniel for your quick response. I did not use the parameter file of another organism when running Augustus. I created the parameter file for the genome following their instructions. There were multiple steps to train and run Augustus (Creating gene structures for training AUGUSTUS with CEGMA => parameter file will be created; Creating Hints for AUGUSTUS from ESTs/cDNA sequences; Incorporating Illumina RNAseq into AUGUSTUS with GSNAP, etc.) As I mentioned, I ran Augustus separately because Augustus had not been trained for that genome (no parameter file existed). Otherwise I would run Augustus inside MAKER. You suggested using the rm_gff option to specify the RepeatMasker output (sure, I will convert it to .gff3 formatted files). Can I submit two RM .gff3 files, separated by a comma? Anh-Dao On 7/16/14 2:13 PM, "Daniel Ence" wrote: >Hi Anh-Dao, > >In the maker_opts.ctl file, there are options for est and protein >evidence. You'll put all of your fasta est files together in a comma >separated list in the "est" option, and all of your fasta protein files >in a comma-separated list for the "protein" option. > >You'll specify the SNAP and Genemark files in their respective options in >the control file and pass the augustus and fgenesh predictions in the >"pred_gff" option. > >If you have the RepeatMasker output in gff3 format you can give it to >maker with the "rm_gff" option.
> >If you?ve converted the cufflinks output to gff3, you can give it to >maker with the ?est_gff? option. I?m pretty sure Trinity only gives fasta >output, so you would put that in the ?est? option, along with all the >other est fasta files. > >If Augustus isn?t trained for your particular organism, then you can use >another organism that augustus is already trained for. The list of >species that augustus has parameter files for is in the README.txt that >came with Augustus. I really recommend that you run Augustus from inside >maker, because then you get all the benefits of maker passing ext-based >hints to augustus at runtime, which can really improve Augustus? >predictive ability. > >When you ran the augustus gene prediction separately, did you use another >organism?s parameter file? > >Thanks, >Daniel > > >On Jul 16, 2014, at 11:15 AM, Nguyen, Anh-Dao (NIH/NHGRI) [C] > wrote: > >> Hi, >> >> I would like to conduct a genome annotation and have the following data: >> - Two separate RepeatMasker outputs (using -lib and -species options) >> - ESTs and RACE (fasta) >> - proteins (fasta) >> - proteins of related organisms (fasta) >> - SNAP's .hmm file (ran CEGMA, then used cegma2zff.pl to convert to ZFF >>format, etc. ) >> - GeneMark's .hmm file (es.mod file from running gm_es.pl) >> - FGENESH++ and Augustus gene predictions. I wrote scripts to convert >>the outputs to .gff3 files. The reason why I ran Augustus gene >>prediction separately, because the genome has never been trained for >>Augustus. >> - Cufflinks and Trinity from RNA-Seq >> >> Could you please let me know how can I specify parameters in the >>maker_opts.ctl file? >> Or do you have other suggestions to re-do the data listed above? >> >> Thanks. 
>> Anh-Dao >> >> _______________________________________________ >> maker-devel mailing list >> maker-devel at box290.bluehost.com >> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > From carsonhh at gmail.com Wed Jul 16 12:36:57 2014 From: carsonhh at gmail.com (Carson Holt) Date: Wed, 16 Jul 2014 12:36:57 -0600 Subject: [maker-devel] Maker_opts.ctl In-Reply-To: References: <5B00AEBB-D242-4C7C-A40F-C1BF3EC48C96@genetics.utah.edu> Message-ID: When you ran Augustus separately, it should have created the parameters needed to run it. Now you should be able to run it inside of MAKER using the species name you just created. I'd also recommend letting MAKER run RepeatMasker for you rather than giving it the results as GFF3. --Carson On 7/16/14, 12:30 PM, "Nguyen, Anh-Dao (NIH/NHGRI) [C]" wrote: >Thanks Daniel for your quick response. > >I did not use the parameter file of other organism when running Augustus. >I created the parameter file for the genome following their instructions. >There were multiple steps to train and run Augustus (Creating gene >structures for training AUGUSTUS with CEGMA => parameter file will be >created; Creating Hints for AUGUSTUS from ESTs/cDNA sequences; >Incorporating Illumina RNAseq into AUGUSTUS with GSNAP, etc.) >As I mentioned the reason why I ran Augustus separately, because Augustus >has not trained that genome (no parameter file exists). Otherwise I would >run Augustus inside MAKER. > >You suggested to use rm_gff option to specify RepeatMasker output (sure I >will convert them to .gff3 formatted files). Can I submit two RM .gff3 >files, separated by comma? > >Anh-Dao > > >On 7/16/14 2:13 PM, "Daniel Ence" wrote: > >>Hi Anh-Dao, >> >>In the maker_opts.ctl file, there are options for est and protein >>evidence. You'll put all of your fasta est files together in a comma >>separated list in the "est" option, and all of your fasta protein files >>in a comma-separated list for the "protein" option.
>> >>You?ll specify the SNAP and Genemark files in their respective options in >>the control file and pass the augustus and fgenesh predictions in the >>?pred_gff? option. >> >>If you have the RepeatMasker output in gff3 format you can give it to >>maker with the ?rm_gff? option. >> >>If you?ve converted the cufflinks output to gff3, you can give it to >>maker with the ?est_gff? option. I?m pretty sure Trinity only gives fasta >>output, so you would put that in the ?est? option, along with all the >>other est fasta files. >> >>If Augustus isn?t trained for your particular organism, then you can use >>another organism that augustus is already trained for. The list of >>species that augustus has parameter files for is in the README.txt that >>came with Augustus. I really recommend that you run Augustus from inside >>maker, because then you get all the benefits of maker passing ext-based >>hints to augustus at runtime, which can really improve Augustus? >>predictive ability. >> >>When you ran the augustus gene prediction separately, did you use another >>organism?s parameter file? >> >>Thanks, >>Daniel >> >> >>On Jul 16, 2014, at 11:15 AM, Nguyen, Anh-Dao (NIH/NHGRI) [C] >> wrote: >> >>> Hi, >>> >>> I would like to conduct a genome annotation and have the following >>>data: >>> - Two separate RepeatMasker outputs (using -lib and -species options) >>> - ESTs and RACE (fasta) >>> - proteins (fasta) >>> - proteins of related organisms (fasta) >>> - SNAP's .hmm file (ran CEGMA, then used cegma2zff.pl to convert to ZFF >>>format, etc. ) >>> - GeneMark's .hmm file (es.mod file from running gm_es.pl) >>> - FGENESH++ and Augustus gene predictions. I wrote scripts to convert >>>the outputs to .gff3 files. The reason why I ran Augustus gene >>>prediction separately, because the genome has never been trained for >>>Augustus. >>> - Cufflinks and Trinity from RNA-Seq >>> >>> Could you please let me know how can I specify parameters in the >>>maker_opts.ctl file? 
>>> Or do you have other suggestions to re-do the data listed above? >>> >>> Thanks. >>> Anh-Dao >>> >>> _______________________________________________ >>> maker-devel mailing list >>> maker-devel at box290.bluehost.com >>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org >> > > >_______________________________________________ >maker-devel mailing list >maker-devel at box290.bluehost.com >http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org From carsonhh at gmail.com Wed Jul 16 12:41:47 2014 From: carsonhh at gmail.com (Carson Holt) Date: Wed, 16 Jul 2014 12:41:47 -0600 Subject: [maker-devel] Maker_opts.ctl In-Reply-To: References: <5B00AEBB-D242-4C7C-A40F-C1BF3EC48C96@genetics.utah.edu> Message-ID: If you can provide me the command lines you used to train augustus, I can point you to the proper species parameters to give to MAKER. Normally these are the same as one of the directory names under .../augustus/config/species/. You can also let MAKER run FGENESH for you. Either way you can pass it in as GFF3, but if you let MAKER run it for you then MAKER can "talk" to the predictor by giving it evidence-based hints as it is running. This improves the overall performance of the algorithm compared to running it outside of MAKER. Thanks, Carson On 7/16/14, 12:36 PM, "Carson Holt" wrote: >When you ran Augustus separately, it should have created the parameters >needed to run it. Now you should be able to run it inside of MAKER using >the species name you just created.
> >I'd also recommend letting MAKER run RepeatMasker for you rather than >giving it the results as GFF3. > >--Carson > > >On 7/16/14, 12:30 PM, "Nguyen, Anh-Dao (NIH/NHGRI) [C]" > wrote: > >>Thanks Daniel for your quick response. >> >>I did not use the parameter file of other organism when running Augustus. >>I created the parameter file for the genome following their instructions. >>There were multiple steps to train and run Augustus (Creating gene >>structures for training AUGUSTUS with CEGMA => parameter file will be >>created; Creating Hints for AUGUSTUS from ESTs/cDNA sequences; >>Incorporating Illumina RNAseq into AUGUSTUS with GSNAP, etc.) >>As I mentioned the reason why I ran Augustus separately, because Augustus >>has not trained that genome (no parameter file exists). Otherwise I would >>run Augustus inside MAKER. >> >>You suggested to use rm_gff option to specify RepeatMasker output (sure I >>will convert them to .gff3 formatted files). Can I submit two RM .gff3 >>files, separated by comma? >> >>Anh-Dao >> >> >>On 7/16/14 2:13 PM, "Daniel Ence" wrote: >> >>>Hi Anh-Dao, >>> >>>In the maker_opts.ctl file, there are options for est and protein >>>evidence. You?ll put all of your fasta est files together in a command >>>separated list in the ?est" option, and all of your fasta protein files >>>in a command separated list for the ?protein? option. >>> >>>You?ll specify the SNAP and Genemark files in their respective options >>>in >>>the control file and pass the augustus and fgenesh predictions in the >>>?pred_gff? option. >>> >>>If you have the RepeatMasker output in gff3 format you can give it to >>>maker with the ?rm_gff? option. >>> >>>If you?ve converted the cufflinks output to gff3, you can give it to >>>maker with the ?est_gff? option. I?m pretty sure Trinity only gives >>>fasta >>>output, so you would put that in the ?est? option, along with all the >>>other est fasta files. 
>>> >>>If Augustus isn?t trained for your particular organism, then you can use >>>another organism that augustus is already trained for. The list of >>>species that augustus has parameter files for is in the README.txt that >>>came with Augustus. I really recommend that you run Augustus from inside >>>maker, because then you get all the benefits of maker passing ext-based >>>hints to augustus at runtime, which can really improve Augustus? >>>predictive ability. >>> >>>When you ran the augustus gene prediction separately, did you use >>>another >>>organism?s parameter file? >>> >>>Thanks, >>>Daniel >>> >>> >>>On Jul 16, 2014, at 11:15 AM, Nguyen, Anh-Dao (NIH/NHGRI) [C] >>> wrote: >>> >>>> Hi, >>>> >>>> I would like to conduct a genome annotation and have the following >>>>data: >>>> - Two separate RepeatMasker outputs (using -lib and -species options) >>>> - ESTs and RACE (fasta) >>>> - proteins (fasta) >>>> - proteins of related organisms (fasta) >>>> - SNAP's .hmm file (ran CEGMA, then used cegma2zff.pl to convert to >>>>ZFF >>>>format, etc. ) >>>> - GeneMark's .hmm file (es.mod file from running gm_es.pl) >>>> - FGENESH++ and Augustus gene predictions. I wrote scripts to convert >>>>the outputs to .gff3 files. The reason why I ran Augustus gene >>>>prediction separately, because the genome has never been trained for >>>>Augustus. >>>> - Cufflinks and Trinity from RNA-Seq >>>> >>>> Could you please let me know how can I specify parameters in the >>>>maker_opts.ctl file? >>>> Or do you have other suggestions to re-do the data listed above? >>>> >>>> Thanks. 
>>>> Anh-Dao >>>> >>>> _______________________________________________ >>>> maker-devel mailing list >>>> maker-devel at box290.bluehost.com >>>> >>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org >>> >> >> >>_______________________________________________ >>maker-devel mailing list >>maker-devel at box290.bluehost.com >>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > From dence at genetics.utah.edu Wed Jul 16 12:42:16 2014 From: dence at genetics.utah.edu (Daniel Ence) Date: Wed, 16 Jul 2014 12:42:16 -0600 Subject: [maker-devel] Maker_opts.ctl In-Reply-To: References: <5B00AEBB-D242-4C7C-A40F-C1BF3EC48C96@genetics.utah.edu> Message-ID: Hi Anh-Dao, so as I understand it, the process of training and running augustus will create a set of "param" files that Augustus can use later on. If that's true, then you can just copy those files to the "config/species" folder of your augustus installation and then augustus (when you call it from inside maker) can use those parameters when it runs. Did you end up with a gff3 file or with files like "exon_prob", "utr_probs" from augustus? Or did you have both? I'm pretty sure that you can't use a comma-separated list for the rm_gff. You could concatenate the two files and then pass the one file to maker, but you also might need to have it sorted by genomic location. Carson could confirm that for me. ~Daniel On Jul 16, 2014, at 12:30 PM, Nguyen, Anh-Dao (NIH/NHGRI) [C] wrote: > Thanks Daniel for your quick response. > > I did not use the parameter file of other organism when running Augustus. > I created the parameter file for the genome following their instructions. > There were multiple steps to train and run Augustus (Creating gene > structures for training AUGUSTUS with CEGMA => parameter file will be > created; Creating Hints for AUGUSTUS from ESTs/cDNA sequences; > Incorporating Illumina RNAseq into AUGUSTUS with GSNAP, etc.)
> As I mentioned the reason why I ran Augustus separately, because Augustus > has not trained that genome (no parameter file exists). Otherwise I would > run Augustus inside MAKER. > > You suggested to use rm_gff option to specify RepeatMasker output (sure I > will convert them to .gff3 formatted files). Can I submit two RM .gff3 > files, separated by comma? > > Anh-Dao > > > On 7/16/14 2:13 PM, "Daniel Ence" wrote: > >> Hi Anh-Dao, >> >> In the maker_opts.ctl file, there are options for est and protein >> evidence. You?ll put all of your fasta est files together in a command >> separated list in the ?est" option, and all of your fasta protein files >> in a command separated list for the ?protein? option. >> >> You?ll specify the SNAP and Genemark files in their respective options in >> the control file and pass the augustus and fgenesh predictions in the >> ?pred_gff? option. >> >> If you have the RepeatMasker output in gff3 format you can give it to >> maker with the ?rm_gff? option. >> >> If you?ve converted the cufflinks output to gff3, you can give it to >> maker with the ?est_gff? option. I?m pretty sure Trinity only gives fasta >> output, so you would put that in the ?est? option, along with all the >> other est fasta files. >> >> If Augustus isn?t trained for your particular organism, then you can use >> another organism that augustus is already trained for. The list of >> species that augustus has parameter files for is in the README.txt that >> came with Augustus. I really recommend that you run Augustus from inside >> maker, because then you get all the benefits of maker passing ext-based >> hints to augustus at runtime, which can really improve Augustus? >> predictive ability. >> >> When you ran the augustus gene prediction separately, did you use another >> organism?s parameter file? 
>> >> Thanks, >> Daniel >> >> >> On Jul 16, 2014, at 11:15 AM, Nguyen, Anh-Dao (NIH/NHGRI) [C] >> wrote: >> >>> Hi, >>> >>> I would like to conduct a genome annotation and have the following data: >>> - Two separate RepeatMasker outputs (using -lib and -species options) >>> - ESTs and RACE (fasta) >>> - proteins (fasta) >>> - proteins of related organisms (fasta) >>> - SNAP's .hmm file (ran CEGMA, then used cegma2zff.pl to convert to ZFF >>> format, etc. ) >>> - GeneMark's .hmm file (es.mod file from running gm_es.pl) >>> - FGENESH++ and Augustus gene predictions. I wrote scripts to convert >>> the outputs to .gff3 files. The reason why I ran Augustus gene >>> prediction separately, because the genome has never been trained for >>> Augustus. >>> - Cufflinks and Trinity from RNA-Seq >>> >>> Could you please let me know how can I specify parameters in the >>> maker_opts.ctl file? >>> Or do you have other suggestions to re-do the data listed above? >>> >>> Thanks. >>> Anh-Dao >>> >>> _______________________________________________ >>> maker-devel mailing list >>> maker-devel at box290.bluehost.com >>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org >> > From carsonhh at gmail.com Wed Jul 16 12:43:33 2014 From: carsonhh at gmail.com (Carson Holt) Date: Wed, 16 Jul 2014 12:43:33 -0600 Subject: [maker-devel] Maker_opts.ctl In-Reply-To: References: <5B00AEBB-D242-4C7C-A40F-C1BF3EC48C96@genetics.utah.edu> Message-ID: You can use comma-separated lists for rm_gff. --Carson On 7/16/14, 12:42 PM, "Daniel Ence" wrote: >Hi Anh-Dao, so as I understand it, the process of training and running >augustus will create a set of "param" file that Augustus can use later >on. If that's true, then you can just copy those files to the >"config/species" folder of your augustus installation and then augustus >(when you call it from inside maker) can use those parameters when it >runs. > >Did you end up with a gff3 file or with files like "exon_prob", >"utr_probs" from augustus?
Or did you have both? > >I?m pretty sure that you can?t use a comma-separated list for the rm_gff. >You could concatenate the two files and then pass the one file to maker, >but you also might need to have it sorted by genomic location. Carson >could confirm that for me. > >~Daniel > > >On Jul 16, 2014, at 12:30 PM, Nguyen, Anh-Dao (NIH/NHGRI) [C] > wrote: > >> Thanks Daniel for your quick response. >> >> I did not use the parameter file of other organism when running >>Augustus. >> I created the parameter file for the genome following their >>instructions. >> There were multiple steps to train and run Augustus (Creating gene >> structures for training AUGUSTUS with CEGMA => parameter file will be >> created; Creating Hints for AUGUSTUS from ESTs/cDNA sequences; >> Incorporating Illumina RNAseq into AUGUSTUS with GSNAP, etc.) >> As I mentioned the reason why I ran Augustus separately, because >>Augustus >> has not trained that genome (no parameter file exists). Otherwise I >>would >> run Augustus inside MAKER. >> >> You suggested to use rm_gff option to specify RepeatMasker output (sure >>I >> will convert them to .gff3 formatted files). Can I submit two RM .gff3 >> files, separated by comma? >> >> Anh-Dao >> >> >> On 7/16/14 2:13 PM, "Daniel Ence" wrote: >> >>> Hi Anh-Dao, >>> >>> In the maker_opts.ctl file, there are options for est and protein >>> evidence. You?ll put all of your fasta est files together in a command >>> separated list in the ?est" option, and all of your fasta protein files >>> in a command separated list for the ?protein? option. >>> >>> You?ll specify the SNAP and Genemark files in their respective options >>>in >>> the control file and pass the augustus and fgenesh predictions in the >>> ?pred_gff? option. >>> >>> If you have the RepeatMasker output in gff3 format you can give it to >>> maker with the ?rm_gff? option. >>> >>> If you?ve converted the cufflinks output to gff3, you can give it to >>> maker with the ?est_gff? option. 
I?m pretty sure Trinity only gives >>>fasta >>> output, so you would put that in the ?est? option, along with all the >>> other est fasta files. >>> >>> If Augustus isn?t trained for your particular organism, then you can >>>use >>> another organism that augustus is already trained for. The list of >>> species that augustus has parameter files for is in the README.txt that >>> came with Augustus. I really recommend that you run Augustus from >>>inside >>> maker, because then you get all the benefits of maker passing ext-based >>> hints to augustus at runtime, which can really improve Augustus? >>> predictive ability. >>> >>> When you ran the augustus gene prediction separately, did you use >>>another >>> organism?s parameter file? >>> >>> Thanks, >>> Daniel >>> >>> >>> On Jul 16, 2014, at 11:15 AM, Nguyen, Anh-Dao (NIH/NHGRI) [C] >>> wrote: >>> >>>> Hi, >>>> >>>> I would like to conduct a genome annotation and have the following >>>>data: >>>> - Two separate RepeatMasker outputs (using -lib and -species options) >>>> - ESTs and RACE (fasta) >>>> - proteins (fasta) >>>> - proteins of related organisms (fasta) >>>> - SNAP's .hmm file (ran CEGMA, then used cegma2zff.pl to convert to >>>>ZFF >>>> format, etc. ) >>>> - GeneMark's .hmm file (es.mod file from running gm_es.pl) >>>> - FGENESH++ and Augustus gene predictions. I wrote scripts to convert >>>> the outputs to .gff3 files. The reason why I ran Augustus gene >>>> prediction separately, because the genome has never been trained for >>>> Augustus. >>>> - Cufflinks and Trinity from RNA-Seq >>>> >>>> Could you please let me know how can I specify parameters in the >>>> maker_opts.ctl file? >>>> Or do you have other suggestions to re-do the data listed above? >>>> >>>> Thanks. 
>>>> Anh-Dao
>>>>
>>>> _______________________________________________
>>>> maker-devel mailing list
>>>> maker-devel at box290.bluehost.com
>>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>
>_______________________________________________
>maker-devel mailing list
>maker-devel at box290.bluehost.com
>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From nguyenan at mail.nih.gov Wed Jul 16 13:07:45 2014
From: nguyenan at mail.nih.gov (Nguyen, Anh-Dao (NIH/NHGRI) [C])
Date: Wed, 16 Jul 2014 19:07:45 +0000
Subject: [maker-devel] Maker_opts.ctl
In-Reply-To:
References: <5B00AEBB-D242-4C7C-A40F-C1BF3EC48C96@genetics.utah.edu>
Message-ID:

I will run Augustus and FGENESH++ inside of MAKER using the parameter files for Augustus.
I could also run RepeatMasker inside of MAKER. However, I ran RM using two options: -lib (de novo) and -species (known). I got ~45% repeats via de novo and ~4% repeats via known options. As I understood, RM inside of MAKER uses only RepBase repeat library and RepeatRunner protein database.

Anh-Dao

On 7/16/14 2:36 PM, "Carson Holt" wrote:

>When you ran Augustus separately, it should have created the parameters
>needed to run it. Now you should be able to run it inside of MAKER using
>the species name you just created.
>
>I'd also recommend letting MAKER run RepeatMasker for you rather than
>giving it the results as GFF3.
>
>--Carson
>
>On 7/16/14, 12:30 PM, "Nguyen, Anh-Dao (NIH/NHGRI) [C]" wrote:
>
>>Thanks Daniel for your quick response.
>>
>>I did not use the parameter file of other organism when running Augustus.
>>I created the parameter file for the genome following their instructions.
>>There were multiple steps to train and run Augustus (Creating gene
>>structures for training AUGUSTUS with CEGMA => parameter file will be
>>created; Creating Hints for AUGUSTUS from ESTs/cDNA sequences;
>>Incorporating Illumina RNAseq into AUGUSTUS with GSNAP, etc.)
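Daniel's suggestion earlier in this thread was to concatenate the two RepeatMasker GFF3 files and possibly sort them by genomic location before handing the single file to rm_gff. A minimal Python sketch of that step follows. The function name merge_gff3_lines and the file names in the note below are made up for illustration, and the thread does not confirm whether MAKER actually requires sorted input. Carson's reply also recommends letting MAKER run RepeatMasker itself rather than supplying GFF3.

```python
def merge_gff3_lines(*sources):
    """Merge GFF3 feature lines from several line iterables.

    Features are ordered by sequence ID, then start, then longest first,
    so a parent feature precedes children that share its start coordinate.
    """
    features = []
    for lines in sources:
        for raw in lines:
            line = raw.rstrip("\r\n")
            if line.startswith("##FASTA"):
                break  # embedded sequence follows; no more features in this source
            if not line or line.startswith("#"):
                continue  # skip blank lines, comments and directives
            cols = line.split("\t")
            if len(cols) != 9:
                raise ValueError(f"expected 9 tab-separated columns: {line!r}")
            start, end = int(cols[3]), int(cols[4])
            features.append((cols[0], start, -end, line))
    # sorted() is stable, so features with identical keys keep their input order.
    ordered = sorted(features, key=lambda f: f[:3])
    return ["##gff-version 3"] + [f[3] for f in ordered]
```

With two hypothetical files rm_denovo.gff3 and rm_metazoa.gff3, one could open both, pass the file objects to merge_gff3_lines, write the returned lines to rm_merged.gff3, and then set rm_gff=rm_merged.gff3 in maker_opts.ctl.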
From nguyenan at mail.nih.gov Wed Jul 16 13:16:43 2014
From: nguyenan at mail.nih.gov (Nguyen, Anh-Dao (NIH/NHGRI) [C])
Date: Wed, 16 Jul 2014 19:16:43 +0000
Subject: [maker-devel] Maker_opts.ctl
In-Reply-To:
References: <5B00AEBB-D242-4C7C-A40F-C1BF3EC48C96@genetics.utah.edu>
Message-ID:

I forgot to mention that I ran RepeatModeler on the genome first, then used the output of RepeatModeler to submit to RepeatMasker using the -lib option (de novo). For the -species option, I used metazoa.

Anh-Dao

On 7/16/14 3:07 PM, "Nguyen, Anh-Dao (NIH/NHGRI) [C]" wrote:

>I will run Augustus and FGENESH++ inside of MAKER using the parameter
>files for Augustus.
>I could also run RepeatMasker inside of MAKER. However, I ran RM using two
>options: -lib (de novo) and -species (known). I got ~ 45% repeats via de
>novo and ~ 4% repeats via known options. As I understood, RM inside of
>MAKER uses only RepBase repeat library and RepeatRunner protein database.
>
>Anh-Dao
>
>On 7/16/14 2:36 PM, "Carson Holt" wrote:
>
>>When you ran Augustus separately, it should have created the parameters
>>needed to run it. Now you should be able to run it inside of MAKER using
>>the species name you just created.
>>
>>I'd also recommend letting MAKER run RepeatMasker for you rather than
>>giving it the results as GFF3.
>>
>>--Carson
>>
>>On 7/16/14, 12:30 PM, "Nguyen, Anh-Dao (NIH/NHGRI) [C]" wrote:
>>
>>>Thanks Daniel for your quick response.
>>>
>>>I did not use the parameter file of other organism when running Augustus.
>>>I created the parameter file for the genome following their instructions.
>>>There were multiple steps to train and run Augustus (Creating gene
>>>structures for training AUGUSTUS with CEGMA => parameter file will be
>>>created; Creating Hints for AUGUSTUS from ESTs/cDNA sequences;
>>>Incorporating Illumina RNAseq into AUGUSTUS with GSNAP, etc.)
>>>As I mentioned the reason why I ran Augustus separately, because Augustus
>>>has not trained that genome (no parameter file exists). Otherwise I would
>>>run Augustus inside MAKER.
>>>
>>>You suggested to use rm_gff option to specify RepeatMasker output (sure I
>>>will convert them to .gff3 formatted files). Can I submit two RM .gff3
>>>files, separated by comma?
>>>
>>>Anh-Dao
>>>
>>>On 7/16/14 2:13 PM, "Daniel Ence" wrote:
>>>
>>>>Hi Anh-Dao,
>>>>
>>>>In the maker_opts.ctl file, there are options for est and protein
>>>>evidence.
From carsonhh at gmail.com Wed Jul 16 13:17:31 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 16 Jul 2014 13:17:31 -0600
Subject: [maker-devel] Maker_opts.ctl
In-Reply-To:
References: <5B00AEBB-D242-4C7C-A40F-C1BF3EC48C96@genetics.utah.edu>
Message-ID:

No. You can provide both to MAKER. The options are model_org= and rmlib=. By letting MAKER handle repeat masking it will differentiate repeat types and use soft masking for some and hard masking for others. This increases sensitivity of evidence alignments while still maintaining specificity.

--Carson

On 7/16/14, 1:07 PM, "Nguyen, Anh-Dao (NIH/NHGRI) [C]" wrote:

>I will run Augustus and FGENESH++ inside of MAKER using the parameter
>files for Augustus.
>I could also run RepeatMasker inside of MAKER. However, I ran RM using two
>options: -lib (de novo) and -species (known). I got ~ 45% repeats via de
>novo and ~ 4% repeats via known options. As I understood, RM inside of
>MAKER uses only RepBase repeat library and RepeatRunner protein database.
From nguyenan at mail.nih.gov Wed Jul 16 13:28:33 2014
From: nguyenan at mail.nih.gov (Nguyen, Anh-Dao (NIH/NHGRI) [C])
Date: Wed, 16 Jul 2014 19:28:33 +0000
Subject: [maker-devel] Maker_opts.ctl
In-Reply-To:
References: <5B00AEBB-D242-4C7C-A40F-C1BF3EC48C96@genetics.utah.edu>
Message-ID:

By default, model_org=all. Can I use the de novo repeat library predicted by RepeatModeler for the rmlib option?

Anh-Dao

On 7/16/14 3:17 PM, "Carson Holt" wrote:

>No. You can provide both to MAKER. The options are model_org= and rmlib=.
>By letting MAKER handle repeat masking it will differentiate repeat types
>and use soft masking for some and hard masking for others. This increases
>sensitivity of evidence alignments while still maintaining specificity.
>
>--Carson
>
>On 7/16/14, 1:07 PM, "Nguyen, Anh-Dao (NIH/NHGRI) [C]" wrote:
>
>>I will run Augustus and FGENESH++ inside of MAKER using the parameter
>>files for Augustus.
>>I could also run RepeatMasker inside of MAKER. However, I ran RM using two
>>options: -lib (de novo) and -species (known). I got ~ 45% repeats via de
>>novo and ~ 4% repeats via known options. As I understood, RM inside of
>>MAKER uses only RepBase repeat library and RepeatRunner protein database.
>>
>>Anh-Dao
>>
>>On 7/16/14 2:36 PM, "Carson Holt" wrote:
>>
>>>When you ran Augustus separately, it should have created the parameters
>>>needed to run it. Now you should be able to run it inside of MAKER using
>>>the species name you just created.
From carsonhh at gmail.com Wed Jul 16 13:32:02 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 16 Jul 2014 13:32:02 -0600
Subject: [maker-devel] Maker_opts.ctl
In-Reply-To:
References: <5B00AEBB-D242-4C7C-A40F-C1BF3EC48C96@genetics.utah.edu>
Message-ID:

'all' will use the whole of RepBase, or you can do 'metazoa' like your previous run. Then provide the RepeatModeler file to rmlib=

--Carson

On 7/16/14, 1:28 PM, "Nguyen, Anh-Dao (NIH/NHGRI) [C]" wrote:

>By default, model_org=all. Can I use the de novo repeat library predicted
>by RepeatModeler for the rmlib option?
>
>Anh-Dao
>
>On 7/16/14 3:17 PM, "Carson Holt" wrote:
>
>>No. You can provide both to MAKER. The options are model_org= and rmlib=.
>>By letting MAKER handle repeat masking it will differentiate repeat types
>>and use soft masking for some and hard masking for others. This increases
>>sensitivity of evidence alignments while still maintaining specificity.
>>
>>--Carson
>>
>>On 7/16/14, 1:07 PM, "Nguyen, Anh-Dao (NIH/NHGRI) [C]" wrote:
>>
>>>I will run Augustus and FGENESH++ inside of MAKER using the parameter
>>>files for Augustus.
>>>I could also run RepeatMasker inside of MAKER. However, I ran RM using two
>>>options: -lib (de novo) and -species (known). I got ~ 45% repeats via de
>>>novo and ~ 4% repeats via known options. As I understood, RM inside of
>>>MAKER uses only RepBase repeat library and RepeatRunner protein
>>>database.
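Carson's advice comes down to two settings in maker_opts.ctl: model_org= (either 'all' or 'metazoa') and rmlib= (the RepeatModeler library). As a rough sketch only, the Python helper below rewrites key=value lines in control-file text while keeping any trailing comment. The helper name set_ctl_options, the shortened template, and the file name repeatmodeler_library.fa are illustrative assumptions and not part of MAKER.

```python
import re

# Matches "key=value #comment" lines as they appear in MAKER control files.
_CTL_LINE = re.compile(r"^(\w+)=([^#]*)(#.*)?$")


def set_ctl_options(ctl_text, options):
    """Return ctl_text with the given keys set, keeping each line's comment."""
    remaining = dict(options)
    out = []
    for line in ctl_text.splitlines():
        match = _CTL_LINE.match(line)
        if match and match.group(1) in remaining:
            key = match.group(1)
            comment = match.group(3) or ""
            out.append(f"{key}={remaining.pop(key)} {comment}".rstrip())
        else:
            out.append(line)
    if remaining:
        # Refuse to silently ignore a misspelled option name.
        raise KeyError(f"not found in control file: {sorted(remaining)}")
    return "\n".join(out) + "\n"


# Shortened stand-in for the repeat-masking part of maker_opts.ctl;
# the comments are paraphrased, not copied from MAKER.
template = (
    "#-----Repeat Masking\n"
    "model_org=all #RepBase model organism\n"
    "rmlib= #organism-specific repeat library (fasta)\n"
)

print(set_ctl_options(template, {
    "model_org": "metazoa",               # value used in the earlier RM run
    "rmlib": "repeatmodeler_library.fa",  # hypothetical RepeatModeler output
}))
```

The same helper could set the other options named in this thread (est, protein, est_gff, pred_gff), for example by joining several FASTA file names with commas for est.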
From nguyenan at mail.nih.gov Thu Jul 17 08:19:34 2014
From: nguyenan at mail.nih.gov (Nguyen, Anh-Dao (NIH/NHGRI) [C])
Date: Thu, 17 Jul 2014 14:19:34 +0000
Subject: [maker-devel] Maker_opts.ctl
In-Reply-To:
References: <5B00AEBB-D242-4C7C-A40F-C1BF3EC48C96@genetics.utah.edu>
Message-ID:

I am not sure which fgenesh executable file I should use.

fgenesh= #location of fgenesh executable

When I run FGENESH++, I need to run the run_pipe.pl script. Sure you need to specify a list of other executable programs (such as ppd, ppdn+, etc.)

Anh-Dao

On 7/16/14 3:32 PM, "Carson Holt" wrote:

>'all' will use the whole of RepBase, or you can do 'metazoa' like your
>previous run. Then provide the RepeatModeler file to rmlib=
>
>--Carson
>
>On 7/16/14, 1:28 PM, "Nguyen, Anh-Dao (NIH/NHGRI) [C]" wrote:
>
>>By default, model_org=all. Can I use the de novo repeat library predicted
>>by RepeatModeler for the rmlib option?
>>
>>Anh-Dao
>>
>>On 7/16/14 3:17 PM, "Carson Holt" wrote:
>>
>>>No. You can provide both to MAKER. The options are model_org= and
>>>rmlib=.
>>>By letting MAKER handle repeat masking it will differentiate repeat types
>>>and use soft masking for some and hard masking for others.
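For the fgenesh= question, one quick check is whether an executable with that name is already on PATH. The sketch below assumes the binary is literally called fgenesh, and otherwise falls back to passing the precomputed predictions through pred_gff, the option Daniel named earlier in the thread for external predictions. The function name and the GFF3 file name are hypothetical.

```python
import shutil


def fgenesh_setting(pred_gff_path, exe_name="fgenesh"):
    """Pick a control-file setting depending on whether fgenesh is installed.

    Returns ("fgenesh", <full path>) when exe_name is found on PATH,
    otherwise ("pred_gff", pred_gff_path) so existing predictions are reused.
    """
    exe_path = shutil.which(exe_name)
    if exe_path is not None:
        return ("fgenesh", exe_path)
    return ("pred_gff", pred_gff_path)


option, value = fgenesh_setting("fgenesh_predictions.gff3")  # hypothetical file
print(f"{option}={value}")
```

The fgenesh= entry is the executable-location line quoted in the message above, while pred_gff= is an option in maker_opts.ctl.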
This >>>increases >>>sensitivity of evidence alignments while still maintaining specificity. >>> >>>--Carson >>> >>> >>> >>>On 7/16/14, 1:07 PM, "Nguyen, Anh-Dao (NIH/NHGRI) [C]" >>> wrote: >>> >>>>I will run Augustus and FGENESH++ inside of MAKER using the parameter >>>>files for Augustus. >>>>I could also run RepeatMasker inside of MAKER. However, I ran RM using >>>>two >>>>options: -lib (de novo) and -species (known). I got ~ 45% repeats via >>>>de >>>>novo and ~ 4% repeats via known options. As I understood, RM inside of >>>>MAKER uses only RepBase repeat library and RepeatRunner protein >>>>database. >>>> >>>>Anh-Dao >>>> >>>> >>>>On 7/16/14 2:36 PM, "Carson Holt" wrote: >>>> >>>>>When you ran Augustus separately, it should have created the >>>>>parameters >>>>>needed to run it. Now you should be able to run it inside of MAKER >>>>>using >>>>>the species name you just created. >>>>> >>>>>I'd also recommend letting MAKER run RepeatMasker for you rather than >>>>>giving it the results as GFF3. >>>>> >>>>>--Carson >>>>> >>>>> >>>>>On 7/16/14, 12:30 PM, "Nguyen, Anh-Dao (NIH/NHGRI) [C]" >>>>> wrote: >>>>> >>>>>>Thanks Daniel for your quick response. >>>>>> >>>>>>I did not use the parameter file of other organism when running >>>>>>Augustus. >>>>>>I created the parameter file for the genome following their >>>>>>instructions. >>>>>>There were multiple steps to train and run Augustus (Creating gene >>>>>>structures for training AUGUSTUS with CEGMA => parameter file will be >>>>>>created; Creating Hints for AUGUSTUS from ESTs/cDNA sequences; >>>>>>Incorporating Illumina RNAseq into AUGUSTUS with GSNAP, etc.) >>>>>>As I mentioned the reason why I ran Augustus separately, because >>>>>>Augustus >>>>>>has not trained that genome (no parameter file exists). Otherwise I >>>>>>would >>>>>>run Augustus inside MAKER. >>>>>> >>>>>>You suggested to use rm_gff option to specify RepeatMasker output >>>>>>(sure >>>>>>I >>>>>>will convert them to .gff3 formatted files). 
Can I submit two RM >>>>>>.gff3 >>>>>>files, separated by comma? >>>>>> >>>>>>Anh-Dao >>>>>> >>>>>> >>>>>>On 7/16/14 2:13 PM, "Daniel Ence" wrote: >>>>>> >>>>>>>Hi Anh-Dao, >>>>>>> >>>>>>>In the maker_opts.ctl file, there are options for est and protein >>>>>>>evidence. You?ll put all of your fasta est files together in a >>>>>>>command >>>>>>>separated list in the ?est" option, and all of your fasta protein >>>>>>>files >>>>>>>in a command separated list for the ?protein? option. >>>>>>> >>>>>>>You?ll specify the SNAP and Genemark files in their respective >>>>>>>options >>>>>>>in >>>>>>>the control file and pass the augustus and fgenesh predictions in >>>>>>>the >>>>>>>?pred_gff? option. >>>>>>> >>>>>>>If you have the RepeatMasker output in gff3 format you can give it >>>>>>>to >>>>>>>maker with the ?rm_gff? option. >>>>>>> >>>>>>>If you?ve converted the cufflinks output to gff3, you can give it to >>>>>>>maker with the ?est_gff? option. I?m pretty sure Trinity only gives >>>>>>>fasta >>>>>>>output, so you would put that in the ?est? option, along with all >>>>>>>the >>>>>>>other est fasta files. >>>>>>> >>>>>>>If Augustus isn?t trained for your particular organism, then you can >>>>>>>use >>>>>>>another organism that augustus is already trained for. The list of >>>>>>>species that augustus has parameter files for is in the README.txt >>>>>>>that >>>>>>>came with Augustus. I really recommend that you run Augustus from >>>>>>>inside >>>>>>>maker, because then you get all the benefits of maker passing >>>>>>>ext-based >>>>>>>hints to augustus at runtime, which can really improve Augustus? >>>>>>>predictive ability. >>>>>>> >>>>>>>When you ran the augustus gene prediction separately, did you use >>>>>>>another >>>>>>>organism?s parameter file? 
>>>>>>>
>>>>>>>Thanks,
>>>>>>>Daniel
>>>>>>>
>>>>>>>On Jul 16, 2014, at 11:15 AM, Nguyen, Anh-Dao (NIH/NHGRI) [C] wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I would like to conduct a genome annotation and have the following
>>>>>>>> data:
>>>>>>>> - Two separate RepeatMasker outputs (using -lib and -species options)
>>>>>>>> - ESTs and RACE (fasta)
>>>>>>>> - proteins (fasta)
>>>>>>>> - proteins of related organisms (fasta)
>>>>>>>> - SNAP's .hmm file (ran CEGMA, then used cegma2zff.pl to convert to
>>>>>>>>   ZFF format, etc.)
>>>>>>>> - GeneMark's .hmm file (es.mod file from running gm_es.pl)
>>>>>>>> - FGENESH++ and Augustus gene predictions. I wrote scripts to convert
>>>>>>>>   the outputs to .gff3 files. I ran the Augustus gene prediction
>>>>>>>>   separately because the genome has never been trained for Augustus.
>>>>>>>> - Cufflinks and Trinity from RNA-Seq
>>>>>>>>
>>>>>>>> Could you please let me know how I can specify parameters in the
>>>>>>>> maker_opts.ctl file?
>>>>>>>> Or do you have other suggestions to re-do the data listed above?
>>>>>>>>
>>>>>>>> Thanks.
>>>>>>>> Anh-Dao
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> maker-devel mailing list
>>>>>>>> maker-devel at box290.bluehost.com
>>>>>>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>>>>>>
>>>>>>
>>>>>>_______________________________________________
>>>>>>maker-devel mailing list
>>>>>>maker-devel at box290.bluehost.com
>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>>>>
>>>>
>>>
>>
>

From carsonhh at gmail.com Fri Jul 18 11:04:09 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 18 Jul 2014 11:04:09 -0600
Subject: [maker-devel] Maker_opts.ctl
In-Reply-To:
References: <5B00AEBB-D242-4C7C-A40F-C1BF3EC48C96@genetics.utah.edu>
Message-ID:

It should just be 'fgenesh'. If it's not there you can still just give the
GFF3.

--Carson

On 7/17/14, 8:19 AM, "Nguyen, Anh-Dao (NIH/NHGRI) [C]" wrote:

>I am not sure which fgenesh executable file I should use.
>
>fgenesh= #location of fgenesh executable
>
>When I run FGENESH++, I need to run the run_pipe.pl script. Sure you need
>to specify a list of other executable programs (such as ppd, ppdn+, etc)
>
>Anh-Dao
>
>On 7/16/14 3:32 PM, "Carson Holt" wrote:
>
>>'all' will use the whole of RepBase, or you can do 'metazoa' like your
>>previous run. Then provide the RepeatModeler file to rmlib=
>>
>>--Carson
>>
>>On 7/16/14, 1:28 PM, "Nguyen, Anh-Dao (NIH/NHGRI) [C]" wrote:
>>
>>>By default, model_org=all. Can I use the de novo repeat library
>>>predicted by RepeatModeler for the rmlib option?
>>>
>>>Anh-Dao
>>>
>>>On 7/16/14 3:17 PM, "Carson Holt" wrote:
>>>
>>>>No. You can provide both to MAKER. The options are model_org= and
>>>>rmlib=.
>>>>By letting MAKER handle repeat masking it will differentiate repeat
>>>>types and use soft masking for some and hard masking for others. This
>>>>increases sensitivity of evidence alignments while still maintaining
>>>>specificity.
>>>>
>>>>--Carson
>>>>
>>>>On 7/16/14, 1:07 PM, "Nguyen, Anh-Dao (NIH/NHGRI) [C]" wrote:
>>>>
>>>>>I will run Augustus and FGENESH++ inside of MAKER using the parameter
>>>>>files for Augustus.
>>>>>I could also run RepeatMasker inside of MAKER. However, I ran RM using
>>>>>two options: -lib (de novo) and -species (known). I got ~45% repeats
>>>>>via de novo and ~4% repeats via known options. As I understood, RM
>>>>>inside of MAKER uses only the RepBase repeat library and the
>>>>>RepeatRunner protein database.
>>>>>
>>>>>Anh-Dao
>>>>>
>>>>>On 7/16/14 2:36 PM, "Carson Holt" wrote:
>>>>>
>>>>>>When you ran Augustus separately, it should have created the
>>>>>>parameters needed to run it. Now you should be able to run it inside
>>>>>>of MAKER using the species name you just created.
>>>>>>
>>>>>>I'd also recommend letting MAKER run RepeatMasker for you rather than
>>>>>>giving it the results as GFF3.
>>>>>>
>>>>>>--Carson
>>>>>>
>>>>>>On 7/16/14, 12:30 PM, "Nguyen, Anh-Dao (NIH/NHGRI) [C]" wrote:
>>>>>>
>>>>>>>Thanks Daniel for your quick response.
>>>>>>>
>>>>>>>I did not use the parameter file of another organism when running
>>>>>>>Augustus. I created the parameter file for the genome following their
>>>>>>>instructions. There were multiple steps to train and run Augustus
>>>>>>>(Creating gene structures for training AUGUSTUS with CEGMA =>
>>>>>>>parameter file will be created; Creating Hints for AUGUSTUS from
>>>>>>>ESTs/cDNA sequences; Incorporating Illumina RNAseq into AUGUSTUS with
>>>>>>>GSNAP, etc.)
>>>>>>>As I mentioned, I ran Augustus separately because Augustus has not
>>>>>>>been trained for that genome (no parameter file exists). Otherwise I
>>>>>>>would run Augustus inside MAKER.
>>>>>>>
>>>>>>>You suggested using the rm_gff option to specify RepeatMasker output
>>>>>>>(sure, I will convert them to .gff3 formatted files). Can I submit
>>>>>>>two RM .gff3 files, separated by comma?
>>>>>>>
>>>>>>>Anh-Dao
>>>>>>>
>>>>>>>On 7/16/14 2:13 PM, "Daniel Ence" wrote:
>>>>>>>
>>>>>>>>Hi Anh-Dao,
>>>>>>>>
>>>>>>>>In the maker_opts.ctl file, there are options for est and protein
>>>>>>>>evidence. You'll put all of your fasta est files together in a
>>>>>>>>comma-separated list in the "est" option, and all of your fasta
>>>>>>>>protein files in a comma-separated list for the "protein" option.
>>>>>>>>
>>>>>>>>You'll specify the SNAP and Genemark files in their respective
>>>>>>>>options in the control file and pass the augustus and fgenesh
>>>>>>>>predictions in the "pred_gff" option.
>>>>>>>>
>>>>>>>>If you have the RepeatMasker output in gff3 format you can give it
>>>>>>>>to maker with the "rm_gff" option.
>>>>>>>>
>>>>>>>>If you've converted the cufflinks output to gff3, you can give it
>>>>>>>>to maker with the "est_gff" option. I'm pretty sure Trinity only
>>>>>>>>gives fasta output, so you would put that in the "est" option,
>>>>>>>>along with all the other est fasta files.
>>>>>>>>
>>>>>>>>If Augustus isn't trained for your particular organism, then you
>>>>>>>>can use another organism that augustus is already trained for. The
>>>>>>>>list of species that augustus has parameter files for is in the
>>>>>>>>README.txt that came with Augustus. I really recommend that you run
>>>>>>>>Augustus from inside maker, because then you get all the benefits
>>>>>>>>of maker passing EST-based hints to augustus at runtime, which can
>>>>>>>>really improve Augustus' predictive ability.
>>>>>>>>
>>>>>>>>When you ran the augustus gene prediction separately, did you use
>>>>>>>>another organism's parameter file?
>>>>>>>>
>>>>>>>>Thanks,
>>>>>>>>Daniel
>>>>>>>>
>>>>>>>>On Jul 16, 2014, at 11:15 AM, Nguyen, Anh-Dao (NIH/NHGRI) [C] wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I would like to conduct a genome annotation and have the following
>>>>>>>>> data:
>>>>>>>>> - Two separate RepeatMasker outputs (using -lib and -species
>>>>>>>>>   options)
>>>>>>>>> - ESTs and RACE (fasta)
>>>>>>>>> - proteins (fasta)
>>>>>>>>> - proteins of related organisms (fasta)
>>>>>>>>> - SNAP's .hmm file (ran CEGMA, then used cegma2zff.pl to convert to
>>>>>>>>>   ZFF format, etc.)
>>>>>>>>> - GeneMark's .hmm file (es.mod file from running gm_es.pl)
>>>>>>>>> - FGENESH++ and Augustus gene predictions. I wrote scripts to
>>>>>>>>>   convert the outputs to .gff3 files. I ran the Augustus gene
>>>>>>>>>   prediction separately because the genome has never been trained
>>>>>>>>>   for Augustus.
>>>>>>>>> - Cufflinks and Trinity from RNA-Seq
>>>>>>>>>
>>>>>>>>> Could you please let me know how I can specify parameters in the
>>>>>>>>> maker_opts.ctl file?
>>>>>>>>> Or do you have other suggestions to re-do the data listed above?
>>>>>>>>>
>>>>>>>>> Thanks.
>>>>>>>>> Anh-Dao
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> maker-devel mailing list
>>>>>>>>> maker-devel at box290.bluehost.com
>>>>>>>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>>>>>>>
>>>>>>>
>>>>>>>_______________________________________________
>>>>>>>maker-devel mailing list
>>>>>>>maker-devel at box290.bluehost.com
>>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>>>>>
>>>>>
>>>>
>>>
>>
>

From jp.oeyen at uni-bonn.de Mon Jul 28 06:22:25 2014
From: jp.oeyen at uni-bonn.de (Jan Philip Oeyen)
Date: Mon, 28 Jul 2014 14:22:25 +0200
Subject: [maker-devel] Forks.pm error when running maker with dsindex
Message-ID:

Hi all,

we are currently seeing some unexpected errors when running maker on a
genome which is split into several parts. Our cluster admin reported the
following error message:

Argument "ALRM" isn't numeric in exit at /share/scientific_bin/perlmodules/lib/site_perl/5.14.2/x86_64-linux-thread-multi/forks.pm line 2188.
SIGTERM received
SIGTERM received
SIGTERM received

We were using maker with the '-g' option on a single genome which is split
into 20 parts, where 19 parts are equally large and the last contains about
20 sequences more. After that we ran Maker using dsindex to clean up the
output.

We are currently using maker v2.31 on 4 threads and forks v0.34.

If any further info is needed to clarify the problem, please let me know
and I will provide as much as possible.

Thank you for your help!

Best regards,
Jan Philip Oeyen
ZFMK // ZMB // University of Bonn

From mphoeppner at gmail.com Wed Jul 30 04:44:36 2014
From: mphoeppner at gmail.com (Marc Höppner)
Date: Wed, 30 Jul 2014 12:44:36 +0200
Subject: [maker-devel] Maker GFF output with features of 0 length
Message-ID: <5C45F418-018B-4ACC-B682-E5659DB7F102@gmail.com>

Hi,

I've found, more by accident, that many of the gene builds I have generated
with Maker (2.31.3) contain features with identical start and stop
positions. For example:

scaffold_2927 maker CDS 13013 13013 . + 1 ID=maker-scaffold_2927-augustus-gene-0.8-mRNA-1:cds;Parent=maker-scaffold_2927-augustus-gene-0.8-mRNA-1

This occurs seemingly randomly for all sorts of feature types and I have
only seen this when running Maker on full assemblies.
Before I start turning over every stone, any ideas about possible
explanations for this phenomenon? Is this likely some MPI-related
communication issue, or NFS problems with syncing data? Maker runs fine on
our system, but that doesn't mean that there aren't any cryptic issues that
only rear their head on occasions like these.

Regarding the frequency, out of 450.000 GFF lines, 270 were affected in the
case that I looked into the most. So it is pretty rare, but still...

I am currently using Maker with openmpi-1.7.4, and the file system is
mounted over NFS4 and IPoIB. I have now switched to Maker 2.31.6, but have
no strong reason to suspect that this will make a difference.

Regards,

Marc
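
For anyone who wants to check their own output for the pattern described in the message above, a minimal Python sketch along the following lines could list the affected features. It is not part of MAKER: the function name find_single_base_features is made up for illustration, and the sketch assumes standard nine-column, tab-delimited GFF3 with an optional ##FASTA section at the end. Note that GFF3 coordinates are 1-based and inclusive, so a feature with identical start and end spans a single base rather than zero bases.

```python
import sys


def find_single_base_features(gff_lines):
    """Yield (line_number, seqid, feature_type, position) for every
    feature whose start and end coordinates are identical.

    Hypothetical helper for illustration; assumes 9-column GFF3 input.
    """
    for line_number, line in enumerate(gff_lines, start=1):
        line = line.rstrip("\n")
        # Embedded sequence follows the ##FASTA directive; stop there.
        if line.startswith("##FASTA"):
            break
        # Skip blank lines, comments, and other directives.
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) != 9:
            continue
        seqid, _source, feature_type, start, end = columns[:5]
        try:
            start_pos, end_pos = int(start), int(end)
        except ValueError:
            # Not a valid feature line; ignore it.
            continue
        if start_pos == end_pos:
            yield line_number, seqid, feature_type, start_pos


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (file name is a placeholder): python script.py annotations.gff
    with open(sys.argv[1]) as handle:
        for hit in find_single_base_features(handle):
            print("\t".join(str(field) for field in hit))
```

Tallying the hits by feature type and by scaffold may help show whether they cluster in particular regions or are spread evenly across the assembly.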