From Patrick.TranVan at unil.ch Fri Jun 2 04:56:30 2017
From: Patrick.TranVan at unil.ch (Patrick Tran Van)
Date: Fri, 2 Jun 2017 09:56:30 +0000
Subject: [maker-devel] Advice on my pipeline
Message-ID: <6b029690bace4d3fbae77c0bb1bddce8@prdexch02.ad.unil.ch>

Hello,

This is my first time running MAKER for an insect genome annotation.

I have found various resources and tried to build a consensus from them. I am looking for your thoughts and advice about my pipeline: whether I can improve something, or whether I am doing anything useless.

What I have:
- RNA evidence: transcriptome
- Protein evidence: swissprot/uniprot + busco protein set of insect
- Cegma and busco results of my genome

1) Train SNAP with CEGMA.

2) Run MAKER (run A) with repeat masking, using the transcripts, the proteins, the new SNAP file (from step 1) and the augustus file (from busco).

3) Create a SNAP model from run A.

4) Run MAKER (run B) with the new SNAP model (from step 3), with est2genome=0 and protein2genome=0, providing the GFF file from run A (maker_gff=run_A.gff), turning off repeat masking (rm_pass=1), and using the previous mapping results (altest_pass=1 and protein_pass=1).

5) Create a SNAP model from run B.

6) Run MAKER (run C) with the new SNAP model (from step 5), with est2genome=0 and protein2genome=0, providing the GFF file from run B (maker_gff=run_B.gff), turning off repeat masking (rm_pass=1), and using the previous mapping results (altest_pass=1 and protein_pass=1).

7) Create a SNAP model from run C AND create an Augustus gene model from run C.

8) Run MAKER (run D) with the new SNAP model (from step 7) + the AUGUSTUS file (from step 7), with est2genome=0 and protein2genome=0, providing the GFF file from run C (maker_gff=run_C.gff), turning off repeat masking (rm_pass=1), and using the previous mapping results (altest_pass=1 and protein_pass=1). + Use keep_preds=1.

Does it seem coherent?
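The settings listed for run B in step 4 could be written into a copy of maker_opts.ctl with a small script. This is only a sketch under assumptions: the option names are the ones quoted in the message plus snaphmm, the file name run_A.hmm is a hypothetical placeholder, and the control file is assumed to use `key=value #comment` lines.

```python
import re

# Run B settings as described in step 4. "run_A.hmm" is a hypothetical
# file name for the SNAP model trained in step 3.
RUN_B_OVERRIDES = {
    "snaphmm": "run_A.hmm",
    "est2genome": "0",
    "protein2genome": "0",
    "maker_gff": "run_A.gff",
    "rm_pass": "1",
    "altest_pass": "1",
    "protein_pass": "1",
}


def apply_overrides(ctl_text, overrides):
    """Return ctl_text with matching `key=value` lines updated.

    Trailing `#comment` text on each line is kept. Raises KeyError if an
    option is not present in the control file at all.
    """
    pending = dict(overrides)
    out = []
    for line in ctl_text.splitlines():
        m = re.match(r"^(\w+)=([^#]*)(#.*)?$", line)
        if m and m.group(1) in pending:
            key = m.group(1)
            comment = m.group(3) or ""
            line = f"{key}={pending.pop(key)} {comment}".rstrip()
        out.append(line)
    if pending:
        raise KeyError(f"options not found in control file: {sorted(pending)}")
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Tiny stand-in for a control file; a real maker_opts.ctl has many more lines.
    example = "est2genome=1 #infer predictions from ESTs\nmaker_gff= #MAKER derived GFF3 file\n"
    print(apply_overrides(example, {"est2genome": "0", "maker_gff": "run_A.gff"}))
```

Runs C and D would use the same function with a different dictionary (run_B.gff, run_C.gff, and keep_preds=1 for run D).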
Cheers,

Patrick Tran Van

Groups Chapuisat, Robinson-Rechavi & Schwander
Department of Ecology and Evolution
University of Lausanne
Le Biophore
CH-1015 Lausanne
Switzerland
Office 3206

From carson.holt at genetics.utah.edu Mon Jun 5 13:24:47 2017
From: carson.holt at genetics.utah.edu (Carson Holt)
Date: Mon, 5 Jun 2017 18:24:47 +0000
Subject: [maker-devel] Plant genome annotation
Message-ID: <5DD47274-C5FA-404D-A7EC-AADE0325EA03@genetics.utah.edu>

MAKER wiki -> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014

Book chapter on MAKER protocol -> http://www.yandell-lab.org/publications/pdf/maker_current_protocols.pdf

Mailing list -> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org

Searchable archive of common maker related questions -> https://groups.google.com/forum/#!forum/maker-devel

--Carson

On Jun 5, 2017, at 8:18 AM, Muhammad Arslan wrote:

Dear Carson,

I am writing to ask you a favor regarding the usage of MAKER-P. I want to use the application for plant genome annotation but have very little knowledge of how to do so. Is there a step-by-step tutorial available?

I would be very thankful to you!

Best regards

--
Muhammad Arslan
PhD Student / Guest Scientist
Department of Environmental Biotechnology
Helmholtz Centre for Environmental Research - UFZ
Permoserstraße 15, 04318 Leipzig, Germany
Phone +49,341,235 1696, muhammad.arslan at ufz.de, www.ufz.de

From carsonhh at gmail.com Mon Jun 5 13:29:57 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 5 Jun 2017 12:29:57 -0600
Subject: [maker-devel] Advice on my pipeline
In-Reply-To: <6b029690bace4d3fbae77c0bb1bddce8@prdexch02.ad.unil.ch>
References: <6b029690bace4d3fbae77c0bb1bddce8@prdexch02.ad.unil.ch>

Your plan sounds good. A couple of related notes.

Insect genomes tend to have high gene density, so gene merging will be the primary difficulty. You can avoid merging of mRNA-seq evidence by using options like jaccard_clip in Trinity. Then use avoid_est_fusion=1 inside of MAKER.

Also, it is more convenient to do each run in the same directory rather than supplying the previous run as GFF3 input. MAKER will automatically recycle previous results archived in the run directory when you do this. Using the maker_gff option is really more for getting data into the run from jobs performed a long time ago (so they can't be run in the same directory).

--Carson

From glenna.kramer at utoronto.ca Wed Jun 21 10:51:11 2017
From: glenna.kramer at utoronto.ca (Glenna Kramer)
Date: Wed, 21 Jun 2017 15:51:11 +0000
Subject: [maker-devel] How to address errors encountered in process of submitting genome
Message-ID: <4781C7F0FC2DAA4BBC18FC44DC9D09AE015065617A@ArborExMBx4P.UTORARBOR.UTORAD.Utoronto.ca>

Hi there,

I am attempting to submit a fungal genome to NCBI and have run into quite a few errors running tbl2asn.
I know this isn't directly related to MAKER, but I'm hoping that someone here has been through this process and would be able to give some insight (or at least point me in the direction of another knowledgeable source)!

Here is a general overview of the process that I have used so far:

1. Converted MAKER GFF3 files to tbl files using GAG (ran options remove_introns_shorter_than 10 and fix_start_stop and added functional annotation as well). This seems to work well to convert the GFF3 to tbl.
2. Used tbl2asn to convert the tbl to a sqn file. This also seems to work, but I am getting lots of errors in the .val output file, which I am unsure how to address. There are...

    56 ERROR: SEQ_FEAT.BadTrailingHyphen
  1525 ERROR: SEQ_FEAT.InternalStop
  1142 ERROR: SEQ_FEAT.NoStop
    10 ERROR: SEQ_FEAT.PartialProblem
  2368 ERROR: SEQ_FEAT.StartCodon
  2368 ERROR: SEQ_INST.BadProteinStart
  1525 ERROR: SEQ_INST.StopInProtein
  8821 WARNING: SEQ_FEAT.NotSpliceConsensusAcceptor
  8543 WARNING: SEQ_FEAT.NotSpliceConsensusDonor
    83 WARNING: SEQ_FEAT.PartialProblem
     1 WARNING: SEQ_FEAT.ProteinNameEndsInBracket
    19 WARNING: SEQ_FEAT.ShortExon
   270 INFO: SEQ_FEAT.RareSpliceConsensusDonor

Also, just as a side note, has anyone tried the new table2asn_GFF converter that is available to convert GFF3 directly to sqn? I was thinking that I would give that a shot, hoping that it would help with some of these errors. However, I was instantly met with an error as well: "Too many positional arguments (1), the offending value: ends."

Thank you so much in advance for any help you are able to give!

Glenna
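A summary like the one in this message can be grouped by severity with a few lines of Python, which makes it easier to see which categories dominate. This is a minimal sketch that assumes the `COUNT SEVERITY: CODE` layout exactly as pasted above; it does not parse the full .val file, whose format is not shown here.

```python
import re
from collections import defaultdict

# Matches lines such as "1525 ERROR: SEQ_FEAT.InternalStop"
SUMMARY_LINE = re.compile(r"^\s*(\d+)\s+(ERROR|WARNING|INFO):\s+(\S+)\s*$")


def summarize(text):
    """Group 'COUNT SEVERITY: CODE' lines by severity, largest counts first."""
    by_severity = defaultdict(list)
    for line in text.splitlines():
        m = SUMMARY_LINE.match(line)
        if m:
            by_severity[m.group(2)].append((int(m.group(1)), m.group(3)))
    return {sev: sorted(items, reverse=True) for sev, items in by_severity.items()}


if __name__ == "__main__":
    pasted = """
      56 ERROR: SEQ_FEAT.BadTrailingHyphen
    1525 ERROR: SEQ_FEAT.InternalStop
    2368 ERROR: SEQ_FEAT.StartCodon
    2368 ERROR: SEQ_INST.BadProteinStart
    1525 ERROR: SEQ_INST.StopInProtein
    """
    for severity, items in summarize(pasted).items():
        for count, code in items:
            print(severity, count, code)
```

In the list above, StartCodon and BadProteinStart both appear 2368 times, and InternalStop and StopInProtein both appear 1525 times. Matching counts hint, but do not prove, that each pair reports the same CDS features twice.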
From jason.stajich at gmail.com Wed Jun 21 13:25:43 2017
From: jason.stajich at gmail.com (Jason Stajich)
Date: Wed, 21 Jun 2017 18:25:43 +0000
Subject: [maker-devel] How to address errors encountered in process of submitting genome
In-Reply-To: <4781C7F0FC2DAA4BBC18FC44DC9D09AE015065617A@ArborExMBx4P.UTORARBOR.UTORAD.Utoronto.ca>
References: <4781C7F0FC2DAA4BBC18FC44DC9D09AE015065617A@ArborExMBx4P.UTORARBOR.UTORAD.Utoronto.ca>

Glenna -

FWIW - I've switched to doing an EVM cleanup with funannotate after MAKER due to these issues with MAKER and fungal genomes I submit.

Jason

From glenna.kramer at utoronto.ca Wed Jun 21 21:33:15 2017
From: glenna.kramer at utoronto.ca (Glenna Kramer)
Date: Thu, 22 Jun 2017 02:33:15 +0000
Subject: [maker-devel] How to address errors encountered in process of submitting genome
References: <4781C7F0FC2DAA4BBC18FC44DC9D09AE015065617A@ArborExMBx4P.UTORARBOR.UTORAD.Utoronto.ca>
Message-ID: <4781C7F0FC2DAA4BBC18FC44DC9D09AE01506571C4@ArborExMBx4P.UTORARBOR.UTORAD.Utoronto.ca>

Thank you for the thought!

So, to clarify, do you use funannotate predict on the maker gff files, similar to the last example given here? https://github.com/nextgenusfs/funannotate/wiki. I'm completely willing to give it a shot...

This brings up other questions for me, though. How do you do your functional annotation? Maker?
I noticed that funannotate will do functional annotation, but currently I am adding in my functional annotation using GAG when converting the maker gff to tbl.

Also, from what I understand, funannotate will output a gbk from the gff. Do you have a particular file conversion tool to get that into the sqn format that you've had success with?

Thanks,
Glenna

From jason.stajich at gmail.com Wed Jun 21 23:09:50 2017
From: jason.stajich at gmail.com (Jason Stajich)
Date: Thu, 22 Jun 2017 04:09:50 +0000
Subject: [maker-devel] How to address errors encountered in process of submitting genome
In-Reply-To: <4781C7F0FC2DAA4BBC18FC44DC9D09AE01506571C4@ArborExMBx4P.UTORARBOR.UTORAD.Utoronto.ca>
References: <4781C7F0FC2DAA4BBC18FC44DC9D09AE015065617A@ArborExMBx4P.UTORARBOR.UTORAD.Utoronto.ca> <4781C7F0FC2DAA4BBC18FC44DC9D09AE01506571C4@ArborExMBx4P.UTORARBOR.UTORAD.Utoronto.ca>

Quick answers for now.

A) You can feed the maker gff to funannotate or run it alone.

B) I run the annotate step in funannotate but generally transfer only swissprot annotations as the product description. I have to manually edit to remove systematic ORF names in product descriptions that NCBI will flag, e.g. YAL001W, AN1234, ARB_xx.
You have to edit the annotations.swissprot.txt file to use the product descriptor if you want to promote these to full product descriptions in the resulting .tbl file. You may want to run iprscan locally, or wait for it to run remotely, to get GO assignments included.

C) You get .tbl and .fsa files from GAG, and these are processed by tbl2asn to get the sqn file. All are produced in the result file. All automatic.

Jason

--
Jason Stajich
jason.stajich at gmail.com
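The manual check described in (B), finding product descriptions that still contain systematic ORF names, could be partly automated. The sketch below is illustrative only: the regular expressions are assumptions modelled on the three examples in the message (YAL001W, AN1234, ARB_xx), not NCBI's actual rules, and the input is a plain dictionary rather than the annotations.swissprot.txt file, whose layout is not described in this thread.

```python
import re

# Illustrative patterns modelled on the examples in the message.
# They are assumptions, not an official list of what NCBI flags.
LOCUS_LIKE = [
    re.compile(r"\bY[A-P][LR]\d{3}[WC](?:-[A-Z])?\b"),  # yeast-style, e.g. YAL001W
    re.compile(r"\bAN\d{4,5}\b"),                        # e.g. AN1234
    re.compile(r"\bARB_\w+\b"),                          # e.g. ARB_xx
]


def flag_products(products):
    """Return (gene_id, product) pairs whose product text contains a locus-like name."""
    flagged = []
    for gene_id, product in products.items():
        if any(pattern.search(product) for pattern in LOCUS_LIKE):
            flagged.append((gene_id, product))
    return flagged


if __name__ == "__main__":
    # Hypothetical gene IDs and descriptions, for demonstration only.
    demo = {
        "gene_0001": "Protein YAL001W homolog",
        "gene_0002": "Alcohol dehydrogenase",
        "gene_0003": "Uncharacterized protein AN1234",
    }
    for gene_id, product in flag_products(demo):
        print(gene_id, product, sep="\t")
```

The flagged entries would still need a human decision on what the replacement product name should be.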
From tfallon at mit.edu Tue Jun 13 12:35:28 2017
From: tfallon at mit.edu (Tim Fallon)
Date: Tue, 13 Jun 2017 13:35:28 -0400
Subject: [maker-devel] Maker protein match & tandem similar genes
Message-ID: <6AE52D93-3F7D-4389-BE7B-BD3033F3F316@mit.edu>

Hi there,

I am aligning reference proteins to an insect genome through Maker, in preparation for using the gene models from the protein alignments as evidence to train SNAP (alongside de novo assembled RNA-Seq). I also plan on passing the protein alignments to a future Maker run as hints for SNAP / Augustus.

I've noticed that the maker blastx "protein_match" feature, which I presume is a result of Maker trying to make the blastx HSPs contiguous to format as a reference for exonerate (this Maker run did have protein2genome turned on), tends to fuse tandem genes from the same gene family. See attached image.

The red regions highlight two de novo assembled transcripts which I aligned manually, from two genes that are homologous. The top track is the blastx "match_part" features, the bottom track is the blastx "protein_match" features. You can see that the protein_match fuses the two genes, using ~1000 bp in an intervening region that doesn't have blastx HSP support in the blastx "match_part" track. The trick seems to be that a single reference protein has blastx matches on both the left and right gene.

Clearly this isn't a good gene model to train SNAP with, but would this misannotation screw up the hints passed to pretrained SNAP / Augustus?

Is there any way to prevent this protein_match fusing of adjacent similar genes from happening? For species that are closer, I've set "eval_blastx" to be a lot more stringent (1e-50), and in that case the genes don't get fused (but, with that level of stringent search, it is more like an orthology search, rather than just annotating general protein similarity). I do have (rare) introns of ~1000 bp, so I wouldn't want to change the Maker "split_hit" parameter to be too low.
All the best,
-Tim

Timothy R. Fallon
PhD candidate
Laboratory of Jing-Ke Weng
Department of Biology
MIT

tfallon at mit.edu

[Attachment scrubbed by the archive: protein_match_example.png (image/png, 142379 bytes)]

From tfallon at mit.edu Fri Jun 16 10:07:14 2017
From: tfallon at mit.edu (Tim Fallon)
Date: Fri, 16 Jun 2017 11:07:14 -0400
Subject: [maker-devel] Database disk image is malformed error
Message-ID: <41AECDEF-7E2B-4143-BDBC-F10937AEDE3B@mit.edu>

Hi there,

I've been running MAKER in a 2 stage way using MPI, to annotate a de novo insect genome. By two stage, I mean that for stage 1 I have a lot of independent folders / maker runs (e.g. individual reference insect proteomes passed as FASTA with protein2genome=1), and then for stage 2, in a separate folder, I am concatenating all that evidence from stage 1 (using gff3_merge -o) and passing it as GFF parameters.

Stage 2 has been crashing. It takes a very long time to set up the SQLite DB from the GFFs (~24 hours, with 39 MPI CPUs), and then once it is all loaded it works for a couple of seconds and then crashes with things like this:

"DBD::SQLite::db selectcol_arrayref failed: database disk image is malformed at /lab/solexa_weng/testtube/maker_3.00_beta/bin/../lib/GFFDB.pm line 525."

I am passing a lot of evidence to stage 2, probably more than people typically pass (the GFFs together are 44GB, whereas the resulting *.db file is 95G).

Have you seen this error before? I'm thinking it could be a couple of possibilities:

1) Running up against SQLite size / concurrency constraints where the .db ends up being malformed due to MPI / passing too much evidence. Solution -> Load GFFs without MPI, or load less evidence.
2) GFFs are malformed (they pass validation with GT). Solution -> Remove the malformed GFF evidence, although I haven't been able to track any malformed GFFs down.
3) Identifiers in the GFF that are unique when in a single file become non-unique. Solution -> Manually rename IDs in passed GFF files to be unique.

Thoughts?

All the best,
-Tim

Timothy R. Fallon
PhD candidate
Laboratory of Jing-Ke Weng
Department of Biology
MIT

tfallon at mit.edu

From aravindp at imcb.a-star.edu.sg Thu Jun 22 01:39:28 2017
From: aravindp at imcb.a-star.edu.sg (Aravind PRASAD)
Date: Thu, 22 Jun 2017 06:39:28 +0000
Subject: [maker-devel] Maker annotation of large scaffolds

Hi All,

I'm trying to annotate a fish genome using the Maker pipeline. It could finish the annotation for most scaffolds, except 5 of them which are around 100M base pairs in size. The cluster at our institute has a time limit of 24 hrs per job, and these scaffolds could not be annotated within that time.

Can you please suggest any other way of finishing the annotation for large scaffolds?

I thought of chunking up the scaffolds to run, but I'm afraid that would split a gene into two.

Thanks for your time.

Regards,

Aravind PRASAD :: Research Officer :: Comparative and Medical Genomics Lab :: Institute of Molecular and Cell Biology (IMCB) :: Agency for Science, Technology and Research (A*STAR)
61 Biopolis Drive :: #5-04 Proteos :: Singapore 138673 :: DID (+65) 6586 9573 :: Fax (+65) 6779 1117 :: http://www.imcb.a-star.edu.sg/
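The chunking idea raised in this message is usually done with overlapping windows, so that a gene cut at the edge of one window is still complete in its neighbour. The sketch below only computes window coordinates; the window and overlap sizes in the demo are hypothetical, a gene longer than the overlap can still be split, and reconciling annotations from the overlapping regions is not handled.

```python
def overlapping_windows(length, window, overlap):
    """Return 0-based, end-exclusive (start, end) windows covering [0, length).

    Consecutive windows share `overlap` bases.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    if length <= 0:
        return []
    step = window - overlap
    windows = []
    start = 0
    while True:
        end = min(start + window, length)
        windows.append((start, end))
        if end >= length:
            break
        start += step
    return windows


if __name__ == "__main__":
    # Hypothetical sizes: a 100 Mbp scaffold, 10 Mbp windows, 100 kb overlap.
    for start, end in overlapping_windows(100_000_000, 10_000_000, 100_000):
        print(start, end)
```

Each window would then be written out as its own FASTA record, with the start offset kept so that feature coordinates can be shifted back onto the original scaffold afterwards.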
From munholl at uwindsor.ca Thu Jun 22 10:43:22 2017
From: munholl at uwindsor.ca (Seth Munholland)
Date: Thu, 22 Jun 2017 11:43:22 -0400
Subject: [maker-devel] Maker annotation of large scaffolds

I would think splitting could work if you generate a sufficient overlap, i.e. 1-100k, 50-150k, etc. Reassembling the annotations for the overlap regions may be tricky if you get conflicting annotations though.

Seth Munholland, B.Sc.
Department of Biological Sciences
Rm. 304 Biology Building
University of Windsor
401 Sunset Ave. N9B 3P4
T: (519) 253-3000 Ext: 4755

From carsonhh at gmail.com Thu Jun 22 23:06:00 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 22 Jun 2017 22:06:00 -0600
Subject: [maker-devel] Maker annotation of large scaffolds

If running under MPI, the only step that should take a long time would be a final clustering step (the clustering is not parallelized). It should run in well under 24 hours though, so perhaps it is a memory issue or a feature depth issue. You can try running the contig by itself and setting all the blast_depth parameters in maker_bopts.ctl to 10 to help both. Otherwise, making a large overlap for subdivided contigs (50-100kb) should be enough. Alternatively, look for stretches of NNNNNN's in the contig and split on those.

--Carson

From carsonhh at gmail.com Thu Jun 22 23:15:09 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 22 Jun 2017 22:15:09 -0600
Subject: [maker-devel] Database disk image is malformed error
In-Reply-To: <41AECDEF-7E2B-4143-BDBC-F10937AEDE3B@mit.edu>
References: <41AECDEF-7E2B-4143-BDBC-F10937AEDE3B@mit.edu>
Message-ID: <925025C6-C48C-4E29-9A9F-A7CD50ECA211@gmail.com>

Don't use the GFF3 as input to the second stage. Use the original work directory, and just modify any parameters in the control file. MAKER will reuse old results and only delete things that require rerun.
Using the GFF3 as input is just a way to reuse MAKER data when the work directory is no longer available, and in most cases you will only pass in the genes (and not the evidence in the GFF3). Thanks, Carson > On Jun 16, 2017, at 9:07 AM, Tim Fallon wrote: > > Hi there, > > I?ve been running MAKER in a 2 stage way using MPI, to annotate a de novo insect genome. By two stage, I mean for stage 1 I have a lot of independent folders / maker runs (e.g. individuals reference insect proteomes passed as FASTA with protein2genome=1), and then for stage 2 in a separate folder I am concatenating all that evidence from Stage 1 (using gff3_merge -o) and passing it as GFF parameters. > > Stage 2 has been crashing. It takes a very long time to setup the SQLite DB from the (~24 hours, with 39 MPI CPUs), and then once it is all loaded it works for a couple seconds then crashes with things like this: > > "DBD::SQLite::db selectcol_arrayref failed: database disk image is malformed at /lab/solexa_weng/testtube/maker_3.00_beta/bin/../lib/GFFDB.pm line 525.? > > I am passing a lot of evidence to Stage 2, probably more than people typically pass (the GFFs together are 44GB, whereas the resulting *.db file is 95G). > > Have you seen this error before? I?m thinking it could be a couple possibilities: > 1) Running up against SQLite size / concurrency constraints where the .db ends up being malformed due to MPI / passing too much evidence. Solution -> Load GFFs without MPI, or load less evidence. > 2) GFFs are malformed (they pass validation with GT). Solution -> Remove the malformed GFF evidence, although I haven?t been able to track any malformed GFFs down. > 3) Identifiers in the GFF that are unique when in a single file, become non-unique. Solution -> Manually rename IDs in passed GFF files to be unique. > > Thoughts? > > All the best, > -Tim > > Timothy R. 
Fallon > PhD candidate > Laboratory of Jing-Ke Weng > Department of Biology > MIT > > tfallon at mit.edu > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Thu Jun 22 23:27:02 2017 From: carsonhh at gmail.com (Carson Holt) Date: Thu, 22 Jun 2017 22:27:02 -0600 Subject: [maker-devel] Maker protein match & tandem similar genes In-Reply-To: <6AE52D93-3F7D-4389-BE7B-BD3033F3F316@mit.edu> References: <6AE52D93-3F7D-4389-BE7B-BD3033F3F316@mit.edu> Message-ID: <6DBD3F6E-C2DA-47AB-AF6B-DEA396789C79@gmail.com> The protein_match features are the direct BLASTX results. Because of how BLAST works, if you have neighboring paralogs, it can place HSPs in both. So the final hit ends up being to large. The protein2genome feature is then the result of exonerate polishing these blast alignments (this will usually remove false merging and bad exon order). The protein2genome=1 option on the other hand just tell maker that you want to try and convert the exonerate hits directly into gene models (only do this for training and not final annotation). One way to drop the BLASTX results may be to filter the GFF3 results to keep only protein2genome features, pass those into protein_gff, and then turn off protein= for the next run. This forces the blastx results to be dropped. You may want to set blast_depth parameters to something like 10 in maker_bopts.ctl before doing this to trim per locus evidence depth to 10 if you are using too much input data. ?Carson > On Jun 13, 2017, at 11:35 AM, Tim Fallon wrote: > > Hi there, > > I am aligning reference proteins to an insect genome through Maker, in preparation for using the gene models from the protein alignments as evidence to train SNAP (alongside de-novo assembled RNA-Seq). 
I also plan on passing the protein alignments to a future Maker run as hints for SNAP / Augustus. > > I?ve noticed that the maker blastx "protein_match? feature, which I presume is a result of Maker trying to make the blastx HSPs contiguous to format as a reference for exonerate (this Maker run did have protein2genome turned on), tends to fuse tandem genes from the same gene family. See attached image. > > The red regions highlight two de novo assembled transcripts which I aligned manually, from two genes that are homologous. The top track is the blastx ?match_part? features, the bottom track is the blastx ?protein_match? features. You can see that the protein_match fuses the two genes, using ~1000 bp in an intervening region, that doesn?t have blastx HSP support in the blastx ?match_part? track. The trick seems to be that a single reference protein, has blastx matches on both the left and right gene. > > Cleary this isn?t a good gene model to train SNAP with, but would this misannotation screw up the hints passed to pretrained SNAP / Augustus? > > Is there anyway to prevent this protein_match fusing of adjacent similar genes from happening? For species that are closer, I?ve set the ?eval_blastx? to be a lot higher (1e-50), and in that case the genes don?t get fused (but, with that level of stringent search, it is more like an orthology search, rather than just annotating general protein similarity). I do have (rare) introns ~1000 bp, so I wouldn?t want to change the Maker ?split_hit? parameter to be too low. > > All the best, > -Tim > > Timothy R. Fallon > PhD candidate > Laboratory of Jing-Ke Weng > Department of Biology > MIT > > tfallon at mit.edu > > > > > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... 
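A rough sketch of the GFF3 filtering step suggested in the reply above (keeping only protein2genome features before passing them to protein_gff). It assumes the alignment source is recorded in column 2 of the MAKER GFF3 as "protein2genome", with "blastx" for the raw BLASTX hits; check your own files before relying on that. The function name keep_protein2genome and the example records are made up for illustration and are not part of MAKER.

```python
def keep_protein2genome(lines, source="protein2genome"):
    """Yield GFF3 comment/directive lines plus features whose source column equals `source`."""
    for line in lines:
        if line.startswith("##FASTA"):
            break  # anything after this marker is sequence, not annotation
        if line.startswith("#"):
            yield line  # keep directives such as ##gff-version
            continue
        cols = line.rstrip("\n").split("\t")
        # GFF3 feature lines have nine tab-separated columns; column 2 is the source
        if len(cols) == 9 and cols[1] == source:
            yield line


# Made-up records for illustration only
example = [
    "##gff-version 3\n",
    "scaf1\tblastx\tprotein_match\t100\t5000\t.\t+\t.\tID=hit1;Name=P1\n",
    "scaf1\tprotein2genome\tprotein_match\t100\t2000\t.\t+\t.\tID=hit2;Name=P1\n",
    "scaf1\tprotein2genome\tmatch_part\t100\t500\t.\t+\t.\tID=hit2:part1;Parent=hit2\n",
    "##FASTA\n",
    ">scaf1\n",
    "ACGT\n",
]

kept = list(keep_protein2genome(example))
print("".join(kept), end="")
```

To filter a real file, pass an open file handle instead of the example list and write the yielded lines to a new GFF3, which could then be supplied through protein_gff as described above.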
From carson.holt at genetics.utah.edu Thu Jun 22 23:31:58 2017
From: carson.holt at genetics.utah.edu (Carson Holt)
Date: Fri, 23 Jun 2017 04:31:58 +0000
Subject: [maker-devel] Request a favor regarding MAKER
In-Reply-To:
References:
Message-ID: <02EDBAC4-5338-4FE3-99AA-D98CF346A753@genetics.utah.edu>

Sorry for the slow reply. This message somehow got overlooked.

The lock is referring to a file lock. It usually means there is another active MAKER process trying to run in the same directory as your current MAKER process. This may mean you have problems with the MPI setup if you are using MAKER under MPI. Or, if you started MAKER multiple times simultaneously, you got a collision when both tried to work with the same data. Just kill all active MAKER processes and restart if that is the case. If it's an MPI issue, run MAKER with the -h flag added to the current MPI command you are using to run MAKER. If it prints the help message more than once, then the MPI communication ring is having an issue. This could be a problem with how you installed MAKER or how you installed MPI.

--Carson

> On Jun 9, 2017, at 2:39 AM, shaf wrote:
>
> Greetings,
> My name is Shaf and I'm currently using MAKER for my data. I did manage to get some results using MAKER, but I have a problem with my storage.
>
> So I tried to run MAKER in another directory with more space.
> As far as I know, I have already set up MAKER so that it can be run anywhere.
>
> I installed MAKER on /
> Then when I tried to run it on /media/nklee/2TB data/example$ maker, I got an error:
>
> ERROR: The directory is locked. Perhaps by an instance of MAKER.
>
> --> rank=NA, hostname=Lee-Server
>
> I checked it using nklee at Lee-Server:/media/nklee/2TB data$ maker
> and my MAKER is there.
>
> May I know how to solve this problem? Thank you in advance.
>
> Regards,
> Shaf

From tfallon at mit.edu Thu Jun 22 23:33:59 2017
From: tfallon at mit.edu (Tim Fallon)
Date: Fri, 23 Jun 2017 00:33:59 -0400
Subject: [maker-devel] Database disk image is malformed error
In-Reply-To: <925025C6-C48C-4E29-9A9F-A7CD50ECA211@gmail.com>
References: <41AECDEF-7E2B-4143-BDBC-F10937AEDE3B@mit.edu> <925025C6-C48C-4E29-9A9F-A7CD50ECA211@gmail.com>
Message-ID:

Hi Carson,

Thanks for the tip! The issue turned out to be that I needed to use the "-l" parameter for gff3_merge, to automatically rename the IDs when merging them, and also to pass the appropriate evidence in the merged GFF using the "Re-annotation Using MAKER Derived GFF3" parameters. I was using the more general parameters further down (protein_gff, est_gff, etc.). It seems to be working now, though I am still getting the hang of how to fix up misbehaving gene models.

All the best,
-Tim

Timothy R. Fallon
PhD candidate
Laboratory of Jing-Ke Weng
Department of Biology
MIT

tfallon at mit.edu
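The ID clash discussed in this thread (identifiers that are unique within one GFF3 file but collide once several files are merged) can be checked before merging with something like the sketch below. This is only an illustration under simple assumptions: IDs are read from the ID= attribute in column 9, and an ID is reported when it occurs in more than one input file. The function names and sample records are hypothetical, and this is not how gff3_merge itself works.

```python
from collections import Counter


def ids_in_gff(lines):
    """Return the set of ID attribute values found in one GFF3 file's lines."""
    ids = set()
    for line in lines:
        if line.startswith("##FASTA"):
            break  # sequence section, no more features
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        for field in cols[8].split(";"):
            if field.startswith("ID="):
                ids.add(field[3:])
    return ids


def ids_shared_between_files(files):
    """`files` maps a label to an iterable of GFF3 lines; return IDs seen in more than one file."""
    counts = Counter()
    for lines in files.values():
        counts.update(ids_in_gff(lines))  # each file contributes a given ID at most once
    return sorted(i for i, n in counts.items() if n > 1)


# Made-up records for illustration only
run_a = [
    "scaf1\tprotein2genome\tprotein_match\t1\t90\t.\t+\t.\tID=hit1;Name=P1\n",
    "scaf1\tprotein2genome\tprotein_match\t200\t400\t.\t+\t.\tID=hit2;Name=P2\n",
]
run_b = [
    "scaf1\tprotein2genome\tprotein_match\t500\t900\t.\t-\t.\tID=hit1;Name=Q7\n",
]

clashes = ids_shared_between_files({"run_a": run_a, "run_b": run_b})
print(clashes)
```

IDs are counted once per file on purpose, since a multi-line feature may legitimately repeat its ID within a single GFF3 file.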
From tfallon at mit.edu Thu Jun 22 23:59:10 2017
From: tfallon at mit.edu (Tim Fallon)
Date: Fri, 23 Jun 2017 00:59:10 -0400
Subject: [maker-devel] Maker protein match & tandem similar genes
In-Reply-To: <6DBD3F6E-C2DA-47AB-AF6B-DEA396789C79@gmail.com>
References: <6AE52D93-3F7D-4389-BE7B-BD3033F3F316@mit.edu> <6DBD3F6E-C2DA-47AB-AF6B-DEA396789C79@gmail.com>
Message-ID: <7164F2B4-FD8A-466F-B7CA-A1BEEF98223F@mit.edu>

Hi Carson,

Thanks for the response! After sending my initial email, I did notice that this particular issue was warned about in the Campbell et al. 2014 Maker protocols paper. Perhaps future versions of the pipeline might have a workaround or warning for this presumably common issue. At least in my case, the genome I'm annotating has large introns, and also tandem clusters of homologous genes, so I've been unable to solve this issue entirely by changing existing parameters (e.g. split_hit), though perhaps exonerate / protein2genome direct gene annotation does handle it correctly.

Regarding protein2genome only being an intermediate stage: as I've been working towards a final annotation, I've actually been mostly relying on the protein2genome direct gene annotation. Although I have a trained Augustus that is presumably getting the hints from the evidence, my main target genes have been producing subtly wrong gene models (Augustus produced splice sites off by a handful of nucleotides, leading to unintended and unsupported amino acids in the protein). I also trained SNAP, but those predictions were worse than the Augustus predictions.

Do you have any tips for using the Evidence Modeler integration of the Maker 3.0.0 beta? That seems to be the best way to have the final gene models rely more on extrinsic evidence over my mildly incorrect ab initio predictions. Or perhaps PASA is more appropriate for gene models that would strictly adhere to extrinsic de novo assembled transcript / predicted ORF evidence?

All the best,
-Tim

Timothy R. Fallon
PhD candidate
Laboratory of Jing-Ke Weng
Department of Biology
MIT

tfallon at mit.edu
From aravindp at imcb.a-star.edu.sg Fri Jun 23 03:25:18 2017
From: aravindp at imcb.a-star.edu.sg (Aravind PRASAD)
Date: Fri, 23 Jun 2017 08:25:18 +0000
Subject: [maker-devel] Maker annotation of large scaffolds
In-Reply-To:
References:
Message-ID:

Hi All,

Thank you for your inputs. Currently, I'm not using the MPI version but running Maker in multiple instances. Previously, I tried to run the MPI version but it failed, though the installation had no issues with MPI-Maker.

Carson, can you please explain what exactly the blast_depth option does while running Maker?

Thank you all for your time!

Regards,
Aravind.

From: Carson Holt [mailto:carsonhh at gmail.com]
Sent: Friday, 23 June, 2017 12:06 PM
To: Seth Munholland
Cc: Aravind PRASAD; maker-devel at yandell-lab.org
Subject: Re: [maker-devel] Maker annotation of large scaffolds

If running under MPI, the only step that should take a long time would be a final clustering step (the clustering is not parallelized). It should run in well under 24 hours though, so perhaps it is a memory issue or a feature depth issue. You can try running the contig by itself and setting all the blast_depth parameters in maker_bopts.ctl to 10 to help both.

Otherwise, making a large overlap for subdivided contigs (50-100 kb) should be enough. Alternatively, look for stretches of NNNNNN's in the contig and split on those.

--Carson

On Jun 22, 2017, at 9:43 AM, Seth Munholland wrote:

I would think splitting could work if you generate a sufficient overlap, i.e. 1-100k, 50-150k, etc. Reassembling the annotations for the overlap regions may be tricky if you get conflicting annotations though.

Seth Munholland, B.Sc.
Department of Biological Sciences
Rm. 304 Biology Building
University of Windsor
401 Sunset Ave. N9B 3P4
T: (519) 253-3000 Ext: 4755

On Thu, Jun 22, 2017 at 2:39 AM, Aravind PRASAD wrote:

Hi All,

I'm trying to annotate a fish genome using the Maker pipeline. It could finish the annotation for most scaffolds, except 5 of them which are around 100M base pairs in size. The current clusters in our institute have a time limit of 24 hrs for a job, and these scaffolds could not be annotated within that time.

Can you please suggest any other way of finishing the annotation for large scaffolds?

I thought of chunking up the scaffolds to run, but I'm afraid that would split a gene into two.

Thanks for your time.

Regards,
Aravind PRASAD :: Research Officer :: Comparative and Medical Genomics Lab :: Institute of Molecular and Cell Biology (IMCB) :: Agency for Science, Technology and Research (A*STAR)
61 Biopolis Drive :: #5-04 Proteos :: Singapore 138673 :: DID (+65) 6586 9573 :: Fax (+65) 6779 1117 :: http://www.imcb.a-star.edu.sg/
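One of the suggestions quoted in this thread is to split a very large scaffold on stretches of N's rather than at arbitrary positions. A minimal sketch of that idea follows; the function name split_on_n_runs and the min_run threshold of 10 are assumptions chosen for illustration, not MAKER settings. Each piece is returned with its 0-based start and end on the original scaffold, so that annotation coordinates could later be shifted back.

```python
import re


def split_on_n_runs(seq, min_run=10):
    """Split `seq` at runs of at least `min_run` N's.

    Returns a list of (start, end, subsequence) tuples using 0-based,
    half-open coordinates on the original sequence.
    """
    pieces = []
    pos = 0
    for gap in re.finditer(r"[Nn]{%d,}" % min_run, seq):
        if gap.start() > pos:
            pieces.append((pos, gap.start(), seq[pos:gap.start()]))
        pos = gap.end()  # resume after the gap
    if pos < len(seq):
        pieces.append((pos, len(seq), seq[pos:]))
    return pieces


# Toy scaffold: one long gap (12 N's) and one short gap (2 N's) that is left alone
scaffold = "ACGT" + "N" * 12 + "GGCC" + "NN" + "TT"
pieces = split_on_n_runs(scaffold)
for start, end, sub in pieces:
    print(start, end, sub)
```

Short runs of N's are kept inside a piece, on the assumption that small gaps can sit inside a gene while long gaps are safer places to cut.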
From patrick.tranvan at unil.ch Mon Jun 26 04:48:23 2017
From: patrick.tranvan at unil.ch (Patrick Tran Van)
Date: Mon, 26 Jun 2017 09:48:23 +0000
Subject: [maker-devel] Advice on my pipeline
In-Reply-To:
References: <6b029690bace4d3fbae77c0bb1bddce8@prdexch02.ad.unil.ch>,
Message-ID: <1498470630221.84642@unil.ch>

Thanks for your answer.

1) Do you think that adding Augustus training in addition to SNAP at steps 3 and 5 will add more confidence (instead of adding Augustus only for the final round)? I ask because I am using autoAug for this and it took a while to compute.

2) I don't see this option: 'avoid_est_fusion=1'. I tried to add it but got this error:

WARNING: Invalid option 'avoid_est_fusion' in control file maker_opts.ctl

(I am using v2.31.8)

Patrick Tran Van
Groups Chapuisat, Robinson-Rechavi & Schwander
Department of Ecology and Evolution
University of Lausanne
Le Biophore
CH-1015 Lausanne
Switzerland
Office 3206

________________________________
From: Carson Holt
Sent: Monday, June 5, 2017 8:29 PM
To: Patrick Tran Van
Cc: maker-devel at yandell-lab.org
Subject: Re: [maker-devel] Advice on my pipeline

Your plan sounds good. A couple of related notes.

Insect genomes tend to have high gene density, so gene merging will be the primary difficulty. You can avoid merging of mRNA-seq evidence by using options like jaccard_clip in Trinity. Then use avoid_est_fusion=1 inside of MAKER.

Also, it is more convenient to do each run in the same directory rather than supplying the previous run as GFF3 input. MAKER will automatically recycle previous results archived in the run directory when you do this. Using the maker_gff option is really more for getting data into the run from jobs performed a long time ago (so they can't be run in the same directory).

-Carson
From carsonhh at gmail.com Mon Jun 26 16:38:19 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 26 Jun 2017 15:38:19 -0600
Subject: [maker-devel] Advice on my pipeline
In-Reply-To: <1498470630221.84642@unil.ch>
References: <6b029690bace4d3fbae77c0bb1bddce8@prdexch02.ad.unil.ch> <1498470630221.84642@unil.ch>
Message-ID: <696C51C6-5606-4ECB-A8B8-9C077182FFFA@gmail.com>

Sorry, the option is -> correct_est_fusion

It is in the maker_opts.ctl file.

I would use both SNAP and Augustus on a few large contigs, then review the results manually. If one of them is not behaving well, then drop it. If both behave well (i.e. correlate well with evidence alignments), then keep them both.

--Carson

From carsonhh at gmail.com Mon Jun 26 16:48:03 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 26 Jun 2017 15:48:03 -0600
Subject: [maker-devel] Maker annotation of large scaffolds
In-Reply-To:
References:
Message-ID: <77736B2B-643C-417F-AB5E-5C709735E3AF@gmail.com>

All results are kept and placed into the final GFF3 unless you set blast_depth. Basically, most alignments are redundant and can be thrown away early in the process. But MAKER does not do this by default, because most users tend to want to see all evidence. MAKER only uses 10 alignments to build its calculations anyway. The rest are just kept for reference. But the cost of keeping the other alignments around can be substantial if you are in a region with deep evidence depth (I've seen regions with 5,000 - 10,000 evidence alignments for some datasets). So if you set blast_depth, it tells MAKER you are OK with throwing out the extra depth early (MAKER still parses all alignments; it just throws extra ones away as it determines they are not useful or are redundant).
This saves a lot of time and RAM downstream at the cost of losing the alignments in the report. A depth of 10 means that no more than 10 alignments per data source will be kept per locus.

--Carson

From carsonhh at gmail.com Mon Jun 26 16:48:46 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 26 Jun 2017 15:48:46 -0600
Subject: [maker-devel] Maker annotation of large scaffolds
In-Reply-To:
References:
Message-ID:

Also you can run MPI within a single node and not across nodes.
This will still give a performance bonus equal to the MPI process count ?Carson > On Jun 23, 2017, at 2:25 AM, Aravind PRASAD wrote: > > Hi All, > > Thank you for your inputs. Currently, I?m not using the MPI version but running Maker in multiple instances. Previously, I tried to run the MPI version but failed. Though the installation had no issues with MPI-Maker. > > Carson, Can you please explain what exactly does the blast_depth option does while running Maker? > > Thank you all for your time! > > Regards, > Aravind. > > > From: Carson Holt [mailto:carsonhh at gmail.com ] > Sent: Friday, 23 June, 2017 12:06 PM > To: Seth Munholland > Cc: Aravind PRASAD; maker-devel at yandell-lab.org > Subject: Re: [maker-devel] Maker annotation of large scaffolds > > If running under MPI, the only step that should take a long time would be a final clustering step (the clustering is not parallelized). It should run in well under 24 hours though, so perhaps it is a memory issue or a feature depth issue. You can try running the contig by itself and setting all the bast_depth parameters in maker_bopts.ctl to 10 to help both. > > Otherwise making a large overlap for subdivided contigs (50-100kb) should be enough. Alternatively look for streches of NNNNNN?s in the contig and split on those. > > ?Carson > > > > On Jun 22, 2017, at 9:43 AM, Seth Munholland > wrote: > > I would think splitting could work if you generate a sufficient overlap. IE 1-100k, 50-150k, etc. Reassembling the annotations for the overlap regions may be tricky if you get conflicting annotations though. > > Seth Munholland, B.Sc. > Department of Biological Sciences > Rm. 304 Biology Building > University of Windsor > 401 Sunset Ave. N9B 3P4 > T: (519) 253-3000 Ext: 4755 > > On Thu, Jun 22, 2017 at 2:39 AM, Aravind PRASAD > wrote: > Hi All, > > I?m trying to annotate a fish genome using Maker pipeline. It could finish the annotation for maximum scaffolds except 5 of them which are of size around 100M base pairs. 
> The clusters at our institute have a 24-hour time limit per job, and these scaffolds could not be annotated within that time.
> Can you please suggest any other way of finishing the annotation for large scaffolds?
>
> I thought of chunking up the scaffolds to run, but I'm afraid that would split a gene into two.
> Thanks for your time.
>
> Regards,
> Aravind PRASAD :: Research Officer :: Comparative and Medical Genomics Lab :: Institute of Molecular and Cell Biology (IMCB) :: Agency for Science, Technology and Research (A*STAR)
> 61 Biopolis Drive :: #5-04 Proteos :: Singapore 138673 :: DID (+65) 6586 9573 :: Fax (+65) 6779 1117 :: http://www.imcb.a-star.edu.sg/

From carsonhh at gmail.com Mon Jun 26 17:00:24 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 26 Jun 2017 16:00:24 -0600
Subject: [maker-devel] Maker protein match & tandem similar genes
In-Reply-To: <7164F2B4-FD8A-466F-B7CA-A1BEEF98223F@mit.edu>
References: <6AE52D93-3F7D-4389-BE7B-BD3033F3F316@mit.edu> <6DBD3F6E-C2DA-47AB-AF6B-DEA396789C79@gmail.com> <7164F2B4-FD8A-466F-B7CA-A1BEEF98223F@mit.edu>
Message-ID:

Augustus uses an HMM with scoring bonuses for evidence match.
If a difference in the assembly breaks the ORF anywhere in the transcript relative to the evidence, or removes high scoring transcript start/stop sequences, then Augustus will add/skip exons or trim/extend transcripts to capture what scoring bonuses it can as best it can. So wherever you see Augustus behaving weirdly, you likely have something off in the assembly (a small stretch of NN's or single basepair duplications/deletions that affect the ORF and scoring model). So what Augustus produces is the best fit gene model that hops around assembly anomalies while still producing a canonical model. In areas like the ones I describe above, EVM refuses to produce any model. So you can experiment with the EVM options in MAKER3, but what you may find is that problem regions tend to get no models with EVM. I believe using the pred_gff trick I mentioned previously may be the easiest workaround.

Also make sure to prefilter mRNA-seq evidence to avoid transcript joining (Trinity has a jaccard_clip option which can help), because if you are getting transcript joining in proteins, you are almost certain to get it in transcript evidence as well.

--Carson

> On Jun 22, 2017, at 10:59 PM, Tim Fallon wrote:
>
> Hi Carson,
>
> Thanks for the response! After sending my initial email, I did notice this particular issue was warned about in the Campbell et al. 2014 Maker protocols paper. Perhaps future versions of the pipeline might have a workaround or warning for this presumably common issue. At least in my case, the genome I'm annotating has large introns, and also tandem gene clusters of homologous genes, so I've been unable to solve this issue entirely by changing existing parameters (e.g. split_hit), though perhaps exonerate / protein2genome direct gene annotation does handle it correctly.
>
> Regarding protein2genome only being an intermediate stage: as I've been working towards a final annotation, I've actually been relying mostly on the protein2genome direct gene annotation. Although I have a trained Augustus that is presumably getting the hints from the evidence, my main target genes have been producing subtly wrong gene models (Augustus produced splice sites off by a handful of nucleotides, leading to unintended and unsupported amino acids in the protein). I also trained SNAP, but those predictions were worse than the Augustus predictions.
>
> Do you have any tips for using the Evidence Modeler integration of the Maker 3.0.0 beta? That seems to be the best way to have the final gene models rely more on extrinsic evidence than on my mildly incorrect ab initio predictions. Or perhaps PASA is more appropriate for gene models that would strictly adhere to extrinsic de novo assembled transcript / predicted ORF evidence?
>
> All the best,
> -Tim
>
>> On Jun 23, 2017, at 12:27 AM, Carson Holt wrote:
>>
>> The protein_match features are the direct BLASTX results. Because of how BLAST works, if you have neighboring paralogs, it can place HSPs in both, so the final hit ends up being too large. The protein2genome feature is then the result of exonerate polishing these BLAST alignments (this will usually remove false merging and bad exon order). The protein2genome=1 option, on the other hand, just tells MAKER that you want to try to convert the exonerate hits directly into gene models (only do this for training and not for final annotation). One way to drop the BLASTX results may be to filter the GFF3 results to keep only protein2genome features, pass those into protein_gff, and then turn off protein= for the next run. This forces the blastx results to be dropped. You may want to set blast_depth parameters to something like 10 in maker_bopts.ctl before doing this to trim per locus evidence depth to 10 if you are using too much input data.
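
A minimal sketch of the GFF3 filtering step described in the paragraph above. It assumes that MAKER writes the literal string protein2genome into column 2 (source) of its GFF3 output and blastx for the raw BLASTX hits; check a few lines of your own file before relying on this. The function name and the demo records are hypothetical.

```python
def filter_gff3_by_source(lines, source="protein2genome"):
    """Yield the GFF3 version pragma plus feature lines whose column 2 equals `source`."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # embedded sequence follows; there are no more features
        if line.startswith("##gff-version"):
            yield line
            continue
        if not line or line.startswith("#"):
            continue  # skip blank lines and other comments
        cols = line.split("\t")
        if len(cols) == 9 and cols[1] == source:
            yield line


# Hypothetical records: one fused blastx hit and one exonerate-polished hit.
demo = [
    "##gff-version 3",
    "scaf1\tblastx\tprotein_match\t100\t900\t.\t+\t.\tID=b1",
    "scaf1\tprotein2genome\tprotein_match\t100\t400\t.\t+\t.\tID=p1",
    "scaf1\tprotein2genome\tmatch_part\t100\t400\t.\t+\t.\tID=p1.1;Parent=p1",
    "##FASTA",
    ">scaf1",
    "ACGT",
]

kept = list(filter_gff3_by_source(demo))
for record in kept:
    print(record)
```

The filtered file would then be the one given to protein_gff, with protein= left empty for the next run, as the message describes.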
>>
>> --Carson
>>
>>> On Jun 13, 2017, at 11:35 AM, Tim Fallon wrote:
>>>
>>> Hi there,
>>>
>>> I am aligning reference proteins to an insect genome through Maker, in preparation for using the gene models from the protein alignments as evidence to train SNAP (alongside de novo assembled RNA-Seq). I also plan on passing the protein alignments to a future Maker run as hints for SNAP / Augustus.
>>>
>>> I've noticed that the maker blastx "protein_match" feature, which I presume is a result of Maker trying to make the blastx HSPs contiguous to format as a reference for exonerate (this Maker run did have protein2genome turned on), tends to fuse tandem genes from the same gene family. See attached image.
>>>
>>> The red regions highlight two de novo assembled transcripts which I aligned manually, from two genes that are homologous. The top track is the blastx "match_part" features, the bottom track is the blastx "protein_match" features. You can see that the protein_match fuses the two genes, using ~1000 bp in an intervening region that doesn't have blastx HSP support in the blastx "match_part" track. The trick seems to be that a single reference protein has blastx matches on both the left and right gene.
>>>
>>> Clearly this isn't a good gene model to train SNAP with, but would this misannotation screw up the hints passed to pretrained SNAP / Augustus?
>>>
>>> Is there any way to prevent this protein_match fusing of adjacent similar genes from happening? For species that are closer, I've set "eval_blastx" to be a lot more stringent (1e-50), and in that case the genes don't get fused (but, with that level of stringent search, it is more like an orthology search, rather than just annotating general protein similarity). I do have (rare) introns ~1000 bp, so I wouldn't want to change the Maker "split_hit" parameter to be too low.
>>>
>>> All the best,
>>> -Tim
>>>
>>> Timothy R. Fallon
>>> PhD candidate
>>> Laboratory of Jing-Ke Weng
>>> Department of Biology
>>> MIT
>>>
>>> tfallon at mit.edu
>>
>
> Timothy R. Fallon
> PhD candidate
> Laboratory of Jing-Ke Weng
> Department of Biology
> MIT
>
> tfallon at mit.edu

From aravindp at imcb.a-star.edu.sg Tue Jun 27 02:07:50 2017
From: aravindp at imcb.a-star.edu.sg (Aravind PRASAD)
Date: Tue, 27 Jun 2017 07:07:50 +0000
Subject: [maker-devel] Maker annotation of large scaffolds
In-Reply-To: <77736B2B-643C-417F-AB5E-5C709735E3AF@gmail.com>
References: <77736B2B-643C-417F-AB5E-5C709735E3AF@gmail.com>
Message-ID:

Thank you Carson for the explanation. The issue with annotating the large scaffolds is now resolved by using MPI Maker and changing the blast_depth option.

Aravind Prasad.

From: Carson Holt [mailto:carsonhh at gmail.com]
Sent: Tuesday, 27 June, 2017 5:48 AM
To: Aravind PRASAD
Cc: Seth Munholland; maker-devel at yandell-lab.org
Subject: Re: [maker-devel] Maker annotation of large scaffolds

All results are kept and placed into the final GFF3 unless you set blast_depth. Basically, most alignments are redundant and can be thrown away early in the process. But MAKER does not do this by default because most users tend to want to see all evidence. MAKER only uses 10 alignments to build its calculations anyway. The rest are just kept for reference. But the cost of keeping the other alignments around can be substantial if you are in a region with deep evidence depth (I've seen regions with 5,000 - 10,000 evidence alignments for some datasets).
So if you set blast_depth, it tells MAKER you are OK with throwing out the extra depth early (MAKER still parses all alignments; it just throws extra ones away as it determines they are not useful or are redundant). This saves a lot of time and RAM downstream at the cost of losing the alignments in the report. A depth of 10 means that no more than 10 alignments per data source will be kept per locus.

--Carson

On Jun 23, 2017, at 2:25 AM, Aravind PRASAD wrote:

Hi All,

Thank you for your inputs. Currently, I'm not using the MPI version but running Maker in multiple instances. Previously, I tried to run the MPI version but it failed, although MPI-Maker installed without issues.

Carson, can you please explain what exactly the blast_depth option does while running Maker?

Thank you all for your time!

Regards,
Aravind.

From: Carson Holt [mailto:carsonhh at gmail.com]
Sent: Friday, 23 June, 2017 12:06 PM
To: Seth Munholland
Cc: Aravind PRASAD; maker-devel at yandell-lab.org
Subject: Re: [maker-devel] Maker annotation of large scaffolds

If running under MPI, the only step that should take a long time would be a final clustering step (the clustering is not parallelized). It should run in well under 24 hours though, so perhaps it is a memory issue or a feature depth issue. You can try running the contig by itself and setting all the blast_depth parameters in maker_bopts.ctl to 10 to help with both.

Otherwise making a large overlap for subdivided contigs (50-100kb) should be enough. Alternatively look for stretches of NNNNNN's in the contig and split on those.

--Carson

On Jun 22, 2017, at 9:43 AM, Seth Munholland wrote:

I would think splitting could work if you generate a sufficient overlap, e.g. 1-100k, 50-150k, etc. Reassembling the annotations for the overlap regions may be tricky if you get conflicting annotations though.

Seth Munholland, B.Sc.
Department of Biological Sciences
Rm. 304 Biology Building
University of Windsor
401 Sunset Ave. N9B 3P4
T: (519) 253-3000 Ext: 4755

On Thu, Jun 22, 2017 at 2:39 AM, Aravind PRASAD wrote:

Hi All,

I'm trying to annotate a fish genome using the Maker pipeline. It could finish the annotation for most scaffolds except 5 of them, which are around 100M base pairs in size. The clusters at our institute have a 24-hour time limit per job, and these scaffolds could not be annotated within that time.
Can you please suggest any other way of finishing the annotation for large scaffolds?

I thought of chunking up the scaffolds to run, but I'm afraid that would split a gene into two.
Thanks for your time.

Regards,
Aravind PRASAD :: Research Officer :: Comparative and Medical Genomics Lab :: Institute of Molecular and Cell Biology (IMCB) :: Agency for Science, Technology and Research (A*STAR)
61 Biopolis Drive :: #5-04 Proteos :: Singapore 138673 :: DID (+65) 6586 9573 :: Fax (+65) 6779 1117 :: http://www.imcb.a-star.edu.sg/

Note: This message may contain confidential information. If this Email/Fax has been sent to you by mistake, please notify the sender and delete it immediately. Thank you.
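
The two splitting strategies suggested in this thread (overlapping windows such as 1-100k, 50-150k, and splitting on stretches of N's) can be sketched as below. This is only an illustration of the coordinate arithmetic, not something MAKER provides: the function names, the 0-based half-open coordinates, and the min_run threshold are assumptions, and merging the per-chunk annotations back onto the original scaffold is left out.

```python
import re


def overlapping_windows(length, window, overlap):
    """Return 0-based half-open (start, end) windows covering [0, length).

    Consecutive windows share `overlap` bases, so a gene shorter than the
    overlap is fully contained in at least one window.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must be >= 0 and smaller than window")
    if length <= 0:
        return []
    step = window - overlap
    windows = []
    start = 0
    while True:
        end = min(start + window, length)
        windows.append((start, end))
        if end == length:
            break
        start += step
    return windows


def split_on_n_runs(seq, min_run=100):
    """Return (start, end) of the segments between runs of at least `min_run` N's."""
    segments = []
    pos = 0
    for gap in re.finditer(r"[Nn]{%d,}" % min_run, seq):
        if gap.start() > pos:
            segments.append((pos, gap.start()))
        pos = gap.end()
    if pos < len(seq):
        segments.append((pos, len(seq)))
    return segments


# Toy numbers standing in for "1-100k, 50-150k, ..." on a real scaffold.
windows = overlapping_windows(250, window=100, overlap=50)
segments = split_on_n_runs("ACGT" + "N" * 5 + "GGCC" + "NN" + "TT", min_run=5)
print(windows)
print(segments)
```

Splitting on gaps avoids the conflicting-annotation problem in overlap regions, since no gene model can span a long run of N's anyway; the overlap approach is the fallback when a scaffold has no suitable gaps.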

From john at hovedpuden.dk Wed Jun 28 03:54:40 2017
From: john at hovedpuden.dk (John Damm Sørensen)
Date: Wed, 28 Jun 2017 10:54:40 +0200
Subject: [maker-devel] maker with MPI and perl using threads
Message-ID: <9218f472-d9e4-179c-c712-ddf09e97d8fc@hovedpuden.dk>

Hello,

Recently I assisted one of my customers in solving problems with maker using MPI.

It seems that the main reason for the trouble was maker not initializing the MPI environment for thread-safe execution.

In the MPI.pm module you call MPI_Init, whereas for a threaded environment you should call MPI_Init_thread.

I think it would be a good idea to detect whether the user's Perl has thread support and initialize MPI accordingly, or to clearly state that maker is for unthreaded Perl only.

During the debugging we also found that it was beneficial to have the latest mxm.c installed:

https://community.mellanox.com/thread/3439

Best Regards

John Damm Sørensen
IT consultant

From carsonhh at gmail.com Thu Jun 29 15:43:21 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 29 Jun 2017 14:43:21 -0600
Subject: [maker-devel] maker with MPI and perl using threads
In-Reply-To: <9218f472-d9e4-179c-c712-ddf09e97d8fc@hovedpuden.dk>
References: <9218f472-d9e4-179c-c712-ddf09e97d8fc@hovedpuden.dk>
Message-ID:

MAKER doesn't use threads. It uses forks. There are several reasons for this, including that Perl already requires you to use hidden fork operations every time you call system() or open(). So trying to get forks, threads, and MPI together is just a mess. So we stuck with just forks and MPI. With MPI flavors with direct infiniband support you can get weird errors because of how forks affect registered memory when the MPI flavor uses OpenIB libraries. There is also another issue with how Perl wraps malloc calls that affects registered memory.
The best way around both these issues is to disable direct infiniband support, and then set the MPI flavor to use -tcp over the ip-over-infiniband virtual adapter (usually ib0) or use the ethernet adapter. Sometime last year after a system update we also got an OpenMPI error (only on CentOS6) that referred to a file used for Perl threads (which should not even be in use since we are using forks). We worked around that issue by compiling Perl without thread support so that the library couldn't be called.

If you were using OpenMPI and were seeing a reference to Perl threads, then the error you saw may have been related to the latter one I mentioned.

I have tried the MPI_Init_thread option before and ran into issues because of it, without it helping any of the previously mentioned fork related issues. But that was some time ago, so I could try it as a solution to the last issue I mentioned rather than installing a no-thread version of Perl (if I can ever replicate the error, because it went away when we updated from CentOS kernel 6 to kernel 7).

Thanks,
Carson

> On Jun 28, 2017, at 2:54 AM, John Damm Sørensen wrote:
>
> Hello,
>
> Recently I assisted one of my customers in solving problems with maker using MPI.
>
> It seems that the main reason for the trouble was maker not initializing the MPI environment for thread-safe execution.
>
> In the MPI.pm module you call MPI_Init, whereas for a threaded environment you should call MPI_Init_thread.
>
> I think it would be a good idea to detect whether the user's Perl has thread support and initialize MPI accordingly, or to clearly state that maker is for unthreaded Perl only.
>
> During the debugging we also found that it was beneficial to have the latest mxm.c installed:
>
> https://community.mellanox.com/thread/3439
>
> Best Regards
>
> John Damm Sørensen
> IT consultant

From carsonhh at gmail.com Thu Jun 29 15:56:46 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 29 Jun 2017 14:56:46 -0600
Subject: [maker-devel] maker with MPI and perl using threads
In-Reply-To:
References: <9218f472-d9e4-179c-c712-ddf09e97d8fc@hovedpuden.dk>
Message-ID:

Also, when you see calls like threads->new() in the maker script, it is creating a fork and not a thread. It's a convenience feature activated by the "use forks" line at the top of the script. It allows you to use thread-like syntax when working with forks.

--Carson

> On Jun 29, 2017, at 2:43 PM, Carson Holt wrote:
>
> MAKER doesn't use threads. It uses forks. There are several reasons for this, including that Perl already requires you to use hidden fork operations every time you call system() or open(). So trying to get forks, threads, and MPI together is just a mess. So we stuck with just forks and MPI. With MPI flavors with direct infiniband support you can get weird errors because of how forks affect registered memory when the MPI flavor uses OpenIB libraries. There is also another issue with how Perl wraps malloc calls that affects registered memory. The best way around both these issues is to disable direct infiniband support, and then set the MPI flavor to use -tcp over the ip-over-infiniband virtual adapter (usually ib0) or use the ethernet adapter. Sometime last year after a system update we also got an OpenMPI error (only on CentOS6) that referred to a file used for Perl threads (which should not even be in use since we are using forks).
> We worked around that issue by compiling Perl without thread support so that the library couldn't be called.
>
> If you were using OpenMPI and were seeing a reference to Perl threads, then the error you saw may have been related to the latter one I mentioned.
>
> I have tried the MPI_Init_thread option before and ran into issues because of it, without it helping any of the previously mentioned fork related issues. But that was some time ago, so I could try it as a solution to the last issue I mentioned rather than installing a no-thread version of Perl (if I can ever replicate the error, because it went away when we updated from CentOS kernel 6 to kernel 7).
>
> Thanks,
> Carson
>
>> On Jun 28, 2017, at 2:54 AM, John Damm Sørensen wrote:
>>
>> Hello,
>>
>> Recently I assisted one of my customers in solving problems with maker using MPI.
>>
>> It seems that the main reason for the trouble was maker not initializing the MPI environment for thread-safe execution.
>>
>> In the MPI.pm module you call MPI_Init, whereas for a threaded environment you should call MPI_Init_thread.
>>
>> I think it would be a good idea to detect whether the user's Perl has thread support and initialize MPI accordingly, or to clearly state that maker is for unthreaded Perl only.
>>
>> During the debugging we also found that it was beneficial to have the latest mxm.c installed:
>>
>> https://community.mellanox.com/thread/3439
>>
>> Best Regards
>>
>> John Damm Sørensen
>> IT consultant

From qlian003 at ucr.edu Fri Jun 30 14:30:19 2017
From: qlian003 at ucr.edu (Qihua Liang)
Date: Fri, 30 Jun 2017 12:30:19 -0700
Subject: [maker-devel] Possible ways to improve annotated gene numbers
Message-ID: <82A335D3-085B-414E-802C-8E2918EA7EA0@ucr.edu>

Dear Maker Development Team,

Hi, I am using Maker for annotation and BUSCO to evaluate the completeness. For de novo predictions, I am using Augustus, GeneMark, and SNAP, and the annotated proteins have completeness of ~80%, ~50%, and ~50% respectively. When I concatenate all de novo annotated proteins from these three tools, the completeness is much higher, at ~92%. But for all.maker.proteins.fasta, the completeness is only ~80%.

1. Does this mean that some proteins annotated by Augustus/GeneMark/SNAP are not included in the file all.maker.proteins.fasta? Is it because such excluded proteins do not have hits to the EST evidence?

2. What possible ways are there to achieve a higher BUSCO completeness? Including more EST evidence from other species?

Thank you
Qihua

From Patrick.TranVan at unil.ch Fri Jun 2 03:56:30 2017
From: Patrick.TranVan at unil.ch (Patrick Tran Van)
Date: Fri, 2 Jun 2017 09:56:30 +0000
Subject: [maker-devel] Advice on my pipeline
Message-ID: <6b029690bace4d3fbae77c0bb1bddce8@prdexch02.ad.unil.ch>

Hello,

This is my first time running Maker for an insect genome annotation.
I have found various resources and tried to make a consensus. I am looking for your thoughts and advice about my pipeline, whether I can improve something or am doing useless things:

What I have:
- RNA evidence: transcriptome
- Protein evidence: swissprot/uniprot + busco protein set of insect
- Cegma and busco results of my genome

1) Train SNAP with CEGMA

2) Run (run A) maker with repeat masking with transcript, protein, the new SNAP file (from step 1) and augustus file (from busco).

3) Create SNAP model from run A.

4) Run (run B) with the new SNAP (done at step 3) with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (maker_gff=run_A.gff), turn off repeat masking (rm_pass=1), and use previous mapping results (altest_pass=1 and protein_pass=1).

5) Create SNAP model from run B.

6) Run (run C) with the new SNAP (done at step 5) with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (maker_gff=run_B.gff), turn off repeat masking (rm_pass=1), and use previous mapping results (altest_pass=1 and protein_pass=1).

7) Create SNAP model from run C AND create Augustus gene model from run C

8) Run (run D) with the new SNAP (done at step 7) + AUGUSTUS file (step 7) with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (maker_gff=run_C.gff), turn off repeat masking (rm_pass=1), and use previous mapping results (altest_pass=1 and protein_pass=1). + Use keep_preds=1

Does it seem coherent?

Cheers,

Patrick Tran Van

Groups Chapuisat, Robinson-Rechavi & Schwander
Department of Ecology and Evolution
University of Lausanne
Le Biophore
CH-1015 Lausanne
Switzerland
Office 3206
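
Steps 4, 6 and 8 above change the same handful of control-file options between runs, so they can be generated rather than edited by hand. The sketch below rewrites `key=value #comment` lines in control-file text. The option names est2genome, protein2genome, maker_gff, rm_pass, altest_pass and protein_pass are the ones named in the message; snaphmm is assumed to be the option that points MAKER at a SNAP HMM, and the template text, the helper name and the file names are hypothetical.

```python
import re


def apply_overrides(ctl_text, overrides):
    """Return control-file text with the given options replaced.

    Lines are expected to look like `key=value #comment`; the trailing
    comment is preserved. Raises KeyError if an option is not present.
    """
    remaining = dict(overrides)
    out = []
    for line in ctl_text.splitlines():
        m = re.match(r"^(\w+)=([^#]*)(#.*)?$", line)
        if m and m.group(1) in remaining:
            key = m.group(1)
            comment = m.group(3) or ""
            out.append(f"{key}={remaining.pop(key)} {comment}".rstrip())
        else:
            out.append(line)
    if remaining:
        raise KeyError(f"options not found in control file: {sorted(remaining)}")
    return "\n".join(out) + "\n"


# Hypothetical excerpt of a control file as left after run A.
template = (
    "maker_gff= #MAKER derived GFF3 file\n"
    "altest_pass=0 #reuse alternate-organism transcript alignments\n"
    "protein_pass=0 #reuse protein alignments\n"
    "rm_pass=0 #reuse repeat masking\n"
    "#-----Gene Prediction\n"
    "snaphmm=cegma.snap.hmm #SNAP HMM file (assumed option name)\n"
    "est2genome=1 #gene models directly from transcripts\n"
    "protein2genome=1 #gene models directly from proteins\n"
)

# Step 4 (run B) as described in the message.
run_b = apply_overrides(template, {
    "maker_gff": "run_A.gff",
    "altest_pass": 1,
    "protein_pass": 1,
    "rm_pass": 1,
    "snaphmm": "run_A.snap.hmm",
    "est2genome": 0,
    "protein2genome": 0,
})
print(run_b)
```

Runs C and D would reuse the same call with run_B.gff / run_C.gff and the newer HMM file.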

From carson.holt at genetics.utah.edu Mon Jun 5 12:24:47 2017
From: carson.holt at genetics.utah.edu (Carson Holt)
Date: Mon, 5 Jun 2017 18:24:47 +0000
Subject: [maker-devel] Plant genome annotation
In-Reply-To:
References:
Message-ID: <5DD47274-C5FA-404D-A7EC-AADE0325EA03@genetics.utah.edu>

MAKER wiki -> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014

Book chapter on MAKER protocol -> http://www.yandell-lab.org/publications/pdf/maker_current_protocols.pdf

Mailing list -> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org

Searchable archive of common maker related questions -> https://groups.google.com/forum/#!forum/maker-devel

--Carson

On Jun 5, 2017, at 8:18 AM, Muhammad Arslan wrote:

Dear Carson,

I am writing this email to ask a favor of you regarding the usage of Maker-P. I want to use the application for plant genome annotation but have very little knowledge of how to do so! Is there any step-by-step tutorial available?

I would be very thankful to you!

Best regards
--
Muhammad Arslan
PhD Student / Guest Scientist
Department of Environmental Biotechnology
Helmholtz Centre for Environmental Research - UFZ
Permoserstraße 15, 04318 Leipzig, Germany
Phone +49,341,235 1696, muhammad.arslan at ufz.de, www.ufz.de

Registered Office: Leipzig
Registration Office: Amtsgericht Leipzig
Trade Register No.: B 4703
Chairman of the Supervisory Board: MinDirig Wilfried Kraus
Scientific Managing Director: Prof. Georg Teutsch
Administrative Managing Director: Prof. Dr. Heike Grassmann

From carsonhh at gmail.com Mon Jun 5 12:29:57 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 5 Jun 2017 12:29:57 -0600
Subject: [maker-devel] Advice on my pipeline
In-Reply-To: <6b029690bace4d3fbae77c0bb1bddce8@prdexch02.ad.unil.ch>
References: <6b029690bace4d3fbae77c0bb1bddce8@prdexch02.ad.unil.ch>
Message-ID:

Your plan sounds good. A couple of related notes. Insect genomes tend to have high gene density, so gene merging will be the primary difficulty. You can avoid merging of mRNA-seq evidence by using options like jaccard_clip in Trinity. Then use avoid_est_fusion=1 inside of MAKER.

Also it is more convenient to do each run in the same directory rather than supplying the previous run as GFF3 input. MAKER will automatically recycle previous results archived in the run directory when you do this. The maker_gff option is really more for getting data into the run from jobs performed a long time ago (so they can't be run in the same directory).

--Carson

> On Jun 2, 2017, at 3:56 AM, Patrick Tran Van wrote:
>
> Hello,
>
> This is my first time running Maker for an insect genome annotation.
>
> I have found various resources and tried to make a consensus. I am looking for your thoughts and advice about my pipeline, whether I can improve something or am doing useless things:
>
> What I have:
> - RNA evidence: transcriptome
> - Protein evidence: swissprot/uniprot + busco protein set of insect
> - Cegma and busco results of my genome
>
> 1) Train SNAP with CEGMA
>
> 2) Run (run A) maker with repeat masking with transcript, protein, the new SNAP file (from step 1) and augustus file (from busco).
>
> 3) Create SNAP model from run A.
>
> 4) Run (run B) with the new SNAP (done at step 3) with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (maker_gff=run_A.gff), turn off repeat masking (rm_pass=1), and use previous mapping results (altest_pass=1 and protein_pass=1).
>
> 5) Create SNAP model from run B.
>
> 6) Run (run C) with the new SNAP (done at step 5) with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (maker_gff=run_B.gff), turn off repeat masking (rm_pass=1), and use previous mapping results (altest_pass=1 and protein_pass=1).
>
> 7) Create SNAP model from run C AND create Augustus gene model from run C
>
> 8) Run (run D) with the new SNAP (done at step 7) + AUGUSTUS file (step 7) with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (maker_gff=run_C.gff), turn off repeat masking (rm_pass=1), and use previous mapping results (altest_pass=1 and protein_pass=1). + Use keep_preds=1
>
> Does it seem coherent?
>
> Cheers,
>
> Patrick Tran Van
>
> Groups Chapuisat, Robinson-Rechavi & Schwander
> Department of Ecology and Evolution
> University of Lausanne
> Le Biophore
> CH-1015 Lausanne
> Switzerland
> Office 3206

From glenna.kramer at utoronto.ca Wed Jun 21 09:51:11 2017
From: glenna.kramer at utoronto.ca (Glenna Kramer)
Date: Wed, 21 Jun 2017 15:51:11 +0000
Subject: [maker-devel] How to address errors encountered in process of submitting genome
Message-ID: <4781C7F0FC2DAA4BBC18FC44DC9D09AE015065617A@ArborExMBx4P.UTORARBOR.UTORAD.Utoronto.ca>

Hi there,

I am attempting to submit a fungal genome to NCBI and have run into quite a few errors running tbl2asn.
I know this isn't directly related to MAKER, but I'm hoping that someone here has been through this process and would be able to give some insight (or at least point me in the direction of another knowledgeable source)!

Here is a general overview of the process that I have used so far:

1. Converted MAKER GFF3 files to tbl files using GAG (ran options remove_introns_shorter_than 10 and fix_start_stop and added functional annotation as well). This seems to work well to convert the GFF3 to tbl.

2. Used tbl2asn to convert the tbl to an sqn file. This also seems to work, but I am getting lots of errors in the .val output file, which I am unsure how to address. There are...

  56 ERROR: SEQ_FEAT.BadTrailingHyphen
1525 ERROR: SEQ_FEAT.InternalStop
1142 ERROR: SEQ_FEAT.NoStop
  10 ERROR: SEQ_FEAT.PartialProblem
2368 ERROR: SEQ_FEAT.StartCodon
2368 ERROR: SEQ_INST.BadProteinStart
1525 ERROR: SEQ_INST.StopInProtein
8821 WARNING: SEQ_FEAT.NotSpliceConsensusAcceptor
8543 WARNING: SEQ_FEAT.NotSpliceConsensusDonor
  83 WARNING: SEQ_FEAT.PartialProblem
   1 WARNING: SEQ_FEAT.ProteinNameEndsInBracket
  19 WARNING: SEQ_FEAT.ShortExon
 270 INFO: SEQ_FEAT.RareSpliceConsensusDonor

Also, just as a side note, has anyone tried the new table2asn_GFF converter that is up to convert GFF3 directly to sqn? I was thinking that I would give that a shot, hoping that it would help with some of these errors. However, I was instantly met with an error as well: "Too many positional arguments (1), the offending value: ends."

Thank you so much in advance for any help you are able to give!

Glenna
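
To see where the bulk of the problems lies, the summary quoted above can be tallied by severity and ranked by count. This is a small sketch that parses only the `count SEVERITY: code` lines exactly as they appear in the message; it does not read the detailed .val file, and the function name is hypothetical.

```python
import re
from collections import Counter

SUMMARY = """\
56 ERROR: SEQ_FEAT.BadTrailingHyphen
1525 ERROR: SEQ_FEAT.InternalStop
1142 ERROR: SEQ_FEAT.NoStop
10 ERROR: SEQ_FEAT.PartialProblem
2368 ERROR: SEQ_FEAT.StartCodon
2368 ERROR: SEQ_INST.BadProteinStart
1525 ERROR: SEQ_INST.StopInProtein
8821 WARNING: SEQ_FEAT.NotSpliceConsensusAcceptor
8543 WARNING: SEQ_FEAT.NotSpliceConsensusDonor
83 WARNING: SEQ_FEAT.PartialProblem
1 WARNING: SEQ_FEAT.ProteinNameEndsInBracket
19 WARNING: SEQ_FEAT.ShortExon
270 INFO: SEQ_FEAT.RareSpliceConsensusDonor
"""

LINE = re.compile(r"^\s*(\d+)\s+(ERROR|WARNING|INFO):\s+(\S+)\s*$")


def tally(summary_text):
    """Return (totals per severity, list of (count, severity, code) sorted by count)."""
    by_severity = Counter()
    entries = []
    for line in summary_text.splitlines():
        m = LINE.match(line)
        if not m:
            continue  # ignore anything that is not a summary line
        count, severity, code = int(m.group(1)), m.group(2), m.group(3)
        by_severity[severity] += count
        entries.append((count, severity, code))
    entries.sort(reverse=True)
    return by_severity, entries


totals, ranked = tally(SUMMARY)
print(dict(totals))
for count, severity, code in ranked[:3]:
    print(count, severity, code)
```

In this summary the counts for SEQ_FEAT.InternalStop and SEQ_INST.StopInProtein match (1525 each), as do SEQ_FEAT.StartCodon and SEQ_INST.BadProteinStart (2368 each), which suggests, though the summary alone cannot confirm it, that each pair is raised by the same set of CDS features.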

From jason.stajich at gmail.com Wed Jun 21 12:25:43 2017
From: jason.stajich at gmail.com (Jason Stajich)
Date: Wed, 21 Jun 2017 18:25:43 +0000
Subject: [maker-devel] How to address errors encountered in process of submitting genome
In-Reply-To: <4781C7F0FC2DAA4BBC18FC44DC9D09AE015065617A@ArborExMBx4P.UTORARBOR.UTORAD.Utoronto.ca>
References: <4781C7F0FC2DAA4BBC18FC44DC9D09AE015065617A@ArborExMBx4P.UTORARBOR.UTORAD.Utoronto.ca>
Message-ID:

Glenna - FWIW - I've switched to doing an EVM cleanup with funannotate after MAKER due to these issues with MAKER and fungal genomes I submit.

Jason

On Wed, Jun 21, 2017 at 8:51 AM Glenna Kramer wrote:

> Hi there,
>
> I am attempting to submit a fungal genome to NCBI and have run into quite a few errors running tbl2asn. I know this isn't directly related to MAKER, but I'm hoping that someone here has been through this process and would be able to give some insight (or at least point me in the direction of another knowledgeable source)!
>
> Here is a general overview of the process that I have used so far:
> 1. Converted MAKER GFF3 files to tbl files using GAG (ran options remove_introns_shorter_than 10 and fix_start_stop and added functional annotation as well). This seems to work well to convert the GFF3 to tbl.
> 2. Used tbl2asn to convert the tbl to an sqn file. This also seems to work, but I am getting lots of errors in the .val output file, which I am unsure how to address. There are...
> > 56 ERROR: SEQ_FEAT.BadTrailingHyphen > 1525 ERROR: SEQ_FEAT.InternalStop > 1142 ERROR: SEQ_FEAT.NoStop > 10 ERROR: SEQ_FEAT.PartialProblem > 2368 ERROR: SEQ_FEAT.StartCodon > 2368 ERROR: SEQ_INST.BadProteinStart > 1525 ERROR: SEQ_INST.StopInProtein > 8821 WARNING: SEQ_FEAT.NotSpliceConsensusAcceptor > 8543 WARNING: SEQ_FEAT.NotSpliceConsensusDonor > 83 WARNING: SEQ_FEAT.PartialProblem > 1 WARNING: SEQ_FEAT.ProteinNameEndsInBracket > 19 WARNING: SEQ_FEAT.ShortExon > 270 INFO: SEQ_FEAT.RareSpliceConsensusDonor > > Also, just as a side note, has anyone tried the new table2asn_GFF > converter that is up to convert GFF3 directly to sqn? I was thinking that > I would give that a shot hoping that it would help with some of these > errors. However, I was instantly met with an error as well. "Too many > positional arguments (1), the offending value: ends." > > Thank you so much in advance for any help you are able to give! > > Glenna > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > -------------- next part -------------- An HTML attachment was scrubbed... URL: From glenna.kramer at utoronto.ca Wed Jun 21 20:33:15 2017 From: glenna.kramer at utoronto.ca (Glenna Kramer) Date: Thu, 22 Jun 2017 02:33:15 +0000 Subject: [maker-devel] How to address errors encountered in process of submitting genome In-Reply-To: References: <4781C7F0FC2DAA4BBC18FC44DC9D09AE015065617A@ArborExMBx4P.UTORARBOR.UTORAD.Utoronto.ca>, Message-ID: <4781C7F0FC2DAA4BBC18FC44DC9D09AE01506571C4@ArborExMBx4P.UTORARBOR.UTORAD.Utoronto.ca> Thank you for the thought! So, to clarify do you use funannotate predict on the maker gff files, similar to the last example given here? https://github.com/nextgenusfs/funannotate/wiki. I'm completely willing to give it a shot... Is brings up other questions for me, though. How do you do your functional annotation? Maker? 
I noticed that funannotate will do functional annotation, but currently I was adding in my functional annotation using GAG when I was converting the maker gff to tbl.

Also, from what I understand, funannotate will output a gbk from the gff. Do you have a particular file conversion tool to get that into the sqn format that you've had success with?

Thanks,
Glenna

From jason.stajich at gmail.com Wed Jun 21 22:09:50 2017
From: jason.stajich at gmail.com (Jason Stajich)
Date: Thu, 22 Jun 2017 04:09:50 +0000
Subject: [maker-devel] How to address errors encountered in process of submitting genome
In-Reply-To: <4781C7F0FC2DAA4BBC18FC44DC9D09AE01506571C4@ArborExMBx4P.UTORARBOR.UTORAD.Utoronto.ca>
References: <4781C7F0FC2DAA4BBC18FC44DC9D09AE015065617A@ArborExMBx4P.UTORARBOR.UTORAD.Utoronto.ca> <4781C7F0FC2DAA4BBC18FC44DC9D09AE01506571C4@ArborExMBx4P.UTORARBOR.UTORAD.Utoronto.ca>

Quick answers for now.

A) You can feed the maker gff to funannotate or run it alone.

B) I run the annotate step in funannotate but generally transfer only swissprot annots as product desc. You have to manually edit to remove systematic ORF names in the product desc that NCBI will flag - e.g. YAL001W, AN1234, ARB_xx.
You have to edit the annotations.swissprot.txt file to use the product descriptor if you want to promote these to full product descriptions in the resulting .tbl file. You may want to run iprscan locally, or wait for it to run remotely, to get GO assignments included.

C) You get .tbl and .fsa files from GAG, and these are processed by tbl2asn to get the sqn file. All are produced in the result file. All automatic.

Jason

--
Jason Stajich
jason.stajich at gmail.com
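Note on point B above: screening product descriptions for embedded systematic ORF names can be scripted before the manual edit. The sketch below is a minimal illustration. The regular expression is a hypothetical pattern modeled only on the three examples in the message (YAL001W, AN1234, ARB_xx); it is not NCBI's actual rule set, and the function name and sample descriptions are made up for the example.

```python
import re

# Hypothetical patterns based on the examples in the message; illustrative only.
SYSTEMATIC_NAME_RE = re.compile(
    r"\b(?:Y[A-P][LR]\d{3}[WC](?:-[A-Z])?|AN\d{4,}|ARB_\d+)\b"
)


def flag_products(products):
    """Return (product, matches) pairs for descriptions containing a systematic name."""
    flagged = []
    for product in products:
        hits = SYSTEMATIC_NAME_RE.findall(product)
        if hits:
            flagged.append((product, hits))
    return flagged


# Made-up product descriptions for demonstration.
examples = [
    "Protein YAL001W homolog",
    "Uncharacterized protein AN1234",
    "Heat shock protein 70",
    "Putative transporter ARB_01234",
]

flagged = flag_products(examples)
for product, hits in flagged:
    print(f"{product!r} contains {hits}")
```

In practice the pattern list would need extending for whichever reference organisms contribute annotations, so treat the output as a list of candidates to review rather than a definitive filter.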
From tfallon at mit.edu Tue Jun 13 11:35:28 2017
From: tfallon at mit.edu (Tim Fallon)
Date: Tue, 13 Jun 2017 13:35:28 -0400
Subject: [maker-devel] Maker protein match & tandem similar genes
Message-ID: <6AE52D93-3F7D-4389-BE7B-BD3033F3F316@mit.edu>

Hi there,

I am aligning reference proteins to an insect genome through Maker, in preparation for using the gene models from the protein alignments as evidence to train SNAP (alongside de novo assembled RNA-Seq). I also plan on passing the protein alignments to a future Maker run as hints for SNAP / Augustus.

I've noticed that the maker blastx "protein_match" feature, which I presume is a result of Maker trying to make the blastx HSPs contiguous to format as a reference for exonerate (this Maker run did have protein2genome turned on), tends to fuse tandem genes from the same gene family. See attached image.

The red regions highlight two de novo assembled transcripts which I aligned manually, from two genes that are homologous. The top track is the blastx "match_part" features, the bottom track is the blastx "protein_match" features. You can see that the protein_match fuses the two genes, using ~1000 bp in an intervening region that doesn't have blastx HSP support in the blastx "match_part" track. The trick seems to be that a single reference protein has blastx matches on both the left and right gene.

Clearly this isn't a good gene model to train SNAP with, but would this misannotation screw up the hints passed to pretrained SNAP / Augustus?

Is there any way to prevent this protein_match fusing of adjacent similar genes from happening? For species that are closer, I've set the "eval_blastx" to be a lot more stringent (1e-50), and in that case the genes don't get fused (but with that level of stringent search, it is more like an orthology search, rather than just annotating general protein similarity). I do have (rare) introns ~1000 bp, so I wouldn't want to change the Maker "split_hit" parameter to be too low.
All the best,
-Tim

Timothy R. Fallon
PhD candidate
Laboratory of Jing-Ke Weng
Department of Biology
MIT
tfallon at mit.edu

-------------- next part --------------
A non-text attachment was scrubbed...
Name: protein_match_example.png
Type: image/png
Size: 142379 bytes
Desc: not available

From tfallon at mit.edu Fri Jun 16 09:07:14 2017
From: tfallon at mit.edu (Tim Fallon)
Date: Fri, 16 Jun 2017 11:07:14 -0400
Subject: [maker-devel] Database disk image is malformed error
Message-ID: <41AECDEF-7E2B-4143-BDBC-F10937AEDE3B@mit.edu>

Hi there,

I've been running MAKER in a two-stage way using MPI, to annotate a de novo insect genome. By two-stage, I mean that for stage 1 I have a lot of independent folders / maker runs (e.g. individual reference insect proteomes passed as FASTA with protein2genome=1), and then for stage 2, in a separate folder, I am concatenating all that evidence from stage 1 (using gff3_merge -o) and passing it as GFF parameters.

Stage 2 has been crashing. It takes a very long time to set up the SQLite DB (~24 hours, with 39 MPI CPUs), and then once it is all loaded it works for a couple of seconds and then crashes with things like this:

"DBD::SQLite::db selectcol_arrayref failed: database disk image is malformed at /lab/solexa_weng/testtube/maker_3.00_beta/bin/../lib/GFFDB.pm line 525."

I am passing a lot of evidence to stage 2, probably more than people typically pass (the GFFs together are 44GB, whereas the resulting *.db file is 95G).

Have you seen this error before? I'm thinking it could be one of a couple of possibilities:
1) Running up against SQLite size / concurrency constraints where the .db ends up being malformed due to MPI / passing too much evidence.
Solution -> Load GFFs without MPI, or load less evidence.
2) GFFs are malformed (they pass validation with GT). Solution -> Remove the malformed GFF evidence, although I haven't been able to track any malformed GFFs down.
3) Identifiers in the GFF that are unique within a single file become non-unique once the files are merged. Solution -> Manually rename IDs in passed GFF files to be unique.

Thoughts?

All the best,
-Tim

Timothy R. Fallon
PhD candidate
Laboratory of Jing-Ke Weng
Department of Biology
MIT
tfallon at mit.edu

From aravindp at imcb.a-star.edu.sg Thu Jun 22 00:39:28 2017
From: aravindp at imcb.a-star.edu.sg (Aravind PRASAD)
Date: Thu, 22 Jun 2017 06:39:28 +0000
Subject: [maker-devel] Maker annotation of large scaffolds

Hi All,

I'm trying to annotate a fish genome using the Maker pipeline. It finished the annotation for most scaffolds, except 5 of them which are around 100M base pairs in size. The cluster at our institute has a time limit of 24 hrs per job, and these scaffolds could not be annotated within that time.

Can you please suggest any other way of finishing the annotation for large scaffolds?

I thought of chunking up the scaffolds to run, but I'm afraid that would split a gene in two.

Thanks for your time.

Regards,
Aravind PRASAD :: Research Officer :: Comparative and Medical Genomics Lab :: Institute of Molecular and Cell Biology (IMCB) :: Agency for Science, Technology and Research (A*STAR)
61 Biopolis Drive :: #5-04 Proteos :: Singapore 138673 :: DID (+65) 6586 9573 :: Fax (+65) 6779 1117 :: http://www.imcb.a-star.edu.sg/
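Note on the chunking idea raised above: one way to reduce the risk of cutting a gene in two is to cut overlapping windows, so that a gene broken at one window edge lies whole inside the neighbouring window. The sketch below only computes window coordinates. The function name and the sizes in the example are illustrative assumptions, not MAKER settings, and annotations that fall inside overlaps would still need to be reconciled afterwards.

```python
def overlapping_windows(length, window, overlap):
    """Yield 1-based inclusive (start, end) chunks that cover `length` bp.

    Consecutive chunks share `overlap` bp; the last chunk may be shorter.
    """
    if length < 1:
        raise ValueError("length must be at least 1")
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    step = window - overlap
    start = 1
    while True:
        end = min(start + window - 1, length)
        yield start, end
        if end == length:
            break
        start += step


# Illustrative sizes only: a 250 kb scaffold, 100 kb windows, 50 kb overlap.
chunks = list(overlapping_windows(250_000, 100_000, 50_000))
for start, end in chunks:
    print(f"{start}-{end}")
```

The overlap should be chosen to exceed the longest gene expected in the genome; otherwise a very long gene could still be truncated in every window that touches it.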
From munholl at uwindsor.ca Thu Jun 22 09:43:22 2017
From: munholl at uwindsor.ca (Seth Munholland)
Date: Thu, 22 Jun 2017 11:43:22 -0400
Subject: [maker-devel] Maker annotation of large scaffolds

I would think splitting could work if you generate a sufficient overlap, i.e. 1-100k, 50-150k, etc. Reassembling the annotations for the overlap regions may be tricky if you get conflicting annotations, though.

Seth Munholland, B.Sc.
Department of Biological Sciences
Rm. 304 Biology Building
University of Windsor
401 Sunset Ave. N9B 3P4
T: (519) 253-3000 Ext: 4755

From carsonhh at gmail.com Thu Jun 22 22:06:00 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 22 Jun 2017 22:06:00 -0600
Subject: [maker-devel] Maker annotation of large scaffolds

If running under MPI, the only step that should take a long time would be a final clustering step (the clustering is not parallelized). It should run in well under 24 hours though, so perhaps it is a memory issue or a feature depth issue. You can try running the contig by itself and setting all the blast_depth parameters in maker_bopts.ctl to 10 to help with both. Otherwise, making a large overlap for subdivided contigs (50-100kb) should be enough. Alternatively, look for stretches of NNNNNN's in the contig and split on those.

--Carson

From carsonhh at gmail.com Thu Jun 22 22:15:09 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 22 Jun 2017 22:15:09 -0600
Subject: [maker-devel] Database disk image is malformed error
In-Reply-To: <41AECDEF-7E2B-4143-BDBC-F10937AEDE3B@mit.edu>
References: <41AECDEF-7E2B-4143-BDBC-F10937AEDE3B@mit.edu>
Message-ID: <925025C6-C48C-4E29-9A9F-A7CD50ECA211@gmail.com>

Don't use the GFF3 as input to the second stage. Use the original work directory, and just modify any parameters in the control file. MAKER will reuse old results and only delete things that require a rerun.
Using the GFF3 as input is just a way to reuse MAKER data when the work directory is no longer available, and in most cases you will only pass in the genes (and not the evidence in the GFF3).

Thanks,
Carson

From carsonhh at gmail.com Thu Jun 22 22:27:02 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 22 Jun 2017 22:27:02 -0600
Subject: [maker-devel] Maker protein match & tandem similar genes
In-Reply-To: <6AE52D93-3F7D-4389-BE7B-BD3033F3F316@mit.edu>
References: <6AE52D93-3F7D-4389-BE7B-BD3033F3F316@mit.edu>
Message-ID: <6DBD3F6E-C2DA-47AB-AF6B-DEA396789C79@gmail.com>

The protein_match features are the direct BLASTX results. Because of how BLAST works, if you have neighboring paralogs, it can place HSPs in both, so the final hit ends up being too large. The protein2genome feature is then the result of exonerate polishing these BLAST alignments (this will usually remove false merging and bad exon order). The protein2genome=1 option, on the other hand, just tells MAKER that you want to try to convert the exonerate hits directly into gene models (only do this for training and not final annotation).

One way to drop the BLASTX results may be to filter the GFF3 results to keep only protein2genome features, pass those into protein_gff, and then turn off protein= for the next run. This forces the blastx results to be dropped. You may want to set the blast_depth parameters to something like 10 in maker_bopts.ctl before doing this, to trim per-locus evidence depth to 10 if you are using too much input data.

--Carson
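Note on the filtering step described above: keeping only the protein2genome features can be done with a few lines of script. The sketch below assumes that the GFF3 carries the producing program in column 2 (for example "blastx" for raw BLASTX hits and "protein2genome" for exonerate-polished alignments); check this against your own file before relying on it. The function name and the sample records are made up for illustration.

```python
import io


def keep_source(gff_lines, source="protein2genome"):
    """Yield comment/header lines plus feature lines whose column 2 equals `source`.

    Stops at a ##FASTA directive so embedded sequence is not copied.
    """
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break
        if line.startswith("#"):
            yield line
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9 and cols[1] == source:
            yield line


# Made-up records for demonstration (tab-separated, nine columns).
SAMPLE = (
    "##gff-version 3\n"
    "scaf1\tblastx\tprotein_match\t100\t900\t50\t+\t.\tID=hit1;Name=P1\n"
    "scaf1\tblastx\tmatch_part\t100\t300\t50\t+\t.\tID=hit1:hsp1;Parent=hit1\n"
    "scaf1\tprotein2genome\tprotein_match\t100\t400\t80\t+\t.\tID=hit2;Name=P1\n"
    "scaf1\tprotein2genome\tmatch_part\t100\t400\t80\t+\t.\tID=hit2:hsp1;Parent=hit2\n"
)

kept = list(keep_source(io.StringIO(SAMPLE)))
print("".join(kept), end="")
```

With a real file, the same generator can be fed an open file handle and its output written to a new GFF3, which would then be the file given to protein_gff.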
From carson.holt at genetics.utah.edu Thu Jun 22 22:31:58 2017
From: carson.holt at genetics.utah.edu (Carson Holt)
Date: Fri, 23 Jun 2017 04:31:58 +0000
Subject: [maker-devel] Request a favor regarding MAKER
Message-ID: <02EDBAC4-5338-4FE3-99AA-D98CF346A753@genetics.utah.edu>

Sorry for the slow reply. This message somehow got overlooked.

The lock is referring to a file lock. It usually means there is another active MAKER process that is trying to run in the same directory as your current MAKER process. This may mean you have problems with the MPI setup if using MAKER under MPI. Or, if you started MAKER multiple times simultaneously, then you got a collision when both tried to work with the same data. Just kill all active MAKER processes and restart if that is the case.

If it's an MPI issue, run maker with the -h flag added to the current MPI command you are using to run MAKER. If it prints the help message more than once, then the MPI communication ring is having an issue. This could be a problem with how you installed MAKER or how you installed MPI.

--Carson

> On Jun 9, 2017, at 2:39 AM, shaf wrote:
>
> Greetings,
> My name is Shaf and currently I'm using MAKER for my data. I did manage to get some results using MAKER, but I have a problem with my storage.
>
> So I tried to run maker in another directory with more space.
> As far as I know, I have already set up maker so that it can be run anywhere.
>
> I installed maker on /
> Then when I tried to run it on /media/nklee/2TB data/example$ maker ; I got an error
>
> ERROR: The directory is locked. Perhaps by an instance of MAKER.
>
> --> rank=NA, hostname=Lee-Server
>
> I checked it using nklee at Lee-Server:/media/nklee/2TB data$ maker
> and my maker is there.
>
> May I know how to solve this problem? Thank you in advance.
>
> Regards,
> Shaf

From tfallon at mit.edu Thu Jun 22 22:33:59 2017
From: tfallon at mit.edu (Tim Fallon)
Date: Fri, 23 Jun 2017 00:33:59 -0400
Subject: [maker-devel] Database disk image is malformed error
In-Reply-To: <925025C6-C48C-4E29-9A9F-A7CD50ECA211@gmail.com>
References: <41AECDEF-7E2B-4143-BDBC-F10937AEDE3B@mit.edu> <925025C6-C48C-4E29-9A9F-A7CD50ECA211@gmail.com>

Hi Carson,

Thanks for the tip! The issue turned out to be that I needed to use the "-l" parameter for gff3_merge, to automatically rename the IDs when merging them, and also to pass the appropriate evidence in the merged GFF using the "Re-annotation Using MAKER Derived GFF3" parameters. I was using the more general parameters further down (protein_gff, est_gff, etc.).

It seems to be working now, though I am still getting the hang of how to fix up misbehaving gene models.

All the best,
-Tim

Timothy R. Fallon
PhD candidate
Laboratory of Jing-Ke Weng
Department of Biology
MIT
tfallon at mit.edu
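Note on the ID-collision cause discussed in this thread: whether separately produced GFF3 files share feature IDs can be checked before they are merged. The sketch below is a generic check written for illustration; it is not part of MAKER and does not reproduce what gff3_merge does internally. The function names and sample records are hypothetical.

```python
import re
from collections import defaultdict

ID_RE = re.compile(r"(?:^|;)ID=([^;]+)")


def ids_in_gff(lines):
    """Collect the ID attribute of every nine-column feature line."""
    ids = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        match = ID_RE.search(cols[8])
        if match:
            ids.add(match.group(1))
    return ids


def shared_ids(named_inputs):
    """Map each ID seen in more than one input to the sorted names of those inputs."""
    seen = defaultdict(set)
    for name, lines in named_inputs.items():
        for feature_id in ids_in_gff(lines):
            seen[feature_id].add(name)
    return {fid: sorted(names) for fid, names in seen.items() if len(names) > 1}


# Made-up records standing in for two stage-1 outputs.
run_a = [
    "##gff-version 3\n",
    "scaf1\tprotein2genome\tprotein_match\t10\t90\t.\t+\t.\tID=scaf1:hit:1;Name=P1\n",
    "scaf1\tprotein2genome\tprotein_match\t200\t300\t.\t+\t.\tID=scaf1:hit:2;Name=P2\n",
]
run_b = [
    "##gff-version 3\n",
    "scaf1\tprotein2genome\tprotein_match\t15\t95\t.\t+\t.\tID=scaf1:hit:1;Name=Q7\n",
    "scaf1\tprotein2genome\tprotein_match\t500\t700\t.\t+\t.\tID=scaf1:hit:9;Name=Q8\n",
]

collisions = shared_ids({"run_a.gff": run_a, "run_b.gff": run_b})
print(collisions)
```

For inputs in the tens of gigabytes, holding every ID in memory may not be practical, so this is better suited to spot-checking a few scaffolds than to a full scan.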
From tfallon at mit.edu Thu Jun 22 22:59:10 2017
From: tfallon at mit.edu (Tim Fallon)
Date: Fri, 23 Jun 2017 00:59:10 -0400
Subject: [maker-devel] Maker protein match & tandem similar genes
In-Reply-To: <6DBD3F6E-C2DA-47AB-AF6B-DEA396789C79@gmail.com>
References: <6AE52D93-3F7D-4389-BE7B-BD3033F3F316@mit.edu> <6DBD3F6E-C2DA-47AB-AF6B-DEA396789C79@gmail.com>
Message-ID: <7164F2B4-FD8A-466F-B7CA-A1BEEF98223F@mit.edu>

Hi Carson,

Thanks for the response! After sending my initial email, I noticed that this particular issue is warned about in the Campbell et al. 2014 Maker protocols paper. Perhaps future versions of the pipeline might have a workaround or warning for this presumably common issue. At least in my case, the genome I'm annotating has large introns, and also tandem gene clusters of homologous genes, so I've been unable to solve this issue entirely by changing existing parameters (e.g. split_hit), though perhaps exonerate / protein2genome direct gene annotation does handle it correctly.

Regarding protein2genome only being an intermediate stage: as I've been working towards a final annotation, I've actually been relying mostly on the protein2genome direct gene annotation. Although I have a trained Augustus that is presumably getting hints from the evidence, my main target genes have been getting subtly wrong gene models (Augustus-produced splice sites off by a handful of nucleotides, leading to unintended and unsupported amino acids in the protein). I also trained SNAP, but those predictions were worse than the Augustus predictions.

Do you have any tips for using the Evidence Modeler integration of the Maker 3.0.0 beta? That seems to be the best way to have the final gene models rely more on extrinsic evidence than on my mildly incorrect ab-initio predictions.
Or perhaps PASA is more appropriate for gene models that would strictly adhere to extrinsic de novo assembled transcript / predicted ORF evidence?

All the best,
-Tim

> On Jun 23, 2017, at 12:27 AM, Carson Holt wrote:
>
> The protein_match features are the direct BLASTX results. Because of how BLAST works, if you have neighboring paralogs, it can place HSPs in both, so the final hit ends up being too large. The protein2genome feature is then the result of exonerate polishing these BLAST alignments (this will usually remove false merging and bad exon order). The protein2genome=1 option, on the other hand, just tells MAKER that you want to try to convert the exonerate hits directly into gene models (only do this for training and not for final annotation). One way to drop the BLASTX results may be to filter the GFF3 results to keep only protein2genome features, pass those into protein_gff, and then turn off protein= for the next run. This forces the BLASTX results to be dropped. You may want to set the blast_depth parameters to something like 10 in maker_bopts.ctl before doing this, to trim per-locus evidence depth to 10 if you are using too much input data.
>
> -Carson
>
>> On Jun 13, 2017, at 11:35 AM, Tim Fallon wrote:
>>
>> Hi there,
>>
>> I am aligning reference proteins to an insect genome through Maker, in preparation for using the gene models from the protein alignments as evidence to train SNAP (alongside de novo assembled RNA-Seq). I also plan on passing the protein alignments to a future Maker run as hints for SNAP / Augustus.
>>
>> I've noticed that the maker blastx "protein_match" feature, which I presume is a result of Maker trying to make the blastx HSPs contiguous to format as a reference for exonerate (this Maker run did have protein2genome turned on), tends to fuse tandem genes from the same gene family. See attached image.
>>
>> The red regions highlight two de novo assembled transcripts which I aligned manually, from two genes that are homologous.
>> The top track is the blastx "match_part" features; the bottom track is the blastx "protein_match" features. You can see that the protein_match fuses the two genes, using ~1000 bp in an intervening region that doesn't have blastx HSP support in the blastx "match_part" track. The trick seems to be that a single reference protein has blastx matches on both the left and the right gene.
>>
>> Clearly this isn't a good gene model to train SNAP with, but would this misannotation screw up the hints passed to pretrained SNAP / Augustus?
>>
>> Is there any way to prevent this protein_match fusing of adjacent similar genes from happening? For species that are closer, I've set "eval_blastx" to be a lot more stringent (1e-50), and in that case the genes don't get fused (but with that level of stringency it is more like an orthology search than an annotation of general protein similarity). I do have (rare) introns ~1000 bp, so I wouldn't want to set the Maker "split_hit" parameter too low.
>>
>> All the best,
>> -Tim
>>
>> Timothy R. Fallon
>> PhD candidate
>> Laboratory of Jing-Ke Weng
>> Department of Biology
>> MIT
>>
>> tfallon at mit.edu
>>
>> _______________________________________________
>> maker-devel mailing list
>> maker-devel at box290.bluehost.com
>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>

Timothy R. Fallon
PhD candidate
Laboratory of Jing-Ke Weng
Department of Biology
MIT

tfallon at mit.edu
-------------- next part --------------
A non-text attachment was scrubbed...
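The filtering step Carson describes in the quoted reply (keep only the protein2genome features, then pass the result to protein_gff) could look roughly like this in Python. It is a sketch under assumptions: `keep_source` is a made-up helper, and it assumes the exonerate-polished alignments carry `protein2genome` in the GFF3 source column (column 2) while the raw BLASTX hits carry `blastx`; it is worth checking a few lines of your own GFF3 before relying on that.

```python
def keep_source(gff_lines, source="protein2genome"):
    """Yield header lines plus the features whose source column equals `source`."""
    for line in gff_lines:
        line = line.rstrip("\r\n")
        if line.startswith("##FASTA"):
            break  # stop before any embedded sequence section
        if line.startswith("#"):
            yield line  # keep directives such as ##gff-version
            continue
        cols = line.split("\t")
        if len(cols) == 9 and cols[1] == source:
            yield line


if __name__ == "__main__":
    # Toy GFF3 lines: one fused BLASTX hit and one exonerate-polished hit.
    example = [
        "##gff-version 3",
        "ctg1\tblastx\tprotein_match\t1\t900\t.\t+\t.\tID=b1;Name=P1",
        "ctg1\tprotein2genome\tprotein_match\t1\t400\t.\t+\t.\tID=p1;Name=P1",
        "ctg1\tprotein2genome\tmatch_part\t1\t400\t.\t+\t.\tID=p1:part1;Parent=p1",
    ]
    for kept in keep_source(example):
        print(kept)
```

Because parent and child features share the same source value, filtering on column 2 keeps each protein_match together with its match_part children.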
From aravindp at imcb.a-star.edu.sg Fri Jun 23 02:25:18 2017
From: aravindp at imcb.a-star.edu.sg (Aravind PRASAD)
Date: Fri, 23 Jun 2017 08:25:18 +0000
Subject: [maker-devel] Maker annotation of large scaffolds
In-Reply-To:
References:
Message-ID:

Hi All,

Thank you for your inputs. Currently, I'm not using the MPI version but am running Maker in multiple instances. Previously, I tried to run the MPI version but it failed, even though the MPI-Maker installation itself had no issues.

Carson, can you please explain what exactly the blast_depth option does when running Maker?

Thank you all for your time!

Regards,
Aravind.

From: Carson Holt [mailto:carsonhh at gmail.com]
Sent: Friday, 23 June, 2017 12:06 PM
To: Seth Munholland
Cc: Aravind PRASAD; maker-devel at yandell-lab.org
Subject: Re: [maker-devel] Maker annotation of large scaffolds

If running under MPI, the only step that should take a long time would be a final clustering step (the clustering is not parallelized). It should run in well under 24 hours though, so perhaps it is a memory issue or a feature depth issue. You can try running the contig by itself and setting all the blast_depth parameters in maker_bopts.ctl to 10 to help with both.

Otherwise making a large overlap for subdivided contigs (50-100kb) should be enough. Alternatively, look for stretches of NNNNNN's in the contig and split on those.

-Carson

On Jun 22, 2017, at 9:43 AM, Seth Munholland wrote:

I would think splitting could work if you generate a sufficient overlap, e.g. 1-100k, 50-150k, etc. Reassembling the annotations for the overlap regions may be tricky if you get conflicting annotations though.

Seth Munholland, B.Sc.
Department of Biological Sciences
Rm. 304 Biology Building
University of Windsor
401 Sunset Ave.
N9B 3P4
T: (519) 253-3000 Ext: 4755

On Thu, Jun 22, 2017 at 2:39 AM, Aravind PRASAD wrote:

Hi All,

I'm trying to annotate a fish genome using the Maker pipeline. It finished the annotation for most scaffolds, except for 5 of them which are around 100M base pairs in size. The cluster at our institute has a time limit of 24 hrs per job, and these scaffolds could not be annotated within that time.
Can you please suggest another way of finishing the annotation for large scaffolds?

I thought of chunking up the scaffolds to run them, but I'm afraid that would split a gene in two.
Thanks for your time.

Regards,
Aravind PRASAD :: Research Officer :: Comparative and Medical Genomics Lab :: Institute of Molecular and Cell Biology (IMCB) :: Agency for Science, Technology and Research (A*STAR)
61 Biopolis Drive :: #5-04 Proteos :: Singapore 138673 :: DID (+65) 6586 9573 :: Fax (+65) 6779 1117 :: http://www.imcb.a-star.edu.sg/

Note: This message may contain confidential information. If this Email/Fax has been sent to you by mistake, please notify the sender and delete it immediately. Thank you.

_______________________________________________
maker-devel mailing list
maker-devel at box290.bluehost.com
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
-------------- next part --------------
An HTML attachment was scrubbed...
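The two splitting strategies suggested in this thread (fixed windows with a generous overlap, or cutting at runs of N's) can be sketched in Python as below. This is only an illustration: the function names, the default minimum run of six N's, and the toy sizes in the demo are assumptions and do not come from MAKER. Coordinates are 0-based and half-open.

```python
import re


def overlapping_windows(length, window, overlap):
    """Return (start, end) windows covering [0, length), each sharing `overlap` bases with the next."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    step = window - overlap
    windows = []
    start = 0
    while True:
        end = min(start + window, length)
        windows.append((start, end))
        if end == length:
            break
        start += step
    return windows


def split_on_n_runs(seq, min_run=6):
    """Return (start, end) of the segments lying between runs of at least `min_run` N's."""
    segments = []
    pos = 0
    for gap in re.finditer(r"[Nn]{%d,}" % min_run, seq):
        if gap.start() > pos:
            segments.append((pos, gap.start()))
        pos = gap.end()
    if pos < len(seq):
        segments.append((pos, len(seq)))
    return segments


if __name__ == "__main__":
    # Toy sizes; for a ~100 Mb scaffold the overlap would be 50-100 kb as suggested above.
    print(overlapping_windows(250, 100, 50))
    print(split_on_n_runs("ACGT" + "N" * 6 + "GGCC" + "NN" + "TT"))
```

Splitting at N runs avoids cutting through assembled sequence, so it needs no overlap; fixed windows still leave the conflicting annotations in the overlap regions to be reconciled afterwards, as Seth notes.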
From patrick.tranvan at unil.ch Mon Jun 26 03:48:23 2017
From: patrick.tranvan at unil.ch (Patrick Tran Van)
Date: Mon, 26 Jun 2017 09:48:23 +0000
Subject: [maker-devel] Advice on my pipeline
In-Reply-To:
References: <6b029690bace4d3fbae77c0bb1bddce8@prdexch02.ad.unil.ch>,
Message-ID: <1498470630221.84642@unil.ch>

Thanks for your answer.

1) Do you think that adding Augustus training in addition to SNAP at steps 3 and 5 will add more confidence (instead of adding Augustus only for the final round)?
I ask because I am using autoAug for this and it takes a while to compute.

2) I don't see this option: 'avoid_est_fusion=1'. I tried to add it but got this error:

WARNING: Invalid option 'avoid_est_fusion' in control file maker_opts.ctl

(I am using v 2.31.8)

Patrick Tran Van

Groups Chapuisat, Robinson-Rechavi & Schwander
Department of Ecology and Evolution
University of Lausanne
Le Biophore
CH-1015 Lausanne
Switzerland
Office 3206
________________________________
From: Carson Holt
Sent: Monday, June 5, 2017 8:29 PM
To: Patrick Tran Van
Cc: maker-devel at yandell-lab.org
Subject: Re: [maker-devel] Advice on my pipeline

Your plan sounds good. A couple of related notes.

Insect genomes tend to have high gene density, so gene merging will be the primary difficulty. You can avoid merging of mRNA-seq evidence by using options like jaccard_clip in Trinity. Then use avoid_est_fusion=1 inside of MAKER.

Also it is more convenient to do each run in the same directory rather than supplying the previous run as GFF3 input. MAKER will automatically recycle previous results archived in the run directory when you do this. Using the maker_gff option is really more for getting data into the run from jobs performed a long time ago (so they can't be run in the same directory).

-Carson

On Jun 2, 2017, at 3:56 AM, Patrick Tran Van wrote:

Hello,

This is my first time running Maker for an insect genome annotation.
I have found various resources and tried to build a consensus from them. I am looking for your thoughts and advice about my pipeline: whether I can improve something, or whether I am doing useless things.

What I have:
- RNA evidence: transcriptome
- Protein evidence: swissprot/uniprot + busco protein set of insect
- Cegma and busco results of my genome

1) Train SNAP with CEGMA

2) Run (run A) maker with repeat masking with transcript, protein, the new SNAP file (from step 1) and augustus file (from busco).

3) Create SNAP model from run A.

4) Run (run B) with the new SNAP (done at step 3) with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (maker_gff=run_A.gff), turn off repeat masking (rm_pass=1), and use previous mapping results (altest_pass=1 and protein_pass=1).

5) Create SNAP model from run B.

6) Run (run C) with the new SNAP (done at step 5) with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (maker_gff=run_B.gff), turn off repeat masking (rm_pass=1), and use previous mapping results (altest_pass=1 and protein_pass=1).

7) Create SNAP model from run C AND create Augustus gene model from run C

8) Run (run D) with the new SNAP (done at step 7) + AUGUSTUS file (step 7) with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (maker_gff=run_C.gff), turn off repeat masking (rm_pass=1), and use previous mapping results (altest_pass=1 and protein_pass=1). + Use keep_preds=1

Does it seem coherent?

Cheers,

Patrick Tran Van

Groups Chapuisat, Robinson-Rechavi & Schwander
Department of Ecology and Evolution
University of Lausanne
Le Biophore
CH-1015 Lausanne
Switzerland
Office 3206

_______________________________________________
maker-devel mailing list
maker-devel at box290.bluehost.com
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
-------------- next part --------------
An HTML attachment was scrubbed...
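Steps 4, 6 and 8 of the quoted pipeline differ only in a handful of control-file values, so the edits between rounds can be scripted. The following is a minimal Python sketch, assuming the `key=value #comment` line layout of maker_opts.ctl. `set_ctl_options` and the file name `run_A.hmm` are hypothetical; `snaphmm` is assumed here to be the option that takes the SNAP HMM, and the other option names are the ones used in the message above. Unknown keys raise an error instead of being appended, which would catch a misspelled option such as avoid_est_fusion before MAKER does.

```python
import re

# key=value, optionally followed by a "#comment" (assumed control-file layout).
CTL_LINE = re.compile(r"^(\w+)=([^#]*)(#.*)?$")


def set_ctl_options(ctl_text, updates):
    """Return ctl_text with the values of the keys in `updates` replaced, keeping comments."""
    remaining = dict(updates)
    out = []
    for line in ctl_text.splitlines():
        match = CTL_LINE.match(line)
        if match and match.group(1) in remaining:
            key = match.group(1)
            comment = match.group(3) or ""
            out.append(f"{key}={remaining.pop(key)} {comment}".rstrip())
        else:
            out.append(line)  # comments, blank lines, untouched options
    if remaining:
        raise KeyError(f"options not found in control file: {sorted(remaining)}")
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # A cut-down stand-in for maker_opts.ctl; the comments are illustrative.
    ctl = (
        "est2genome=1 #infer gene predictions directly from ESTs\n"
        "protein2genome=1 #infer predictions from protein homology\n"
        "snaphmm= #SNAP HMM file\n"
    )
    run_b = {"est2genome": 0, "protein2genome": 0, "snaphmm": "run_A.hmm"}
    print(set_ctl_options(ctl, run_b), end="")
```

If each round is run in the same directory, as recommended in the quoted reply, only the predictor settings need to change between rounds and the maker_gff / *_pass values can be left alone.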
From carsonhh at gmail.com Mon Jun 26 15:38:19 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 26 Jun 2017 15:38:19 -0600
Subject: [maker-devel] Advice on my pipeline
In-Reply-To: <1498470630221.84642@unil.ch>
References: <6b029690bace4d3fbae77c0bb1bddce8@prdexch02.ad.unil.ch> <1498470630221.84642@unil.ch>
Message-ID: <696C51C6-5606-4ECB-A8B8-9C077182FFFA@gmail.com>

Sorry, the option is -> correct_est_fusion

It is in the maker_opts.ctl file.

I would use both SNAP and Augustus on a few large contigs and then review the results manually. If one of them is not behaving well, then drop it. If both behave well (i.e. correlate well with evidence alignments), then keep them both.

-Carson

From carsonhh at gmail.com Mon Jun 26 15:48:03 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 26 Jun 2017 15:48:03 -0600
Subject: [maker-devel] Maker annotation of large scaffolds
In-Reply-To:
References:
Message-ID: <77736B2B-643C-417F-AB5E-5C709735E3AF@gmail.com>

All results are kept and placed into the final GFF3 unless you set blast_depth. Basically, most alignments are redundant and can be thrown away early in the process, but MAKER does not do this by default because most users tend to want to see all evidence. MAKER only uses 10 alignments to build its calculations anyway. The rest are just kept for reference. But the cost of keeping the other alignments around can be substantial if you are in a region with deep evidence depth (I've seen regions with 5,000 - 10,000 evidence alignments for some datasets). So if you set blast_depth, it tells MAKER you are OK with throwing out the extra depth early (MAKER still parses all alignments; it just throws extra ones away as it determines they are not useful or are redundant).
This saves a lot of time and RAM downstream at the cost of losing the alignments in the report. A depth of 10 means that no more than 10 alignments per data source will be kept per locus.

-Carson

From carsonhh at gmail.com Mon Jun 26 15:48:46 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 26 Jun 2017 15:48:46 -0600
Subject: [maker-devel] Maker annotation of large scaffolds
In-Reply-To:
References:
Message-ID:

Also you can run MPI within a single node and not across nodes.
This will still give a performance bonus equal to the MPI process count.

-Carson

From carsonhh at gmail.com Mon Jun 26 16:00:24 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 26 Jun 2017 16:00:24 -0600
Subject: [maker-devel] Maker protein match & tandem similar genes
In-Reply-To: <7164F2B4-FD8A-466F-B7CA-A1BEEF98223F@mit.edu>
References: <6AE52D93-3F7D-4389-BE7B-BD3033F3F316@mit.edu> <6DBD3F6E-C2DA-47AB-AF6B-DEA396789C79@gmail.com> <7164F2B4-FD8A-466F-B7CA-A1BEEF98223F@mit.edu>
Message-ID:

Augustus uses an HMM with scoring bonuses for evidence match.
If a difference in the assembly breaks the ORF anywhere in the transcript relative to the evidence or removes high scoring transcript start/stop sequences, then Augustus will add/skip exons or trim/extend transcripts to capture what scoring bonuses it can as best it can. So wherever you see Augustus behaving weirdly, you likely have something off in the assembly (small stretch of NN?s or single basepair duplications/deletions that affect the ORF and scoring model). So what Augustus produces is the best fit gene model to hop around assembly anomalies while still producing a canonical model. In areas like the ones I describe above, EVM refuses to produce any model. So you can experiment with the EVM options in MAKER3, but what you may find is that problem regions tend to get no models with EVM. I believe using the pred_gff trick I mentioned previously may be the easiest work around. Also make sure to prefilter mRNA-seq evidence to avoid transcript joining (trinity has a jaccard_clip option which can help). Because if you are getting transcript joining in proteins, you are almost certain to get it in transcript evidence as well. ?Carson > On Jun 22, 2017, at 10:59 PM, Tim Fallon wrote: > > Hi Carson, > > Thanks for the response! After sending my initial email, I did notice this particular issue was warned about in the Cambell et al. 2014 Maker protocols paper. Perhaps future versions of the pipeline might have a workaround or warning for this presumably common issue. At least in my case, the genome I?m annotating has large introns, and also tandem gene clusters of homologous genes, so I?ve been unable to solve this issue entirely by changing existing parameters (e.g. split_hit), though perhaps exonerate / protein2genome direct gene annotation does handle it correctly. 
> > Regarding the protein2genome only being a intermediate stage, as I?ve been working towards a final annotation, I?ve actually been mostly relying on the protein2genome direct gene annotation, as although I have a trained Augustus that is presumably getting the hints from the evidence, my main target genes have been producing subtly wrong gene-models (Augustus produced splice sit off by a handful of nucleotides, leading to unintended & unsupported amino acids in the protein). I also trained SNAP, but those predictions were worse than the Augustus predictions. > > Do you have any tips for using the Evidence Modeler integration of the Maker 3.0.0 beta? That seems to be the best way to have the final gene models rely more on extrinsic evidence over my mildly incorrect ab-initio predictions. Or perhaps PASA is more appropriate for gene-models that would strictly adhere to extrinsic de novo assembled transcript / predicted ORF evidence? > > All the best, > -Tim > >> On Jun 23, 2017, at 12:27 AM, Carson Holt > wrote: >> >> The protein_match features are the direct BLASTX results. Because of how BLAST works, if you have neighboring paralogs, it can place HSPs in both. So the final hit ends up being to large. The protein2genome feature is then the result of exonerate polishing these blast alignments (this will usually remove false merging and bad exon order). The protein2genome=1 option on the other hand just tell maker that you want to try and convert the exonerate hits directly into gene models (only do this for training and not final annotation). One way to drop the BLASTX results may be to filter the GFF3 results to keep only protein2genome features, pass those into protein_gff, and then turn off protein= for the next run. This forces the blastx results to be dropped. You may want to set blast_depth parameters to something like 10 in maker_bopts.ctl before doing this to trim per locus evidence depth to 10 if you are using too much input data. 
>> >> ?Carson >> >> >> >>> On Jun 13, 2017, at 11:35 AM, Tim Fallon > wrote: >>> >>> Hi there, >>> >>> I am aligning reference proteins to an insect genome through Maker, in preparation for using the gene models from the protein alignments as evidence to train SNAP (alongside de-novo assembled RNA-Seq). I also plan on passing the protein alignments to a future Maker run as hints for SNAP / Augustus. >>> >>> I?ve noticed that the maker blastx "protein_match? feature, which I presume is a result of Maker trying to make the blastx HSPs contiguous to format as a reference for exonerate (this Maker run did have protein2genome turned on), tends to fuse tandem genes from the same gene family. See attached image. >>> >>> The red regions highlight two de novo assembled transcripts which I aligned manually, from two genes that are homologous. The top track is the blastx ?match_part? features, the bottom track is the blastx ?protein_match? features. You can see that the protein_match fuses the two genes, using ~1000 bp in an intervening region, that doesn?t have blastx HSP support in the blastx ?match_part? track. The trick seems to be that a single reference protein, has blastx matches on both the left and right gene. >>> >>> Cleary this isn?t a good gene model to train SNAP with, but would this misannotation screw up the hints passed to pretrained SNAP / Augustus? >>> >>> Is there anyway to prevent this protein_match fusing of adjacent similar genes from happening? For species that are closer, I?ve set the ?eval_blastx? to be a lot higher (1e-50), and in that case the genes don?t get fused (but, with that level of stringent search, it is more like an orthology search, rather than just annotating general protein similarity). I do have (rare) introns ~1000 bp, so I wouldn?t want to change the Maker ?split_hit? parameter to be too low. >>> >>> All the best, >>> -Tim >>> >>> Timothy R. 
Fallon
>>> PhD candidate
>>> Laboratory of Jing-Ke Weng
>>> Department of Biology
>>> MIT
>>>
>>> tfallon at mit.edu
>>>
>>> _______________________________________________
>>> maker-devel mailing list
>>> maker-devel at box290.bluehost.com
>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>
>
> Timothy R. Fallon
> PhD candidate
> Laboratory of Jing-Ke Weng
> Department of Biology
> MIT
>
> tfallon at mit.edu

From aravindp at imcb.a-star.edu.sg Tue Jun 27 01:07:50 2017
From: aravindp at imcb.a-star.edu.sg (Aravind PRASAD)
Date: Tue, 27 Jun 2017 07:07:50 +0000
Subject: [maker-devel] Maker annotation of large scaffolds
In-Reply-To: <77736B2B-643C-417F-AB5E-5C709735E3AF@gmail.com>
References: <77736B2B-643C-417F-AB5E-5C709735E3AF@gmail.com>
Message-ID:

Thank you, Carson, for the explanation. The annotation of the large scaffolds is now resolved, using MPI Maker together with a changed blast_depth option.

Aravind Prasad.

From: Carson Holt [mailto:carsonhh at gmail.com]
Sent: Tuesday, 27 June, 2017 5:48 AM
To: Aravind PRASAD
Cc: Seth Munholland; maker-devel at yandell-lab.org
Subject: Re: [maker-devel] Maker annotation of large scaffolds

All results are kept and placed into the final GFF3 unless you set blast_depth. Basically, most alignments are redundant and can be thrown away early in the process, but maker does not do this by default because most users tend to want to see all evidence. MAKER only uses 10 alignments to build its calculations anyway. The rest are just kept for reference. But the cost of keeping the other alignments around can be substantial if you are in a region with deep evidence depth (I've seen regions with 5,000 - 10,000 evidence alignments for some datasets). So if you set blast_depth, it tells MAKER you are OK with throwing out the extra depth early (MAKER still parses all alignments; it just throws extra ones away as it determines they are not useful or are redundant). This saves a lot of time and RAM downstream, at the cost of losing the alignments in the report. A depth of 10 means that no more than 10 alignments per data source will be kept per locus.

--Carson

On Jun 23, 2017, at 2:25 AM, Aravind PRASAD wrote:

Hi All,

Thank you for your inputs. Currently, I'm not using the MPI version but running Maker in multiple instances. Previously, I tried to run the MPI version but failed, though the installation of MPI-Maker had no issues.

Carson, can you please explain what exactly the blast_depth option does while running Maker?

Thank you all for your time!

Regards,
Aravind.

From: Carson Holt [mailto:carsonhh at gmail.com]
Sent: Friday, 23 June, 2017 12:06 PM
To: Seth Munholland
Cc: Aravind PRASAD; maker-devel at yandell-lab.org
Subject: Re: [maker-devel] Maker annotation of large scaffolds

If running under MPI, the only step that should take a long time would be a final clustering step (the clustering is not parallelized). It should run in well under 24 hours though, so perhaps it is a memory issue or a feature depth issue. You can try running the contig by itself and setting all the blast_depth parameters in maker_bopts.ctl to 10 to help with both. Otherwise, making a large overlap for subdivided contigs (50-100kb) should be enough. Alternatively, look for stretches of NNNNNN's in the contig and split on those.

--Carson

On Jun 22, 2017, at 9:43 AM, Seth Munholland wrote:

I would think splitting could work if you generate a sufficient overlap, i.e. 1-100k, 50-150k, etc. Reassembling the annotations for the overlap regions may be tricky if you get conflicting annotations, though.

Seth Munholland, B.Sc.
Department of Biological Sciences
Rm. 304 Biology Building
University of Windsor
401 Sunset Ave. N9B 3P4
T: (519) 253-3000 Ext: 4755

On Thu, Jun 22, 2017 at 2:39 AM, Aravind PRASAD wrote:

Hi All,

I'm trying to annotate a fish genome using the Maker pipeline. It finished the annotation for most scaffolds, except 5 of them which are around 100M base pairs in size. The cluster at our institute has a time limit of 24 hrs per job, and these scaffolds could not be annotated within that time. Can you please suggest any other way of finishing the annotation for large scaffolds?

I thought of chunking up the scaffolds to run, but I'm afraid that would split a gene into two.

Thanks for your time.

Regards,
Aravind PRASAD :: Research Officer :: Comparative and Medical Genomics Lab :: Institute of Molecular and Cell Biology (IMCB) :: Agency for Science, Technology and Research (A*STAR)
61 Biopolis Drive :: #5-04 Proteos :: Singapore 138673 :: DID (+65) 6586 9573 :: Fax (+65) 6779 1117 :: http://www.imcb.a-star.edu.sg/

Note: This message may contain confidential information. If this Email/Fax has been sent to you by mistake, please notify the sender and delete it immediately. Thank you.
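The two splitting ideas raised in this thread (overlapping windows such as 1-100k, 50-150k, and splitting on stretches of N's) can be sketched as below. This is an illustration only and not part of MAKER; the function names, the default window and overlap sizes, and the minimum N-run length are assumptions chosen to mirror the numbers mentioned above.

```python
import re


def overlapping_windows(length, window=100_000, overlap=50_000):
    """Return 1-based inclusive (start, end) windows covering 1..length."""
    if length < 1:
        raise ValueError("length must be at least 1")
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    step = window - overlap
    windows = []
    start = 1
    while True:
        end = min(start + window - 1, length)
        windows.append((start, end))
        if end == length:
            break
        start += step
    return windows


def n_runs(sequence, min_len=10):
    """Return 0-based half-open (start, end) spans of N runs >= min_len."""
    pattern = re.compile(r"[Nn]{%d,}" % min_len)
    return [m.span() for m in pattern.finditer(sequence)]


print(overlapping_windows(250_000))
print(n_runs("ACGT" + "N" * 12 + "ACGT" + "NNN" + "AC"))
```

Genes that fall entirely inside an overlap will be annotated twice, so the per-window results still need to be reconciled afterwards, which is the tricky part Seth mentions.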
From john at hovedpuden.dk Wed Jun 28 02:54:40 2017
From: john at hovedpuden.dk (John Damm Sørensen)
Date: Wed, 28 Jun 2017 10:54:40 +0200
Subject: [maker-devel] maker with MPI and perl using threads
Message-ID: <9218f472-d9e4-179c-c712-ddf09e97d8fc@hovedpuden.dk>

Hello,

Recently I helped one of my customers solve problems running maker with MPI.

It seems that the main reason for the trouble was maker not initializing the MPI environment for thread-safe execution.

In the MPI.pm module you call MPI_Init, whereas for a threaded environment you should call MPI_Init_thread.

I think it would be a good idea to detect whether the user's Perl is built with thread support and init MPI accordingly, or to state clearly that maker is for unthreaded Perl only.

During the debugging we also found that it was beneficial to have the latest mxm.c installed:

https://community.mellanox.com/thread/3439

Best Regards

John Damm Sørensen

IT consultant

From carsonhh at gmail.com Thu Jun 29 14:43:21 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 29 Jun 2017 14:43:21 -0600
Subject: [maker-devel] maker with MPI and perl using threads
In-Reply-To: <9218f472-d9e4-179c-c712-ddf09e97d8fc@hovedpuden.dk>
References: <9218f472-d9e4-179c-c712-ddf09e97d8fc@hovedpuden.dk>
Message-ID:

MAKER doesn't use threads. It uses forks. There are several reasons for this, including that Perl already requires you to use hidden fork operations every time you call system() or open(). So trying to get forks, threads, and MPI together is just a mess, and we stuck with just forks and MPI. With MPI flavors with direct infiniband support you can get weird errors because of how forks affect registered memory when the MPI flavor uses OpenIB libraries. There is also another issue with how perl wraps malloc calls that affects registered memory. The best way around both these issues is to disable direct infiniband support, and then set the MPI flavor to use -tcp over the ip-over-infiniband virtual adaptor (usually ib0) or use the ethernet adapter. Sometime last year after a system update we also got an OpenMPI error (only on CentOS6) that referred to a file used for Perl threads (which should not even be in use since we are using forks). We worked around that issue by compiling Perl without thread support so that the library couldn't be called.

If you were using OpenMPI and were seeing a reference to Perl threads, then the error you saw may have been related to the latter one I mentioned.

I have tried the MPI_Init_thread option before and ran into issues because of it, without it helping any of the previously mentioned fork-related issues. But that was some time ago, so I could try it as a solution to the last issue I mentioned rather than installing a no-thread version of perl (if I can ever replicate the error, because it went away when we updated from CentOS kernel 6 to kernel 7).

Thanks,
Carson

From carsonhh at gmail.com Thu Jun 29 14:56:46 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 29 Jun 2017 14:56:46 -0600
Subject: [maker-devel] maker with MPI and perl using threads
In-Reply-To:
References: <9218f472-d9e4-179c-c712-ddf09e97d8fc@hovedpuden.dk>
Message-ID:

Also, when you see calls like threads->new() in the maker script, it is creating a fork and not a thread. It's a convenience feature activated by the "use forks" line at the top of the script. It allows you to use thread-like syntax when working with forks.

--Carson

From qlian003 at ucr.edu Fri Jun 30 13:30:19 2017
From: qlian003 at ucr.edu (Qihua Liang)
Date: Fri, 30 Jun 2017 12:30:19 -0700
Subject: [maker-devel] Possible ways to improve annotated gene numbers
Message-ID: <82A335D3-085B-414E-802C-8E2918EA7EA0@ucr.edu>

Dear Maker Development Team,

Hi, I am using Maker for annotation and BUSCO to evaluate the completeness. For de novo predictions, I am using Augustus, GeneMark, and SNAP, and the annotated proteins have completeness of ~80%, ~50%, and ~50% respectively. When I cat all the de novo annotated proteins of these three tools, the completeness is much higher, at ~92%. But for all.maker.proteins.fasta, the completeness is only ~80%.

1. Does this mean that some proteins annotated by Augustus/GeneMark/SNAP are not included in the file all.maker.proteins.fasta? Is it because such excluded proteins do not have hits to the EST evidence?

2. To achieve a higher BUSCO completeness, what possible ways can be used? Including more EST evidence from other species?

Thank you
Qihua

From carsonhh at gmail.com Mon Jun 5 12:29:57 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 5 Jun 2017 12:29:57 -0600
Subject: [maker-devel] Advice on my pipeline
In-Reply-To: <6b029690bace4d3fbae77c0bb1bddce8@prdexch02.ad.unil.ch>
References: <6b029690bace4d3fbae77c0bb1bddce8@prdexch02.ad.unil.ch>
Message-ID:

Your plan sounds good. A couple of related notes. Insect genomes tend to have high gene density, so gene merging will be the primary difficulty. You can avoid merging of mRNA-seq evidence by using options like jaccard_clip in Trinity. Then use avoid_est_fusion=1 inside of MAKER. Also, it is more convenient to do each run in the same directory rather than supplying the previous run as GFF3 input. MAKER will automatically recycle previous results archived in the run directory when you do this. Using the maker_gff option is really more for getting data into the run from jobs performed a long time ago (so they can't be run in the same directory).

--Carson

> On Jun 2, 2017, at 3:56 AM, Patrick Tran Van wrote:
>
> Hello,
>
> This is my first time running Maker for an insect genome annotation.
>
> I have found various resources and tried to make a consensus. I am looking for your thoughts and advice about my pipeline, whether I can improve something or am doing useless things:
>
>
> What I have:
> - RNA evidence: transcriptome
> - Protein evidence: swissprot/uniprot + busco protein set of insect
> - Cegma and busco results of my genome
>
>
> 1) Train SNAP with CEGMA
>
> 2) Run (run A) maker with repeat masking with transcript, protein, the new SNAP file (from step 1) and augustus file (from busco).
>
> 3) Create SNAP model from run A.
>
> 4) Run (run B) with the new SNAP (done at step 3) with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (maker_gff=run_A.gff), turn off repeat masking (rm_pass=1), and use previous mapping results (altest_pass=1 and protein_pass=1).
>
> 5) Create SNAP model from run B.
>
> 6) Run (run C) with the new SNAP (done at step 5) with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (maker_gff=run_B.gff), turn off repeat masking (rm_pass=1), and use previous mapping results (altest_pass=1 and protein_pass=1).
>
> 7) Create SNAP model from run C AND create Augustus gene model from run C
>
> 8) Run (run D) with the new SNAP (done at step 7) + AUGUSTUS file (step 7) with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (maker_gff=run_C.gff), turn off repeat masking (rm_pass=1), and use previous mapping results (altest_pass=1 and protein_pass=1). + Use keep_preds=1
>
>
>
> Does it seem coherent?
>
> Cheers,
>
> Patrick Tran Van
>
> Groups Chapuisat, Robinson-Rechavi & Schwander
> Department of Ecology and Evolution
> University of Lausanne
> Le Biophore
> CH-1015 Lausanne
> Switzerland
> Office 3206

From glenna.kramer at utoronto.ca Wed Jun 21 09:51:11 2017
From: glenna.kramer at utoronto.ca (Glenna Kramer)
Date: Wed, 21 Jun 2017 15:51:11 +0000
Subject: [maker-devel] How to address errors encountered in process of submitting genome
Message-ID: <4781C7F0FC2DAA4BBC18FC44DC9D09AE015065617A@ArborExMBx4P.UTORARBOR.UTORAD.Utoronto.ca>

Hi there,

I am attempting to submit a fungal genome to NCBI and have run into quite a few errors running tbl2asn.
I know this isn't directly related to MAKER, but I'm hoping that someone here has been through this process and would be able to give some insight (or at least point me in the direction of another knowledgeable source)!

Here is a general overview of the process that I have used so far:
1. Converted MAKER GFF3 files to tbl files using GAG (ran options remove_introns_shorter_than 10 and fix_start_stop and added functional annotation as well). This seems to work well to convert the GFF3 to tbl.
2. Used tbl2asn to convert the tbl to a sqn file. This also seems to work, but I am getting lots of errors in the .val output file, which I am unsure how to address. There are...

56 ERROR: SEQ_FEAT.BadTrailingHyphen
1525 ERROR: SEQ_FEAT.InternalStop
1142 ERROR: SEQ_FEAT.NoStop
10 ERROR: SEQ_FEAT.PartialProblem
2368 ERROR: SEQ_FEAT.StartCodon
2368 ERROR: SEQ_INST.BadProteinStart
1525 ERROR: SEQ_INST.StopInProtein
8821 WARNING: SEQ_FEAT.NotSpliceConsensusAcceptor
8543 WARNING: SEQ_FEAT.NotSpliceConsensusDonor
83 WARNING: SEQ_FEAT.PartialProblem
1 WARNING: SEQ_FEAT.ProteinNameEndsInBracket
19 WARNING: SEQ_FEAT.ShortExon
270 INFO: SEQ_FEAT.RareSpliceConsensusDonor

Also, just as a side note, has anyone tried the new table2asn_GFF converter that is up to convert GFF3 directly to sqn? I was thinking that I would give that a shot, hoping that it would help with some of these errors. However, I was instantly met with an error as well: "Too many positional arguments (1), the offending value: ends."

Thank you so much in advance for any help you are able to give!

Glenna
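When deciding what to tackle first, it can help to total a summary like the one above by severity. The sketch below parses the list exactly as it was pasted into this message; it assumes each line has the form "COUNT LEVEL: CODE", which may not match the layout of the actual .val file, and the names used are made up for illustration.

```python
import re
from collections import Counter

# Summary lines copied from the message above.
SUMMARY = """\
56 ERROR: SEQ_FEAT.BadTrailingHyphen
1525 ERROR: SEQ_FEAT.InternalStop
1142 ERROR: SEQ_FEAT.NoStop
10 ERROR: SEQ_FEAT.PartialProblem
2368 ERROR: SEQ_FEAT.StartCodon
2368 ERROR: SEQ_INST.BadProteinStart
1525 ERROR: SEQ_INST.StopInProtein
8821 WARNING: SEQ_FEAT.NotSpliceConsensusAcceptor
8543 WARNING: SEQ_FEAT.NotSpliceConsensusDonor
83 WARNING: SEQ_FEAT.PartialProblem
1 WARNING: SEQ_FEAT.ProteinNameEndsInBracket
19 WARNING: SEQ_FEAT.ShortExon
270 INFO: SEQ_FEAT.RareSpliceConsensusDonor
"""

LINE = re.compile(r"^\s*(\d+)\s+(ERROR|WARNING|INFO):\s+(\S+)\s*$")


def tally(text):
    """Sum the counts per severity level for lines like '56 ERROR: CODE'."""
    totals = Counter()
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            totals[match.group(2)] += int(match.group(1))
    return totals


totals = tally(SUMMARY)
for level in ("ERROR", "WARNING", "INFO"):
    print(level, totals[level])
```

Only the ERROR lines block a submission in the sense discussed here, so those are the ones to look at first.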
URL: From jason.stajich at gmail.com Wed Jun 21 12:25:43 2017 From: jason.stajich at gmail.com (Jason Stajich) Date: Wed, 21 Jun 2017 18:25:43 +0000 Subject: [maker-devel] How to address errors encountered in process of submitting genome In-Reply-To: <4781C7F0FC2DAA4BBC18FC44DC9D09AE015065617A@ArborExMBx4P.UTORARBOR.UTORAD.Utoronto.ca> References: <4781C7F0FC2DAA4BBC18FC44DC9D09AE015065617A@ArborExMBx4P.UTORARBOR.UTORAD.Utoronto.ca> Message-ID: Glenna - FWIW - I've switched to doing an EVM cleanup with funannotate after MAKER due to these issues with MAKER and fungal genomes I submit. Jason On Wed, Jun 21, 2017 at 8:51 AM Glenna Kramer wrote: > Hi there, > > I am attempting to submit a fungal genome to NCBI and have run into quite > a few errors running tbl2asn. I know this isn't directly related to MAKER, > but I'm hoping that someone here has been through this process and would be > able to give some insight (or at least point me in the direction of another > knowledgeable source)! > > Here is a general overview of the process that have used so far: > 1. Converted MAKER GFF3 files to tbl files using GAG (ran options > remove_introns_shorter_than 10 and fix_start_stop and added functional > annotation as well). This seems to work well to convert the GFF3 to tbl. > 2. Use tbl2asn to convert the tbl to sqn file. This also seems to work, > but I am getting lots of errors in the .val output file, which I am unsure > how to address. There are... 
> > 56 ERROR: SEQ_FEAT.BadTrailingHyphen > 1525 ERROR: SEQ_FEAT.InternalStop > 1142 ERROR: SEQ_FEAT.NoStop > 10 ERROR: SEQ_FEAT.PartialProblem > 2368 ERROR: SEQ_FEAT.StartCodon > 2368 ERROR: SEQ_INST.BadProteinStart > 1525 ERROR: SEQ_INST.StopInProtein > 8821 WARNING: SEQ_FEAT.NotSpliceConsensusAcceptor > 8543 WARNING: SEQ_FEAT.NotSpliceConsensusDonor > 83 WARNING: SEQ_FEAT.PartialProblem > 1 WARNING: SEQ_FEAT.ProteinNameEndsInBracket > 19 WARNING: SEQ_FEAT.ShortExon > 270 INFO: SEQ_FEAT.RareSpliceConsensusDonor > > Also, just as a side note, has anyone tried the new table2asn_GFF > converter that is up to convert GFF3 directly to sqn? I was thinking that > I would give that a shot hoping that it would help with some of these > errors. However, I was instantly met with an error as well. "Too many > positional arguments (1), the offending value: ends." > > Thank you so much in advance for any help you are able to give! > > Glenna > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > -------------- next part -------------- An HTML attachment was scrubbed... URL: From glenna.kramer at utoronto.ca Wed Jun 21 20:33:15 2017 From: glenna.kramer at utoronto.ca (Glenna Kramer) Date: Thu, 22 Jun 2017 02:33:15 +0000 Subject: [maker-devel] How to address errors encountered in process of submitting genome In-Reply-To: References: <4781C7F0FC2DAA4BBC18FC44DC9D09AE015065617A@ArborExMBx4P.UTORARBOR.UTORAD.Utoronto.ca>, Message-ID: <4781C7F0FC2DAA4BBC18FC44DC9D09AE01506571C4@ArborExMBx4P.UTORARBOR.UTORAD.Utoronto.ca> Thank you for the thought! So, to clarify do you use funannotate predict on the maker gff files, similar to the last example given here? https://github.com/nextgenusfs/funannotate/wiki. I'm completely willing to give it a shot... Is brings up other questions for me, though. How do you do your functional annotation? Maker? 
I noticed that funannotate will do functional annotation, but currently was adding in my functional annotation using GAG when I was converting the maker gff to tbl. Also, from what I understand, funannotate will output a gbk from the gff. Do you have a particular file conversion tool to get that onto the sqn format that you've had success with? Thanks, Glenna ________________________________________ From: Jason Stajich [jason.stajich at gmail.com] Sent: Wednesday, June 21, 2017 2:25 PM To: Glenna Kramer; maker-devel at yandell-lab.org Subject: Re: [maker-devel] How to address errors encountered in process of submitting genome Glenna - FWIW - I've switched to doing an EVM cleanup with funannotate after MAKER due to these issues with MAKER and fungal genomes I submit. Jason On Wed, Jun 21, 2017 at 8:51 AM Glenna Kramer > wrote: Hi there, I am attempting to submit a fungal genome to NCBI and have run into quite a few errors running tbl2asn. I know this isn't directly related to MAKER, but I'm hoping that someone here has been through this process and would be able to give some insight (or at least point me in the direction of another knowledgeable source)! Here is a general overview of the process that have used so far: 1. Converted MAKER GFF3 files to tbl files using GAG (ran options remove_introns_shorter_than 10 and fix_start_stop and added functional annotation as well). This seems to work well to convert the GFF3 to tbl. 2. Use tbl2asn to convert the tbl to sqn file. This also seems to work, but I am getting lots of errors in the .val output file, which I am unsure how to address. There are... 
56 ERROR: SEQ_FEAT.BadTrailingHyphen 1525 ERROR: SEQ_FEAT.InternalStop 1142 ERROR: SEQ_FEAT.NoStop 10 ERROR: SEQ_FEAT.PartialProblem 2368 ERROR: SEQ_FEAT.StartCodon 2368 ERROR: SEQ_INST.BadProteinStart 1525 ERROR: SEQ_INST.StopInProtein 8821 WARNING: SEQ_FEAT.NotSpliceConsensusAcceptor 8543 WARNING: SEQ_FEAT.NotSpliceConsensusDonor 83 WARNING: SEQ_FEAT.PartialProblem 1 WARNING: SEQ_FEAT.ProteinNameEndsInBracket 19 WARNING: SEQ_FEAT.ShortExon 270 INFO: SEQ_FEAT.RareSpliceConsensusDonor Also, just as a side note, has anyone tried the new table2asn_GFF converter that is up to convert GFF3 directly to sqn? I was thinking that I would give that a shot hoping that it would help with some of these errors. However, I was instantly met with an error as well. "Too many positional arguments (1), the offending value: ends." Thank you so much in advance for any help you are able to give! Glenna _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.com http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org From jason.stajich at gmail.com Wed Jun 21 22:09:50 2017 From: jason.stajich at gmail.com (Jason Stajich) Date: Thu, 22 Jun 2017 04:09:50 +0000 Subject: [maker-devel] How to address errors encountered in process of submitting genome In-Reply-To: <4781C7F0FC2DAA4BBC18FC44DC9D09AE01506571C4@ArborExMBx4P.UTORARBOR.UTORAD.Utoronto.ca> References: <4781C7F0FC2DAA4BBC18FC44DC9D09AE015065617A@ArborExMBx4P.UTORARBOR.UTORAD.Utoronto.ca> <4781C7F0FC2DAA4BBC18FC44DC9D09AE01506571C4@ArborExMBx4P.UTORARBOR.UTORAD.Utoronto.ca> Message-ID: Quick answers for now. A) you can feed maker gff to Funannotate or run it alone B) I run the annotate step in funannotate but generally transfer only swissprot annots as product desc. Have to manually edit to remove systematic orf names In product desc that NCBI will flag - e.g. YAL001W, AN1234, ARB_xx. 
You have to edit the annotations.swissprot.txt file to use the product descriptor if you want to promote these to full product descriptions in the resulting .tbl file. You may want to run iprscan locally, or wait for it to run remotely, to get GO assignments included.

C) You get .tbl and .fsa files from GAG, and these are processed by tbl2asn to get the sqn file. All are produced in the result file. All automatic.

Jason

On Wed, Jun 21, 2017 at 7:33 PM Glenna Kramer wrote:

> Thank you for the thought!
>
> So, to clarify, do you use funannotate predict on the MAKER GFF files,
> similar to the last example given here?
> https://github.com/nextgenusfs/funannotate/wiki. I'm completely willing
> to give it a shot...
>
> This brings up other questions for me, though. How do you do your
> functional annotation? Maker? I noticed that funannotate will do functional
> annotation, but currently I was adding in my functional annotation using GAG
> when converting the MAKER GFF to tbl.
>
> Also, from what I understand, funannotate will output a gbk from the GFF.
> Do you have a particular file conversion tool to get that into the sqn
> format that you've had success with?
>
> Thanks,
> Glenna
> ________________________________________
> From: Jason Stajich [jason.stajich at gmail.com]
> Sent: Wednesday, June 21, 2017 2:25 PM
> To: Glenna Kramer; maker-devel at yandell-lab.org
> Subject: Re: [maker-devel] How to address errors encountered in process of
> submitting genome
>
> Glenna -
>
> FWIW - I've switched to doing an EVM cleanup with funannotate after MAKER
> due to these issues with MAKER and fungal genomes I submit.
>
> Jason

--
Jason Stajich
jason.stajich at gmail.com
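The product-description cleanup mentioned in the reply above (removing systematic ORF names that NCBI will flag) could be pre-screened with a small script. What follows is only a rough sketch: the three regular expressions are guesses built from the examples given in the message (YAL001W, AN1234, ARB_xx), not a list published by NCBI, and the function name find_systematic_names is made up for illustration.

```python
import re

# Illustrative patterns only (assumptions derived from the examples in the
# message, not an official NCBI list).
SYSTEMATIC_NAME_PATTERNS = [
    re.compile(r"\bY[A-P][LR]\d{3}[WC](?:-[A-Z])?\b"),  # yeast-style, e.g. YAL001W
    re.compile(r"\bAN\d{4,5}\b"),                        # e.g. AN1234
    re.compile(r"\bARB_\d+\b"),                          # e.g. ARB_01234
]


def find_systematic_names(product):
    """Return substrings of a product description that look like systematic ORF names."""
    hits = []
    for pattern in SYSTEMATIC_NAME_PATTERNS:
        hits.extend(pattern.findall(product))
    return hits


if __name__ == "__main__":
    for description in ["Protein YAL001W homolog", "hypothetical protein"]:
        print(description, "->", find_systematic_names(description))
```

Descriptions that return a non-empty list would still need to be edited by hand; the script only points at candidates.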
From tfallon at mit.edu Tue Jun 13 11:35:28 2017
From: tfallon at mit.edu (Tim Fallon)
Date: Tue, 13 Jun 2017 13:35:28 -0400
Subject: [maker-devel] Maker protein match & tandem similar genes
Message-ID: <6AE52D93-3F7D-4389-BE7B-BD3033F3F316@mit.edu>

Hi there,

I am aligning reference proteins to an insect genome through Maker, in preparation for using the gene models from the protein alignments as evidence to train SNAP (alongside de novo assembled RNA-Seq). I also plan on passing the protein alignments to a future Maker run as hints for SNAP / Augustus.

I've noticed that the Maker blastx "protein_match" feature, which I presume is a result of Maker trying to make the blastx HSPs contiguous to format as a reference for exonerate (this Maker run did have protein2genome turned on), tends to fuse tandem genes from the same gene family. See attached image.

The red regions highlight two de novo assembled transcripts which I aligned manually, from two genes that are homologous. The top track is the blastx "match_part" features, the bottom track is the blastx "protein_match" features. You can see that the protein_match fuses the two genes, using ~1000 bp in an intervening region that doesn't have blastx HSP support in the blastx "match_part" track. The trick seems to be that a single reference protein has blastx matches on both the left and right gene.

Clearly this isn't a good gene model to train SNAP with, but would this misannotation screw up the hints passed to pretrained SNAP / Augustus?

Is there any way to prevent this protein_match fusing of adjacent similar genes from happening? For species that are closer, I've set "eval_blastx" to be a lot more stringent (1e-50), and in that case the genes don't get fused (but, with that level of stringent search, it is more like an orthology search, rather than just annotating general protein similarity). I do have (rare) introns of ~1000 bp, so I wouldn't want to change the Maker "split_hit" parameter to be too low.

All the best,
-Tim

Timothy R. Fallon
PhD candidate
Laboratory of Jing-Ke Weng
Department of Biology
MIT
tfallon at mit.edu
-------------- next part --------------
A non-text attachment was scrubbed...
Name: protein_match_example.png
Type: image/png
Size: 142379 bytes
Desc: not available

From tfallon at mit.edu Fri Jun 16 09:07:14 2017
From: tfallon at mit.edu (Tim Fallon)
Date: Fri, 16 Jun 2017 11:07:14 -0400
Subject: [maker-devel] Database disk image is malformed error
Message-ID: <41AECDEF-7E2B-4143-BDBC-F10937AEDE3B@mit.edu>

Hi there,

I've been running MAKER in a two-stage way using MPI, to annotate a de novo insect genome. By two-stage, I mean that for stage 1 I have a lot of independent folders / MAKER runs (e.g. individual reference insect proteomes passed as FASTA with protein2genome=1), and then for stage 2, in a separate folder, I am concatenating all that evidence from stage 1 (using gff3_merge -o) and passing it as GFF parameters.

Stage 2 has been crashing. It takes a very long time to set up the SQLite DB from the GFFs (~24 hours, with 39 MPI CPUs), and then once it is all loaded it works for a couple of seconds, then crashes with things like this:

"DBD::SQLite::db selectcol_arrayref failed: database disk image is malformed at /lab/solexa_weng/testtube/maker_3.00_beta/bin/../lib/GFFDB.pm line 525."

I am passing a lot of evidence to stage 2, probably more than people typically pass (the GFFs together are 44GB, whereas the resulting *.db file is 95G).

Have you seen this error before? I'm thinking it could be one of a couple of possibilities:
1) Running up against SQLite size / concurrency constraints where the .db ends up being malformed due to MPI / passing too much evidence.
Solution -> Load GFFs without MPI, or load less evidence.
2) GFFs are malformed (they pass validation with GT). Solution -> Remove the malformed GFF evidence, although I haven't been able to track any malformed GFFs down.
3) Identifiers in the GFFs that are unique within a single file become non-unique once merged. Solution -> Manually rename IDs in the passed GFF files to be unique.

Thoughts?

All the best,
-Tim

Timothy R. Fallon
PhD candidate
Laboratory of Jing-Ke Weng
Department of Biology
MIT
tfallon at mit.edu

From aravindp at imcb.a-star.edu.sg Thu Jun 22 00:39:28 2017
From: aravindp at imcb.a-star.edu.sg (Aravind PRASAD)
Date: Thu, 22 Jun 2017 06:39:28 +0000
Subject: [maker-devel] Maker annotation of large scaffolds
Message-ID:

Hi All,

I'm trying to annotate a fish genome using the Maker pipeline. It could finish the annotation for most scaffolds, except 5 of them which are around 100M base pairs in size. The cluster at our institute has a time limit of 24 hrs per job, and these scaffolds could not be annotated within that time.

Can you please suggest any other way of finishing the annotation for large scaffolds?

I thought of chunking up the scaffolds to run, but I'm afraid that would split a gene in two.

Thanks for your time.

Regards,
Aravind PRASAD :: Research Officer :: Comparative and Medical Genomics Lab :: Institute of Molecular and Cell Biology (IMCB) :: Agency for Science, Technology and Research (A*STAR)
61 Biopolis Drive :: #5-04 Proteos :: Singapore 138673 :: DID (+65) 6586 9573 :: Fax (+65) 6779 1117 :: http://www.imcb.a-star.edu.sg/

Note: This message may contain confidential information. If this Email/Fax has been sent to you by mistake, please notify the sender and delete it immediately. Thank you.

From munholl at uwindsor.ca Thu Jun 22 09:43:22 2017
From: munholl at uwindsor.ca (Seth Munholland)
Date: Thu, 22 Jun 2017 11:43:22 -0400
Subject: [maker-devel] Maker annotation of large scaffolds
In-Reply-To:
References:
Message-ID:

I would think splitting could work if you generate a sufficient overlap, i.e. 1-100k, 50-150k, etc. Reassembling the annotations for the overlap regions may be tricky if you get conflicting annotations, though.

Seth Munholland, B.Sc.
Department of Biological Sciences
Rm. 304 Biology Building
University of Windsor
401 Sunset Ave. N9B 3P4
T: (519) 253-3000 Ext: 4755

From carsonhh at gmail.com Thu Jun 22 22:06:00 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 22 Jun 2017 22:06:00 -0600
Subject: [maker-devel] Maker annotation of large scaffolds
In-Reply-To:
References:
Message-ID:

If running under MPI, the only step that should take a long time would be a final clustering step (the clustering is not parallelized). It should run in well under 24 hours though, so perhaps it is a memory issue or a feature depth issue. You can try running the contig by itself and setting all the blast_depth parameters in maker_bopts.ctl to 10 to help with both. Otherwise, making a large overlap for subdivided contigs (50-100kb) should be enough. Alternatively, look for stretches of NNNNNN's in the contig and split on those.

--Carson

From carsonhh at gmail.com Thu Jun 22 22:15:09 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 22 Jun 2017 22:15:09 -0600
Subject: [maker-devel] Database disk image is malformed error
In-Reply-To: <41AECDEF-7E2B-4143-BDBC-F10937AEDE3B@mit.edu>
References: <41AECDEF-7E2B-4143-BDBC-F10937AEDE3B@mit.edu>
Message-ID: <925025C6-C48C-4E29-9A9F-A7CD50ECA211@gmail.com>

Don't use the GFF3 as input to the second stage. Use the original work directory, and just modify any parameters in the control file. MAKER will reuse old results and only delete things that require a rerun.
Using the GFF3 as input is just a way to reuse MAKER data when the work directory is no longer available, and in most cases you will only pass in the genes (and not the evidence in the GFF3).

Thanks,
Carson

From carsonhh at gmail.com Thu Jun 22 22:27:02 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 22 Jun 2017 22:27:02 -0600
Subject: [maker-devel] Maker protein match & tandem similar genes
In-Reply-To: <6AE52D93-3F7D-4389-BE7B-BD3033F3F316@mit.edu>
References: <6AE52D93-3F7D-4389-BE7B-BD3033F3F316@mit.edu>
Message-ID: <6DBD3F6E-C2DA-47AB-AF6B-DEA396789C79@gmail.com>

The protein_match features are the direct BLASTX results. Because of how BLAST works, if you have neighboring paralogs, it can place HSPs in both, so the final hit ends up being too large. The protein2genome feature is then the result of exonerate polishing these BLAST alignments (this will usually remove false merging and bad exon order). The protein2genome=1 option, on the other hand, just tells MAKER that you want to try to convert the exonerate hits directly into gene models (only do this for training and not for final annotation).

One way to drop the BLASTX results may be to filter the GFF3 results to keep only protein2genome features, pass those into protein_gff, and then turn off protein= for the next run. This forces the BLASTX results to be dropped. You may want to set the blast_depth parameters to something like 10 in maker_bopts.ctl before doing this, to trim per-locus evidence depth to 10 if you are using too much input data.

--Carson
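The filtering step suggested in Carson's reply (keep only protein2genome features from the GFF3 and pass them to protein_gff) might look roughly like the sketch below. It assumes that MAKER writes the exonerate-polished protein alignments with "protein2genome" in the source column (column 2) of its GFF3 output; that is worth checking against your own file first. The function name keep_source is made up for this example.

```python
def keep_source(gff_lines, source="protein2genome"):
    """Yield comment lines plus feature lines whose column 2 (source) equals `source`.

    Assumption: the features of interest are identified by the GFF3 source
    column. Any embedded ##FASTA section is dropped.
    """
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # stop before embedded sequence data
        if line.startswith("#"):
            yield line  # keep directives such as ##gff-version 3
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9 and cols[1] == source:
            yield line


if __name__ == "__main__":
    example = [
        "##gff-version 3\n",
        "scaf1\tblastx\tprotein_match\t10\t900\t.\t+\t.\tID=a\n",
        "scaf1\tprotein2genome\tprotein_match\t10\t400\t.\t+\t.\tID=b\n",
    ]
    for kept in keep_source(example):
        print(kept, end="")
```

The kept lines could then be written to a file and supplied through protein_gff, with protein= left empty for the next run, as described above.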

From carson.holt at genetics.utah.edu Thu Jun 22 22:31:58 2017
From: carson.holt at genetics.utah.edu (Carson Holt)
Date: Fri, 23 Jun 2017 04:31:58 +0000
Subject: [maker-devel] Request a favor regarding MAKER
In-Reply-To:
References:
Message-ID: <02EDBAC4-5338-4FE3-99AA-D98CF346A753@genetics.utah.edu>

Sorry for the slow reply. This message somehow got overlooked.

The lock is referring to a file lock. It usually means there is another active MAKER process that is trying to run in the same directory as your current MAKER process. This may mean you have problems with the MPI setup if using MAKER under MPI. Or, if you started MAKER multiple times simultaneously, you got a collision when both tried to work with the same data. Just kill all active MAKER processes and restart if that is the case. If it's an MPI issue, run MAKER with the -h flag added to the current MPI command you are using to run MAKER. If it prints the help message more than once, then the MPI communication ring is having an issue. This could be a problem with how you installed MAKER or how you installed MPI.

--Carson

> On Jun 9, 2017, at 2:39 AM, shaf wrote:
>
> Greetings,
> My name is Shaf and currently I'm using MAKER for my data. I did managed get some result using MAKER but i have problem with my storage.
>
> So I tried run maker on other directory with big space .
> As far as I know i already set my maker can be run anywhere.
>
> I installed my maker on /
> Then when i tried to run it on /media/nklee/2TB data/example$ maker ; i've got an error
>
> ERROR: The directory is locked. Perhaps by an instance of MAKER.
>
> --> rank=NA, hostname=Lee-Server
>
> I did checked it using nklee at Lee-Server:/media/nklee/2TB data$ maker
> and my maker is there.
>
> May I know how to solve this problem? Thank you in advance.
>
> Regards,
> Shaf

From tfallon at mit.edu Thu Jun 22 22:33:59 2017
From: tfallon at mit.edu (Tim Fallon)
Date: Fri, 23 Jun 2017 00:33:59 -0400
Subject: [maker-devel] Database disk image is malformed error
In-Reply-To: <925025C6-C48C-4E29-9A9F-A7CD50ECA211@gmail.com>
References: <41AECDEF-7E2B-4143-BDBC-F10937AEDE3B@mit.edu> <925025C6-C48C-4E29-9A9F-A7CD50ECA211@gmail.com>
Message-ID:

Hi Carson,

Thanks for the tip! It turned out that I needed to use the "-l" parameter for gff3_merge, to automatically rename the IDs when merging them, and also to pass the appropriate evidence in the merged GFF using the "Re-annotation Using MAKER Derived GFF3" parameters. I had been using the more general parameters further down (protein_gff, est_gff, etc.).

It seems to be working now, though I am still getting the hang of how to fix up misbehaving gene models.

All the best,
-Tim

Timothy R. Fallon
PhD candidate
Laboratory of Jing-Ke Weng
Department of Biology
MIT
tfallon at mit.edu
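For anyone who has to make identifiers unique by hand (possibility 3 in the original report) rather than relying on gff3_merge to rename them, a minimal sketch is given below. It assumes that prefixing the ID and Parent attributes with a per-run label is enough; other cross-referencing attributes, if your files use any, are not handled. The function name prefix_ids and the "runA_" label are made up for illustration.

```python
def prefix_ids(line, prefix):
    """Return a GFF3 line with `prefix` prepended to its ID and Parent values.

    Comment lines, blank lines and lines without nine columns are returned
    unchanged. Assumption: only ID and Parent need to be made unique.
    """
    if line.startswith("#") or not line.strip():
        return line
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return line
    attrs = []
    for attr in cols[8].split(";"):
        key, sep, value = attr.partition("=")
        if sep and key in ("ID", "Parent"):
            # Parent may hold a comma-separated list of identifiers.
            value = ",".join(prefix + v for v in value.split(","))
        attrs.append(key + sep + value)
    cols[8] = ";".join(attrs)
    return "\t".join(cols) + "\n"


if __name__ == "__main__":
    hit = "scaf1\tblastx\tprotein_match\t10\t90\t.\t+\t.\tID=hit1;Name=P1\n"
    print(prefix_ids(hit, "runA_"), end="")
```

Using a different prefix for each stage 1 folder keeps features from separate runs distinct after the files are concatenated.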

From tfallon at mit.edu Thu Jun 22 22:59:10 2017
From: tfallon at mit.edu (Tim Fallon)
Date: Fri, 23 Jun 2017 00:59:10 -0400
Subject: [maker-devel] Maker protein match & tandem similar genes
In-Reply-To: <6DBD3F6E-C2DA-47AB-AF6B-DEA396789C79@gmail.com>
References: <6AE52D93-3F7D-4389-BE7B-BD3033F3F316@mit.edu> <6DBD3F6E-C2DA-47AB-AF6B-DEA396789C79@gmail.com>
Message-ID: <7164F2B4-FD8A-466F-B7CA-A1BEEF98223F@mit.edu>

Hi Carson,

Thanks for the response! After sending my initial email, I did notice that this particular issue was warned about in the Campbell et al. 2014 Maker protocols paper. Perhaps future versions of the pipeline might have a workaround or warning for this presumably common issue. At least in my case, the genome I'm annotating has large introns, and also tandem gene clusters of homologous genes, so I've been unable to solve this issue entirely by changing existing parameters (e.g. split_hit), though perhaps exonerate / protein2genome direct gene annotation does handle it correctly.

Regarding protein2genome only being an intermediate stage: as I've been working towards a final annotation, I've actually been relying mostly on the protein2genome direct gene annotation. Although I have a trained Augustus that is presumably getting the hints from the evidence, my main target genes have been producing subtly wrong gene models (Augustus produced splice sites off by a handful of nucleotides, leading to unintended and unsupported amino acids in the protein). I also trained SNAP, but those predictions were worse than the Augustus predictions.

Do you have any tips for using the Evidence Modeler integration of the Maker 3.0.0 beta? That seems to be the best way to have the final gene models rely more on extrinsic evidence than on my mildly incorrect ab initio predictions. Or perhaps PASA is more appropriate for gene models that would strictly adhere to extrinsic de novo assembled transcript / predicted ORF evidence?

All the best,
-Tim

Timothy R. Fallon
PhD candidate
Laboratory of Jing-Ke Weng
Department of Biology
MIT
tfallon at mit.edu
Name: smime.p7s
Type: application/pkcs7-signature
Size: 1849 bytes
Desc: not available

From aravindp at imcb.a-star.edu.sg Fri Jun 23 02:25:18 2017
From: aravindp at imcb.a-star.edu.sg (Aravind PRASAD)
Date: Fri, 23 Jun 2017 08:25:18 +0000
Subject: [maker-devel] Maker annotation of large scaffolds

Hi All,

Thank you for your inputs. Currently, I'm not using the MPI version but running Maker in multiple instances. I previously tried to run the MPI version but it failed, even though MPI-Maker installed without issues.

Carson, can you please explain what exactly the blast_depth option does while running Maker?

Thank you all for your time!

Regards,
Aravind.

From: Carson Holt [mailto:carsonhh at gmail.com]
Sent: Friday, 23 June, 2017 12:06 PM
To: Seth Munholland
Cc: Aravind PRASAD; maker-devel at yandell-lab.org
Subject: Re: [maker-devel] Maker annotation of large scaffolds

If running under MPI, the only step that should take a long time would be a final clustering step (the clustering is not parallelized). It should run in well under 24 hours though, so perhaps it is a memory issue or a feature depth issue. You can try running the contig by itself and setting all the blast_depth parameters in maker_bopts.ctl to 10 to help both.

Otherwise making a large overlap for subdivided contigs (50-100kb) should be enough. Alternatively look for stretches of NNNNNN's in the contig and split on those.

--Carson

On Jun 22, 2017, at 9:43 AM, Seth Munholland wrote:

I would think splitting could work if you generate a sufficient overlap, e.g. 1-100k, 50-150k, etc. Reassembling the annotations for the overlap regions may be tricky if you get conflicting annotations though.

Seth Munholland, B.Sc.
Department of Biological Sciences
Rm. 304 Biology Building
University of Windsor
401 Sunset Ave. N9B 3P4
T: (519) 253-3000 Ext: 4755

On Thu, Jun 22, 2017 at 2:39 AM, Aravind PRASAD wrote:

Hi All,

I'm trying to annotate a fish genome using the Maker pipeline. It finished the annotation for most scaffolds except 5 of them, which are around 100M base pairs each. The cluster in our institute has a time limit of 24 hrs per job and these scaffolds could not be annotated within that time.
Can you please suggest any other way of finishing the annotation for large scaffolds?

I thought of chunking up the scaffolds to run, but I'm afraid that would split a gene into two.
Thanks for your time.

Regards,
Aravind PRASAD :: Research Officer :: Comparative and Medical Genomics Lab :: Institute of Molecular and Cell Biology (IMCB) :: Agency for Science, Technology and Research (A*STAR)
61 Biopolis Drive :: #5-04 Proteos :: Singapore 138673 :: DID (+65) 6586 9573 :: Fax (+65) 6779 1117 :: http://www.imcb.a-star.edu.sg/

Note: This message may contain confidential information. If this Email/Fax has been sent to you by mistake, please notify the sender and delete it immediately. Thank you.

-------------- next part --------------
An HTML attachment was scrubbed...
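
The two splitting strategies suggested in this thread (overlapping pieces, or cutting at stretches of N) can be sketched roughly as follows. This is only an illustration in Python: the function names, the window and overlap sizes, and the minimum gap length are assumptions made for the example and are not part of MAKER. Merging annotations from the overlap regions afterwards, which Seth notes may be tricky, is not covered.

```python
import re
from typing import Iterator, List, Tuple


def overlapping_windows(length: int, window: int, overlap: int) -> Iterator[Tuple[int, int]]:
    """Yield 0-based, half-open (start, end) windows that overlap by `overlap` bases.

    Hypothetical helper: the sizes are chosen by the caller, e.g. an overlap
    of 50-100 kb as suggested in the thread.
    """
    if window <= overlap:
        raise ValueError("window must be larger than overlap")
    if length <= 0:
        return
    start = 0
    while True:
        end = min(start + window, length)
        yield start, end
        if end == length:
            break
        # Step back by `overlap` so a gene near the cut is whole in one window.
        start = end - overlap


def split_on_gaps(seq: str, min_gap: int = 6) -> List[Tuple[int, int]]:
    """Return (start, end) coordinates of the pieces between runs of >= min_gap N's.

    `min_gap` is an assumed threshold; the thread only says "stretches of N".
    """
    pieces = []
    prev = 0
    for match in re.finditer(r"[Nn]{%d,}" % min_gap, seq):
        if match.start() > prev:
            pieces.append((prev, match.start()))
        prev = match.end()
    if prev < len(seq):
        pieces.append((prev, len(seq)))
    return pieces


if __name__ == "__main__":
    # Toy numbers so the output is easy to read; real scaffolds are ~100 Mb.
    print(list(overlapping_windows(length=250, window=100, overlap=50)))
    print(split_on_gaps("ACGTNNNNNNNNACGT"))
```

Coordinates are 0-based and half-open so they can be used directly as Python slices; they would need converting to 1-based positions before being compared with GFF3 output.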
From patrick.tranvan at unil.ch Mon Jun 26 03:48:23 2017
From: patrick.tranvan at unil.ch (Patrick Tran Van)
Date: Mon, 26 Jun 2017 09:48:23 +0000
Subject: [maker-devel] Advice on my pipeline
References: <6b029690bace4d3fbae77c0bb1bddce8@prdexch02.ad.unil.ch>
Message-ID: <1498470630221.84642@unil.ch>

Thanks for your answer.

1) Do you think that adding Augustus training in addition to SNAP at steps 3 and 5 would add more confidence (instead of adding Augustus only for the final round)?
I ask because I am using autoAug for this and it takes a while to compute.

2) I don't see the option 'avoid_est_fusion=1'. I tried to add it but I got this error:

WARNING: Invalid option 'avoid_est_fusion' in control file maker_opts.ctl

(I am using v 2.31.8)

Patrick Tran Van

Groups Chapuisat, Robinson-Rechavi & Schwander
Department of Ecology and Evolution
University of Lausanne
Le Biophore
CH-1015 Lausanne
Switzerland
Office 3206

________________________________
From: Carson Holt
Sent: Monday, June 5, 2017 8:29 PM
To: Patrick Tran Van
Cc: maker-devel at yandell-lab.org
Subject: Re: [maker-devel] Advice on my pipeline

Your plan sounds good. A couple of related notes.

Insect genomes tend to have high gene density, so gene merging will be the primary difficulty. You can avoid merging of mRNA-seq evidence by using options like jaccard_clip in Trinity. Then use avoid_est_fusion=1 inside of MAKER.

Also it is more convenient to do each run in the same directory rather than supplying the previous run as GFF3 input. MAKER will automatically recycle previous results archived in the run directory when you do this. Using the maker_gff option is really more for getting data into the run from jobs performed a long time ago (so they can't be run in the same directory).

-Carson

-------------- next part --------------
An HTML attachment was scrubbed...
From carsonhh at gmail.com Mon Jun 26 15:38:19 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 26 Jun 2017 15:38:19 -0600
Subject: [maker-devel] Advice on my pipeline
In-Reply-To: <1498470630221.84642@unil.ch>
References: <6b029690bace4d3fbae77c0bb1bddce8@prdexch02.ad.unil.ch> <1498470630221.84642@unil.ch>
Message-ID: <696C51C6-5606-4ECB-A8B8-9C077182FFFA@gmail.com>

Sorry, the option is --> correct_est_fusion

It is in the maker_opts.ctl file.

I would use both SNAP and Augustus on a few large contigs, then review the results manually. If one of them is not behaving well, then drop it. If both behave well (i.e. correlate well with evidence alignments) then keep them both.

--Carson

-------------- next part --------------
An HTML attachment was scrubbed...

From carsonhh at gmail.com Mon Jun 26 15:48:03 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 26 Jun 2017 15:48:03 -0600
Subject: [maker-devel] Maker annotation of large scaffolds
Message-ID: <77736B2B-643C-417F-AB5E-5C709735E3AF@gmail.com>

All results are kept and placed into the final GFF3 unless you set blast_depth. Basically, most alignments are redundant and can be thrown away early in the process. But MAKER does not do this by default because most users tend to want to see all evidence. MAKER only uses 10 alignments to build its calculations anyway. The rest are just kept for reference. But the cost of keeping the other alignments around can be substantial if you are in a region with deep evidence depth (I've seen regions with 5,000 - 10,000 evidence alignments for some datasets). So if you set blast_depth, it tells MAKER you are OK with throwing out the extra depth early (MAKER still parses all alignments; it just throws extra ones away as it determines they are not useful or are redundant).
This saves a lot of time and RAM downstream at the cost of losing the alignments in the report. A depth of 10 means that no more than 10 alignments per data source will be kept per locus.

--Carson

-------------- next part --------------
An HTML attachment was scrubbed...

From carsonhh at gmail.com Mon Jun 26 15:48:46 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 26 Jun 2017 15:48:46 -0600
Subject: [maker-devel] Maker annotation of large scaffolds

Also you can run MPI within a single node and not across nodes.
This will still give a performance bonus equal to the MPI process count.

--Carson

-------------- next part --------------
An HTML attachment was scrubbed...

From carsonhh at gmail.com Mon Jun 26 16:00:24 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 26 Jun 2017 16:00:24 -0600
Subject: [maker-devel] Maker protein match & tandem similar genes
In-Reply-To: <7164F2B4-FD8A-466F-B7CA-A1BEEF98223F@mit.edu>
References: <6AE52D93-3F7D-4389-BE7B-BD3033F3F316@mit.edu> <6DBD3F6E-C2DA-47AB-AF6B-DEA396789C79@gmail.com> <7164F2B4-FD8A-466F-B7CA-A1BEEF98223F@mit.edu>

Augustus uses an HMM with scoring bonuses for evidence match.
If a difference in the assembly breaks the ORF anywhere in the transcript relative to the evidence, or removes high-scoring transcript start/stop sequences, then Augustus will add/skip exons or trim/extend transcripts to capture what scoring bonuses it can as best it can. So wherever you see Augustus behaving weirdly, you likely have something off in the assembly (a small stretch of NN's or single-basepair duplications/deletions that affect the ORF and scoring model). So what Augustus produces is the best-fit gene model that hops around assembly anomalies while still producing a canonical model. In areas like the ones I describe above, EVM refuses to produce any model. So you can experiment with the EVM options in MAKER3, but what you may find is that problem regions tend to get no models with EVM.

I believe using the pred_gff trick I mentioned previously may be the easiest workaround.

Also make sure to prefilter mRNA-seq evidence to avoid transcript joining (Trinity has a jaccard_clip option which can help), because if you are getting transcript joining in proteins, you are almost certain to get it in transcript evidence as well.

--Carson

> On Jun 22, 2017, at 10:59 PM, Tim Fallon wrote:
>
> Hi Carson,
>
> Thanks for the response! After sending my initial email, I did notice this particular issue was warned about in the Campbell et al. 2014 Maker protocols paper. Perhaps future versions of the pipeline might have a workaround or warning for this presumably common issue. At least in my case, the genome I'm annotating has large introns, and also tandem gene clusters of homologous genes, so I've been unable to solve this issue entirely by changing existing parameters (e.g. split_hit), though perhaps exonerate / protein2genome direct gene annotation does handle it correctly.
>
> Regarding protein2genome only being an intermediate stage: as I've been working towards a final annotation, I've actually been mostly relying on the protein2genome direct gene annotation, because although I have a trained Augustus that is presumably getting the hints from the evidence, my main target genes have been producing subtly wrong gene models (Augustus-produced splice sites off by a handful of nucleotides, leading to unintended and unsupported amino acids in the protein). I also trained SNAP, but those predictions were worse than the Augustus predictions.
>
> Do you have any tips for using the Evidence Modeler integration of the Maker 3.0.0 beta? That seems to be the best way to have the final gene models rely more on extrinsic evidence over my mildly incorrect ab-initio predictions. Or perhaps PASA is more appropriate for gene models that would strictly adhere to extrinsic de novo assembled transcript / predicted ORF evidence?
>
> All the best,
> -Tim
>
>> On Jun 23, 2017, at 12:27 AM, Carson Holt wrote:
>>
>> The protein_match features are the direct BLASTX results. Because of how BLAST works, if you have neighboring paralogs, it can place HSPs in both. So the final hit ends up being too large. The protein2genome feature is then the result of exonerate polishing these blast alignments (this will usually remove false merging and bad exon order). The protein2genome=1 option on the other hand just tells MAKER that you want to try and convert the exonerate hits directly into gene models (only do this for training and not final annotation). One way to drop the BLASTX results may be to filter the GFF3 results to keep only protein2genome features, pass those into protein_gff, and then turn off protein= for the next run. This forces the blastx results to be dropped. You may want to set blast_depth parameters to something like 10 in maker_bopts.ctl before doing this to trim per-locus evidence depth to 10 if you are using too much input data.
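
The GFF3 filtering step described in the paragraph above could look roughly like this. It is a minimal sketch in Python, not a MAKER utility: the function name is hypothetical, and it assumes the exonerate-polished alignments carry "protein2genome" in the GFF3 source column (column 2), which should be checked against your own output before the filtered file is passed to protein_gff.

```python
from typing import Iterable, Iterator


def keep_source(gff_lines: Iterable[str], source: str = "protein2genome") -> Iterator[str]:
    """Yield GFF3 feature lines whose source column (column 2) equals `source`.

    Assumption: the features of interest are labelled with `source` in
    column 2. Comment and directive lines are skipped.
    """
    for line in gff_lines:
        if line.startswith("##FASTA"):
            # Per the GFF3 specification, embedded sequence follows this directive.
            break
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 9 and cols[1] == source:
            yield line


if __name__ == "__main__":
    # Made-up example records, for illustration only.
    example = [
        "##gff-version 3\n",
        "scaf1\tblastx\tprotein_match\t100\t900\t50\t+\t.\tID=a\n",
        "scaf1\tprotein2genome\tprotein_match\t120\t880\t300\t+\t.\tID=b\n",
        "scaf1\tprotein2genome\tmatch_part\t120\t400\t300\t+\t.\tID=c;Parent=b\n",
        "##FASTA\n",
        ">scaf1\n",
    ]
    for kept in keep_source(example):
        print(kept, end="")
```

In practice the input would be an open file handle on the merged GFF3 from the previous run, and the kept lines would be written to a new file together with a `##gff-version 3` header line.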
>> >> ?Carson >> >> >> >>> On Jun 13, 2017, at 11:35 AM, Tim Fallon > wrote: >>> >>> Hi there, >>> >>> I am aligning reference proteins to an insect genome through Maker, in preparation for using the gene models from the protein alignments as evidence to train SNAP (alongside de-novo assembled RNA-Seq). I also plan on passing the protein alignments to a future Maker run as hints for SNAP / Augustus. >>> >>> I?ve noticed that the maker blastx "protein_match? feature, which I presume is a result of Maker trying to make the blastx HSPs contiguous to format as a reference for exonerate (this Maker run did have protein2genome turned on), tends to fuse tandem genes from the same gene family. See attached image. >>> >>> The red regions highlight two de novo assembled transcripts which I aligned manually, from two genes that are homologous. The top track is the blastx ?match_part? features, the bottom track is the blastx ?protein_match? features. You can see that the protein_match fuses the two genes, using ~1000 bp in an intervening region, that doesn?t have blastx HSP support in the blastx ?match_part? track. The trick seems to be that a single reference protein, has blastx matches on both the left and right gene. >>> >>> Cleary this isn?t a good gene model to train SNAP with, but would this misannotation screw up the hints passed to pretrained SNAP / Augustus? >>> >>> Is there anyway to prevent this protein_match fusing of adjacent similar genes from happening? For species that are closer, I?ve set the ?eval_blastx? to be a lot higher (1e-50), and in that case the genes don?t get fused (but, with that level of stringent search, it is more like an orthology search, rather than just annotating general protein similarity). I do have (rare) introns ~1000 bp, so I wouldn?t want to change the Maker ?split_hit? parameter to be too low. >>> >>> All the best, >>> -Tim >>> >>> Timothy R. 
Fallon >>> PhD candidate >>> Laboratory of Jing-Ke Weng >>> Department of Biology >>> MIT >>> >>> tfallon at mit.edu >>> >>> >>> >>> >>> _______________________________________________ >>> maker-devel mailing list >>> maker-devel at box290.bluehost.com >>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org >> > > Timothy R. Fallon > PhD candidate > Laboratory of Jing-Ke Weng > Department of Biology > MIT > > tfallon at mit.edu -------------- next part -------------- An HTML attachment was scrubbed... URL: From aravindp at imcb.a-star.edu.sg Tue Jun 27 01:07:50 2017 From: aravindp at imcb.a-star.edu.sg (Aravind PRASAD) Date: Tue, 27 Jun 2017 07:07:50 +0000 Subject: [maker-devel] Maker annotation of large scaffolds In-Reply-To: <77736B2B-643C-417F-AB5E-5C709735E3AF@gmail.com> References: <77736B2B-643C-417F-AB5E-5C709735E3AF@gmail.com> Message-ID: Thank you Carson for the explanation. The issue is now resolved for the annotation of large scaffolds with the use of MPI Maker as well as changing the blast_depth option. Aravind Prasad. From: Carson Holt [mailto:carsonhh at gmail.com] Sent: Tuesday, 27 June, 2017 5:48 AM To: Aravind PRASAD Cc: Seth Munholland; maker-devel at yandell-lab.org Subject: Re: [maker-devel] Maker annotation of large scaffolds All results are kept and placed into the final GFF3 unless you set blast_depth. Basically, most alignments are redundant and can be thrown away early in the process. But maker does not do this by default because most users tend to want to see all evidence. MAKER only uses 10 alignments to build it?s calculations anyways. The rest are just kept for reference. But the cost of keeping the other alignments around can be substantial if you are in a region with deep evidence depth (I?ve seen regions with 5,000 - 10,000 evidence alignments for some datasets). 
So if you set blast_depth, it tells MAKER you are ok with throwing out the extra depth early (MAKER still parses all alignments it just throws extra ones away as it determines they are not useful or are redundant). This saves a lot of time and RAM downstream at the cost of losing the alignments in the report. A depth of 10 means that no more than 10 alignments per data source will be kept per locus. ?Carson On Jun 23, 2017, at 2:25 AM, Aravind PRASAD > wrote: Hi All, Thank you for your inputs. Currently, I?m not using the MPI version but running Maker in multiple instances. Previously, I tried to run the MPI version but failed. Though the installation had no issues with MPI-Maker. Carson, Can you please explain what exactly does the blast_depth option does while running Maker? Thank you all for your time! Regards, Aravind. From: Carson Holt [mailto:carsonhh at gmail.com] Sent: Friday, 23 June, 2017 12:06 PM To: Seth Munholland Cc: Aravind PRASAD; maker-devel at yandell-lab.org Subject: Re: [maker-devel] Maker annotation of large scaffolds If running under MPI, the only step that should take a long time would be a final clustering step (the clustering is not parallelized). It should run in well under 24 hours though, so perhaps it is a memory issue or a feature depth issue. You can try running the contig by itself and setting all the bast_depth parameters in maker_bopts.ctl to 10 to help both. Otherwise making a large overlap for subdivided contigs (50-100kb) should be enough. Alternatively look for streches of NNNNNN?s in the contig and split on those. ?Carson On Jun 22, 2017, at 9:43 AM, Seth Munholland > wrote: I would think splitting could work if you generate a sufficient overlap. IE 1-100k, 50-150k, etc. Reassembling the annotations for the overlap regions may be tricky if you get conflicting annotations though. Seth Munholland, B.Sc. Department of Biological Sciences Rm. 304 Biology Building University of Windsor 401 Sunset Ave. 
N9B 3P4 T: (519) 253-3000 Ext: 4755 On Thu, Jun 22, 2017 at 2:39 AM, Aravind PRASAD > wrote: Hi All, I?m trying to annotate a fish genome using Maker pipeline. It could finish the annotation for maximum scaffolds except 5 of them which are of size around 100M base pairs. The current clusters in our institute has a time limit of 24hrs for a job and these scaffolds could not be annotated with in that time. Can you please suggest any other way of finishing the annotation for large scaffolds? I thought of chunking up the scaffolds to run, but, I?m afraid that would split a gene into two. Thanks for your time. Regards, Aravind PRASAD :: Research Officer :: Comparative and Medical Genomics Lab :: Institue of Molecular and Cell Biology (IMCB) :: Agency for Science, Technology and Research (A*STAR) 61 Biopolis Drive :: #5-04 Proteos :: Singapore 138673:: DID (+65) 6586 9573 :: Fax (+65) 6779 1117 :: http://www.imcb.a-star.edu.sg/ Note: This message may contain confidential information. If this Email/Fax has been sent to you by mistake, please notify the sender and delete it immediately. Thank you. _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.com http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.com http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org Note: This message may contain confidential information. If this Email/Fax has been sent to you by mistake, please notify the sender and delete it immediately. Thank you. Note: This message may contain confidential information. If this Email/Fax has been sent to you by mistake, please notify the sender and delete it immediately. Thank you. -------------- next part -------------- An HTML attachment was scrubbed... 
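
To make the effect of blast_depth concrete, the following Python sketch caps the number of alignments kept per data source at each locus, as Carson describes for a depth of 10. It is only an illustration under stated assumptions: the `Alignment` record, the locus labels, and the choice to rank by score are all invented for the example. The thread does not say how MAKER decides which alignments are redundant, so this should not be read as MAKER's actual selection logic.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Alignment(NamedTuple):
    """Hypothetical record for one evidence alignment."""
    locus: str    # label of the locus the alignment falls in
    source: str   # data source, e.g. "blastx" or "blastn"
    name: str     # identifier of the aligned sequence
    score: float  # alignment score (higher is better)


def cap_depth(alignments: List[Alignment], depth: int) -> List[Alignment]:
    """Keep at most `depth` alignments per (locus, source) pair.

    Assumption: the best-scoring alignments are the ones worth keeping.
    """
    groups: Dict[Tuple[str, str], List[Alignment]] = defaultdict(list)
    for aln in alignments:
        groups[(aln.locus, aln.source)].append(aln)

    kept: List[Alignment] = []
    for members in groups.values():
        best = sorted(members, key=lambda a: a.score, reverse=True)[:depth]
        kept.extend(best)
    return kept


if __name__ == "__main__":
    # 25 protein hits and 3 transcript hits at the same locus.
    evidence = [Alignment("locus1", "blastx", f"prot{i}", float(i)) for i in range(25)]
    evidence += [Alignment("locus1", "blastn", f"est{i}", float(i)) for i in range(3)]
    trimmed = cap_depth(evidence, depth=10)
    print(len(evidence), "alignments before,", len(trimmed), "after")
```

The cap applies per data source, so the three transcript hits survive untouched while the 25 protein hits are cut to 10; this mirrors "no more than 10 alignments per data source will be kept per locus".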
From john at hovedpuden.dk Wed Jun 28 02:54:40 2017
From: john at hovedpuden.dk (John Damm Sørensen)
Date: Wed, 28 Jun 2017 10:54:40 +0200
Subject: [maker-devel] maker with MPI and perl using threads
Message-ID: <9218f472-d9e4-179c-c712-ddf09e97d8fc@hovedpuden.dk>

Hello,

Recently I assisted one of my customers with solving problems running maker using MPI.

It seems that the main reason for the trouble was maker not initializing the MPI environment for thread-safe execution.

In the MPI.pm module you call MPI_Init, whereas for a threaded environment you should call MPI_Init_thread.

I think it would be a good idea to detect whether the user's Perl has thread support and init MPI accordingly, or clearly state that maker is for unthreaded Perl only.

During the debugging we also found that it was beneficial to have the latest mxm.c installed:

https://community.mellanox.com/thread/3439

Best Regards

John Damm Sørensen
IT consultant

From carsonhh at gmail.com Thu Jun 29 14:43:21 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 29 Jun 2017 14:43:21 -0600
Subject: [maker-devel] maker with MPI and perl using threads
In-Reply-To: <9218f472-d9e4-179c-c712-ddf09e97d8fc@hovedpuden.dk>
References: <9218f472-d9e4-179c-c712-ddf09e97d8fc@hovedpuden.dk>

MAKER doesn't use threads. It uses forks. There are several reasons for this, including that Perl already requires you to use hidden fork operations every time you call system() or open(). So trying to get forks, threads, and MPI together is just a mess. So we stuck with just forks and MPI. With MPI flavors with direct InfiniBand support you can get weird errors because of how forks affect registered memory when the MPI flavor uses OpenIB libraries. There is also another issue with how Perl wraps malloc calls that affects registered memory.
The best way around both these issues is to disable direct infiniband support, and then set the MPI flavor to use -tcp over the ip-over-infiniband virtual adaptor (usually ib0) or use the ethernet adapter. Sometime last year after a system update we also got an OpenMPI error (only on CentOS6) that referred to a file used for Perl threads (which should not even be in use since we are using forks). We worked around that issue by compiling Perl without thread support so that the library couldn't be called.

If you were using OpenMPI and were seeing a reference to Perl threads, then the error you saw may have been related to the latter one I mentioned.

I have tried the MPI_Init_thread option before and had run into issues because of it without it helping any of the previously mentioned fork related issues. But that was some time ago, so I could try it as a solution to the last issue I mentioned rather than installing a no-thread version of perl (if I can ever replicate the error because it went away when we updated from CentOS kernel 6 to kernel 7).

Thanks,
Carson

> On Jun 28, 2017, at 2:54 AM, John Damm Sørensen wrote:
>
> Hello,
>
> Recently I assisted one of my customers with problems solving maker using MPI.
>
> It seems that the main reason for the trouble was maker not initializing the MPI environment for thread save execution.
>
> In the MPI.pm module you call MPI_Init whereas for a threaded environment you should call MPI_Init_thread.
>
> I think it would be a good idea to detect whether the users Perl is with thread support and init MPI accordingly or clearly state that maker is for unthreaded Perl only.
>
> During the debugging we also found that it was beneficial to have the latest mxm.c installed:
>
> https://community.mellanox.com/thread/3439
>
>
> Best Regards
>
> John Damm Sørensen
>
> IT consultant

From carsonhh at gmail.com Thu Jun 29 14:56:46 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 29 Jun 2017 14:56:46 -0600
Subject: [maker-devel] maker with MPI and perl using threads
In-Reply-To:
References: <9218f472-d9e4-179c-c712-ddf09e97d8fc@hovedpuden.dk>
Message-ID:

Also, when you see calls like threads->new() in the maker script, it is creating a fork and not a thread. It's a convenience feature activated by the "use forks" line at the top of the script. It allows you to use thread-like syntax when working with forks.

--Carson

> On Jun 29, 2017, at 2:43 PM, Carson Holt wrote:
>
> MAKER doesn't use threads. It uses forks. There are several reasons for this, including that Perl already requires you to use hidden fork operations every time you call system() or open(). So trying to get forks, threads, and MPI together is just a mess. So we stuck with just forks and MPI. With MPI flavors with direct infiniband support you can get weird errors because of how forks affect registered memory when the MPI flavor uses OpenIB libraries. There is also another issue with how perl wraps malloc calls that affects registered memory. The best way around both these issues is too disable direct infiniband support, and then set the MPI flavor to use -tcp over the ip-over-infiniband virtual adaptor (usually ib0) or use the ethernet adapter. Sometime last year after a system update we also got an OpenMPI error (only on CentOS6) that referred to a file used for Perl threads (which should not even be in use since we are using forks).
We worked around that issue by compile Perl without thread support so that the library couldn't be called. > > If you were using OpenMPI and were seeing a reference to Perl threads, then the error you saw may have been related to the latter one I mentioned. > > I have tried the MPI_Init_thread option before and had run into issues because of it without it helping any of the previously mentioned fork related issues. But that was some time ago, so I could try it as a solution to the last issue I mentioned rather than installing a no-thread version of perl (if I can ever replicate the error because it went away when we updated from CentOS kernel 6 to kernel 7). > > Thanks, > Carson > > >> On Jun 28, 2017, at 2:54 AM, John Damm S?rensen wrote: >> >> Hello, >> >> Recently I assisted one of my customers with problems solving maker using MPI. >> >> It seems that the main reason for the trouble was maker not initializing the MPI environment for thread save execution. >> >> In the MPI.pm module you call MPI_Init whereas for a threaded environment you should call MPI_Init_thread. >> >> I think it would be a good idea to detect whether the users Perl is with thread support and init MPI accordingly or clearly state that maker is for unthreaded Perl only. 
>>
>> During the debugging we also found that it was beneficial to have the latest mxm.c installed:
>>
>> https://community.mellanox.com/thread/3439
>>
>>
>> Best Regards
>>
>> John Damm Sørensen
>>
>> IT consultant

From qlian003 at ucr.edu Fri Jun 30 13:30:19 2017
From: qlian003 at ucr.edu (Qihua Liang)
Date: Fri, 30 Jun 2017 12:30:19 -0700
Subject: [maker-devel] Possible ways to improve annotated gene numbers
Message-ID: <82A335D3-085B-414E-802C-8E2918EA7EA0@ucr.edu>

Dear Maker Development Team,

Hi, I am using Maker for annotation and BUSCO to evaluate the completeness. For de novo predictions, I am using Augustus, GeneMark, and SNAP, and the annotated proteins have completeness of ~80%, ~50%, ~50% respectively. When I cat all de novo annotated proteins of these three tools, the completeness is much higher, at ~92%. But for all.maker.proteins.fasta, the completeness is only ~80%.

1. Does this mean that some proteins annotated by Augustus/GeneMark/SNAP are not included in the file all.maker.proteins.fasta? Is it because such excluded proteins do not have hits with the EST evidence?

2. To achieve a higher BUSCO completeness, what possible ways can be used? Including more EST evidence from other species?

Thank you
Qihua

From Patrick.TranVan at unil.ch Fri Jun 2 03:56:30 2017
From: Patrick.TranVan at unil.ch (Patrick Tran Van)
Date: Fri, 2 Jun 2017 09:56:30 +0000
Subject: [maker-devel] Advice on my pipeline
Message-ID: <6b029690bace4d3fbae77c0bb1bddce8@prdexch02.ad.unil.ch>

Hello,

This is my first time running Maker for an insect genome annotation.
I have found various resources and tried to build a consensus from them. I am looking for your thoughts and advice about my pipeline: whether I can improve something, or whether I am doing anything useless.

What I have:
- RNA evidence: transcriptome
- Protein evidence: swissprot/uniprot + busco protein set of insect
- Cegma and busco results of my genome

1) Train SNAP with CEGMA

2) Run (run A) maker with repeat masking with transcript, protein, the new SNAP file (from step 1) and augustus file (from busco).

3) Create SNAP model from run A.

4) Run (run B) with the new SNAP (done at step 3) with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (maker_gff=run_A.gff), turn off repeat masking (rm_pass=1), and use previous mapping results (altest_pass=1 and protein_pass=1).

5) Create SNAP model from run B.

6) Run (run C) with the new SNAP (done at step 5) with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (maker_gff=run_B.gff), turn off repeat masking (rm_pass=1), and use previous mapping results (altest_pass=1 and protein_pass=1).

7) Create SNAP model from run C AND Create Augustus gene model from run C

8) Run (run D) with the new SNAP (done at step 7) + AUGUSTUS file (step 7) with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (maker_gff=run_C.gff), turn off repeat masking (rm_pass=1), and use previous mapping results (altest_pass=1 and protein_pass=1). + Use keep_preds=1

Does it seem coherent?

Cheers,

Patrick Tran Van

Groups Chapuisat, Robinson-Rechavi & Schwander
Department of Ecology and Evolution
University of Lausanne
Le Biophore
CH-1015 Lausanne
Switzerland
Office 3206
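The run B settings in step 4 could be applied to a copy of the control file with something like the sketch below. It assumes the control file consists of `key=value #comment` lines; the function name, the sample control text and the SNAP HMM file name are hypothetical, and only the option names and values quoted in step 4 come from the message.

```python
import re


def apply_overrides(ctl_text, overrides):
    """Return ctl_text with matching `key=value` lines updated from overrides.

    Trailing `#comment` text is kept; keys that never appear are appended.
    """
    remaining = dict(overrides)
    out = []
    for line in ctl_text.splitlines():
        match = re.match(r"^(\w+)=([^#]*)(#.*)?$", line)
        if match and match.group(1) in remaining:
            key = match.group(1)
            comment = match.group(3) or ""
            line = f"{key}={remaining.pop(key)} {comment}".rstrip()
        out.append(line)
    # Options that were not present in the file are added at the end.
    out.extend(f"{key}={value}" for key, value in remaining.items())
    return "\n".join(out)


# Values as listed for run B in step 4; the SNAP HMM name is a placeholder.
RUN_B = {
    "maker_gff": "run_A.gff",
    "est2genome": 0,
    "protein2genome": 0,
    "rm_pass": 1,
    "altest_pass": 1,
    "protein_pass": 1,
    "snaphmm": "run_A.snap.hmm",  # hypothetical file name
}

# Hypothetical excerpt of a control file, for illustration only.
SAMPLE_CTL = "\n".join([
    "#-----Re-annotation",
    "maker_gff= #previous MAKER GFF3 file",
    "rm_pass=0",
    "est2genome=1 #infer predictions from ESTs",
])

if __name__ == "__main__":
    print(apply_overrides(SAMPLE_CTL, RUN_B))
```

Runs C and D would differ only in the GFF and HMM file names passed in, so the same dictionary can be copied and adjusted per run.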
From carson.holt at genetics.utah.edu Mon Jun 5 12:24:47 2017
From: carson.holt at genetics.utah.edu (Carson Holt)
Date: Mon, 5 Jun 2017 18:24:47 +0000
Subject: [maker-devel] Plant genome annotation
In-Reply-To:
References:
Message-ID: <5DD47274-C5FA-404D-A7EC-AADE0325EA03@genetics.utah.edu>

MAKER wiki -> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014

Book chapter on MAKER protocol -> http://www.yandell-lab.org/publications/pdf/maker_current_protocols.pdf

Mailing list -> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org

Searchable archive of common maker related questions -> https://groups.google.com/forum/#!forum/maker-devel

--Carson

On Jun 5, 2017, at 8:18 AM, Muhammad Arslan wrote:

Dear Carson,

I am writing this email to ask you a favor from you regarding the usage of Maker-P. I want to use the application for plant genome annotation however has very little knowledge of doing so! Is there any step-by-step tutorial available for doing so?

I would be very thankful to you!

Best regards

--
--------------------------------------------------------------------------------------------
Muhammad Arslan
PhD Student / Guest Scientist
Department of Environmental Biotechnology
Helmholtz Centre for Environmental Research - UFZ
Permoserstraße 15, 04318 Leipzig, Germany
Phone +49,341,235 1696, muhammad.arslan at ufz.de, www.ufz.de
Registered Office / Registered Office: Leipzig
Register court / Registration Office: Amtsgericht Leipzig
Commercial register Nr./Trade Register No .: B 4703
Chairman / Chairman of the Supervisory Board: MinDirig Wilfried Kraus
Scientific Director / Scientific Managing Director: Prof. Georg Teutsch
Administrative Managing Director / Administrative Managing Director: Prof. Dr.
Heike Grassmann
--------------------------------------------------------------------------------------------
SAVE PAPER - Please do not print this e-mail unless absolutely necessary

From carsonhh at gmail.com Mon Jun 5 12:29:57 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 5 Jun 2017 12:29:57 -0600
Subject: [maker-devel] Advice on my pipeline
In-Reply-To: <6b029690bace4d3fbae77c0bb1bddce8@prdexch02.ad.unil.ch>
References: <6b029690bace4d3fbae77c0bb1bddce8@prdexch02.ad.unil.ch>
Message-ID:

Your plan sounds good. A couple of related notes. Insect genomes tend to have high gene density, so gene merging will be the primary difficulty. You can avoid merging of mRNA-seq evidence by using options like jaccard_clip in Trinity. Then use avoid_est_fusion=1 inside of MAKER.

Also it is more convenient to do each run in the same directory rather than supplying the previous run as GFF3 input. MAKER will automatically recycle previous results archived in the run directory when you do this. Using the maker_gff option is really more for getting data into the run from jobs performed a long time ago (so they can't be run in the same directory).

--Carson

> On Jun 2, 2017, at 3:56 AM, Patrick Tran Van wrote:
>
> Hello,
>
> This is my first time running Maker for an insect genome annotation.
>
> I have found various resources and tried to make a consensus, I am looking for your thoughts and advices about my pipeline, if I can improve something or doing useless things:
>
>
> What I have:
> - RNA evidence: transcriptome
> - Proteine evidence: swissprot/uniprot + busco protein set of insect
> - Cegma and busco results of my genome
>
>
> 1) Train SNAP with CEGMA
>
> 2) Run (run A) maker with repeat masking with transcript, protein, the new SNAP file (from step 1) and augustus file (from busco).
>
> 3) Create SNAP model from run A.
> > 4) Run (run B ) with the new SNAP (done at step 3) with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (maker_gff=run_A.gff), turn off repeat masking (rm_pass=1), and use previous mapping results (altest_pass=1 and protein_pass=1). > > 5) Create SNAP model from run B. > > 6) Run (run C) with the new SNAP (done at step 5) with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (maker_gff=run_B.gff), turn off repeat masking (rm_pass=1), and use previous mapping results (altest_pass=1 and protein_pass=1). > > 7) Create SNAP model from run C AND Create Augustus gene model from run C > > 8) Run (run D) with the new SNAP (done at step 7) + AUGUSTUS file (step 7) with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (maker_gff=run_C.gff), turn off repeat masking (rm_pass=1), and use previous mapping results (altest_pass=1 and protein_pass=1). + Use keep_preds=1 > > > > Does it seems coherent ? > > Cheers, > > Patrick Tran Van > > Groups Chapuisat, Robinson-Rechavi & Schwander > Department of Ecology and Evolution > University of Lausanne > Le Biophore > CH-1015 Lausanne > Switzerland > Office 3206 > > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From glenna.kramer at utoronto.ca Wed Jun 21 09:51:11 2017 From: glenna.kramer at utoronto.ca (Glenna Kramer) Date: Wed, 21 Jun 2017 15:51:11 +0000 Subject: [maker-devel] How to address errors encountered in process of submitting genome Message-ID: <4781C7F0FC2DAA4BBC18FC44DC9D09AE015065617A@ArborExMBx4P.UTORARBOR.UTORAD.Utoronto.ca> Hi there, I am attempting to submit a fungal genome to NCBI and have run into quite a few errors running tbl2asn. 
I know this isn't directly related to MAKER, but I'm hoping that someone here has been through this process and would be able to give some insight (or at least point me in the direction of another knowledgeable source)!

Here is a general overview of the process that I have used so far:
1. Converted MAKER GFF3 files to tbl files using GAG (ran options remove_introns_shorter_than 10 and fix_start_stop and added functional annotation as well). This seems to work well to convert the GFF3 to tbl.
2. Use tbl2asn to convert the tbl to sqn file. This also seems to work, but I am getting lots of errors in the .val output file, which I am unsure how to address. There are...

56 ERROR: SEQ_FEAT.BadTrailingHyphen
1525 ERROR: SEQ_FEAT.InternalStop
1142 ERROR: SEQ_FEAT.NoStop
10 ERROR: SEQ_FEAT.PartialProblem
2368 ERROR: SEQ_FEAT.StartCodon
2368 ERROR: SEQ_INST.BadProteinStart
1525 ERROR: SEQ_INST.StopInProtein
8821 WARNING: SEQ_FEAT.NotSpliceConsensusAcceptor
8543 WARNING: SEQ_FEAT.NotSpliceConsensusDonor
83 WARNING: SEQ_FEAT.PartialProblem
1 WARNING: SEQ_FEAT.ProteinNameEndsInBracket
19 WARNING: SEQ_FEAT.ShortExon
270 INFO: SEQ_FEAT.RareSpliceConsensusDonor

Also, just as a side note, has anyone tried the new table2asn_GFF converter that is up to convert GFF3 directly to sqn? I was thinking that I would give that a shot hoping that it would help with some of these errors. However, I was instantly met with an error as well. "Too many positional arguments (1), the offending value: ends."

Thank you so much in advance for any help you are able to give!

Glenna
URL: From jason.stajich at gmail.com Wed Jun 21 12:25:43 2017 From: jason.stajich at gmail.com (Jason Stajich) Date: Wed, 21 Jun 2017 18:25:43 +0000 Subject: [maker-devel] How to address errors encountered in process of submitting genome In-Reply-To: <4781C7F0FC2DAA4BBC18FC44DC9D09AE015065617A@ArborExMBx4P.UTORARBOR.UTORAD.Utoronto.ca> References: <4781C7F0FC2DAA4BBC18FC44DC9D09AE015065617A@ArborExMBx4P.UTORARBOR.UTORAD.Utoronto.ca> Message-ID: Glenna - FWIW - I've switched to doing an EVM cleanup with funannotate after MAKER due to these issues with MAKER and fungal genomes I submit. Jason On Wed, Jun 21, 2017 at 8:51 AM Glenna Kramer wrote: > Hi there, > > I am attempting to submit a fungal genome to NCBI and have run into quite > a few errors running tbl2asn. I know this isn't directly related to MAKER, > but I'm hoping that someone here has been through this process and would be > able to give some insight (or at least point me in the direction of another > knowledgeable source)! > > Here is a general overview of the process that have used so far: > 1. Converted MAKER GFF3 files to tbl files using GAG (ran options > remove_introns_shorter_than 10 and fix_start_stop and added functional > annotation as well). This seems to work well to convert the GFF3 to tbl. > 2. Use tbl2asn to convert the tbl to sqn file. This also seems to work, > but I am getting lots of errors in the .val output file, which I am unsure > how to address. There are... 
>
> 56 ERROR: SEQ_FEAT.BadTrailingHyphen
> 1525 ERROR: SEQ_FEAT.InternalStop
> 1142 ERROR: SEQ_FEAT.NoStop
> 10 ERROR: SEQ_FEAT.PartialProblem
> 2368 ERROR: SEQ_FEAT.StartCodon
> 2368 ERROR: SEQ_INST.BadProteinStart
> 1525 ERROR: SEQ_INST.StopInProtein
> 8821 WARNING: SEQ_FEAT.NotSpliceConsensusAcceptor
> 8543 WARNING: SEQ_FEAT.NotSpliceConsensusDonor
> 83 WARNING: SEQ_FEAT.PartialProblem
> 1 WARNING: SEQ_FEAT.ProteinNameEndsInBracket
> 19 WARNING: SEQ_FEAT.ShortExon
> 270 INFO: SEQ_FEAT.RareSpliceConsensusDonor
>
> Also, just as a side note, has anyone tried the new table2asn_GFF
> converter that is up to convert GFF3 directly to sqn? I was thinking that
> I would give that a shot hoping that it would help with some of these
> errors. However, I was instantly met with an error as well. "Too many
> positional arguments (1), the offending value: ends."
>
> Thank you so much in advance for any help you are able to give!
>
> Glenna

From glenna.kramer at utoronto.ca Wed Jun 21 20:33:15 2017
From: glenna.kramer at utoronto.ca (Glenna Kramer)
Date: Thu, 22 Jun 2017 02:33:15 +0000
Subject: [maker-devel] How to address errors encountered in process of submitting genome
In-Reply-To:
References: <4781C7F0FC2DAA4BBC18FC44DC9D09AE015065617A@ArborExMBx4P.UTORARBOR.UTORAD.Utoronto.ca>,
Message-ID: <4781C7F0FC2DAA4BBC18FC44DC9D09AE01506571C4@ArborExMBx4P.UTORARBOR.UTORAD.Utoronto.ca>

Thank you for the thought!

So, to clarify, do you use funannotate predict on the maker gff files, similar to the last example given here? https://github.com/nextgenusfs/funannotate/wiki. I'm completely willing to give it a shot...

This brings up other questions for me, though. How do you do your functional annotation? Maker?
I noticed that funannotate will do functional annotation, but I am currently adding in my functional annotation using GAG when I convert the maker gff to tbl.

Also, from what I understand, funannotate will output a gbk from the gff. Do you have a particular file conversion tool to get that into the sqn format that you've had success with?

Thanks,
Glenna
________________________________________
From: Jason Stajich [jason.stajich at gmail.com]
Sent: Wednesday, June 21, 2017 2:25 PM
To: Glenna Kramer; maker-devel at yandell-lab.org
Subject: Re: [maker-devel] How to address errors encountered in process of submitting genome

Glenna -

FWIW - I've switched to doing an EVM cleanup with funannotate after MAKER due to these issues with MAKER and fungal genomes I submit.

Jason

On Wed, Jun 21, 2017 at 8:51 AM Glenna Kramer wrote:

Hi there,

I am attempting to submit a fungal genome to NCBI and have run into quite a few errors running tbl2asn. I know this isn't directly related to MAKER, but I'm hoping that someone here has been through this process and would be able to give some insight (or at least point me in the direction of another knowledgeable source)!

Here is a general overview of the process that have used so far:
1. Converted MAKER GFF3 files to tbl files using GAG (ran options remove_introns_shorter_than 10 and fix_start_stop and added functional annotation as well). This seems to work well to convert the GFF3 to tbl.
2. Use tbl2asn to convert the tbl to sqn file. This also seems to work, but I am getting lots of errors in the .val output file, which I am unsure how to address. There are...
56 ERROR: SEQ_FEAT.BadTrailingHyphen
1525 ERROR: SEQ_FEAT.InternalStop
1142 ERROR: SEQ_FEAT.NoStop
10 ERROR: SEQ_FEAT.PartialProblem
2368 ERROR: SEQ_FEAT.StartCodon
2368 ERROR: SEQ_INST.BadProteinStart
1525 ERROR: SEQ_INST.StopInProtein
8821 WARNING: SEQ_FEAT.NotSpliceConsensusAcceptor
8543 WARNING: SEQ_FEAT.NotSpliceConsensusDonor
83 WARNING: SEQ_FEAT.PartialProblem
1 WARNING: SEQ_FEAT.ProteinNameEndsInBracket
19 WARNING: SEQ_FEAT.ShortExon
270 INFO: SEQ_FEAT.RareSpliceConsensusDonor

Also, just as a side note, has anyone tried the new table2asn_GFF converter that is up to convert GFF3 directly to sqn? I was thinking that I would give that a shot hoping that it would help with some of these errors. However, I was instantly met with an error as well. "Too many positional arguments (1), the offending value: ends."

Thank you so much in advance for any help you are able to give!

Glenna

From jason.stajich at gmail.com Wed Jun 21 22:09:50 2017
From: jason.stajich at gmail.com (Jason Stajich)
Date: Thu, 22 Jun 2017 04:09:50 +0000
Subject: [maker-devel] How to address errors encountered in process of submitting genome
In-Reply-To: <4781C7F0FC2DAA4BBC18FC44DC9D09AE01506571C4@ArborExMBx4P.UTORARBOR.UTORAD.Utoronto.ca>
References: <4781C7F0FC2DAA4BBC18FC44DC9D09AE015065617A@ArborExMBx4P.UTORARBOR.UTORAD.Utoronto.ca> <4781C7F0FC2DAA4BBC18FC44DC9D09AE01506571C4@ArborExMBx4P.UTORARBOR.UTORAD.Utoronto.ca>
Message-ID:

Quick answers for now.

A) you can feed maker gff to Funannotate or run it alone.

B) I run the annotate step in funannotate but generally transfer only swissprot annots as product desc. Have to manually edit to remove systematic orf names in product desc that NCBI will flag - e.g. YAL001W, AN1234, ARB_xx.
You have to edit the annotations.swissprot.txt file to use the product descriptor if you want to promote these to full product descriptions in the resulting .tbl file. May want to run iprscan locally or wait for it running remotely to get GO assignments included.

C) you get .tbl and .fsa files from gag and these are processed by tbl2asn to get the sqn file. All are produced in the result file. All automatic.

Jason

On Wed, Jun 21, 2017 at 7:33 PM Glenna Kramer wrote:

> Thank you for the thought!
>
> So, to clarify do you use funannotate predict on the maker gff files,
> similar to the last example given here?
> https://github.com/nextgenusfs/funannotate/wiki. I'm completely willing
> to give it a shot...
>
> Is brings up other questions for me, though. How do you do your
> functional annotation? Maker? I noticed that funannotate will do functional
> annotation, but currently was adding in my functional annotation using GAG
> when I was converting the maker gff to tbl.
>
> Also, from what I understand, funannotate will output a gbk from the gff.
> Do you have a particular file conversion tool to get that onto the sqn
> format that you've had success with?
>
> Thanks,
> Glenna
> ________________________________________
> From: Jason Stajich [jason.stajich at gmail.com]
> Sent: Wednesday, June 21, 2017 2:25 PM
> To: Glenna Kramer; maker-devel at yandell-lab.org
> Subject: Re: [maker-devel] How to address errors encountered in process of
> submitting genome
>
> Glenna -
>
> FWIW - I've switched to doing an EVM cleanup with funannotate after MAKER
> due to these issues with MAKER and fungal genomes I submit.
>
> Jason
>
> On Wed, Jun 21, 2017 at 8:51 AM Glenna Kramer wrote:
> Hi there,
>
> I am attempting to submit a fungal genome to NCBI and have run into quite
> a few errors running tbl2asn.
I know this isn't directly related to MAKER, > but I'm hoping that someone here has been through this process and would be > able to give some insight (or at least point me in the direction of another > knowledgeable source)! > > Here is a general overview of the process that have used so far: > 1. Converted MAKER GFF3 files to tbl files using GAG (ran options > remove_introns_shorter_than 10 and fix_start_stop and added functional > annotation as well). This seems to work well to convert the GFF3 to tbl. > 2. Use tbl2asn to convert the tbl to sqn file. This also seems to work, > but I am getting lots of errors in the .val output file, which I am unsure > how to address. There are... > > 56 ERROR: SEQ_FEAT.BadTrailingHyphen > 1525 ERROR: SEQ_FEAT.InternalStop > 1142 ERROR: SEQ_FEAT.NoStop > 10 ERROR: SEQ_FEAT.PartialProblem > 2368 ERROR: SEQ_FEAT.StartCodon > 2368 ERROR: SEQ_INST.BadProteinStart > 1525 ERROR: SEQ_INST.StopInProtein > 8821 WARNING: SEQ_FEAT.NotSpliceConsensusAcceptor > 8543 WARNING: SEQ_FEAT.NotSpliceConsensusDonor > 83 WARNING: SEQ_FEAT.PartialProblem > 1 WARNING: SEQ_FEAT.ProteinNameEndsInBracket > 19 WARNING: SEQ_FEAT.ShortExon > 270 INFO: SEQ_FEAT.RareSpliceConsensusDonor > > Also, just as a side note, has anyone tried the new table2asn_GFF > converter that is up to convert GFF3 directly to sqn? I was thinking that > I would give that a shot hoping that it would help with some of these > errors. However, I was instantly met with an error as well. "Too many > positional arguments (1), the offending value: ends." > > Thank you so much in advance for any help you are able to give! > > Glenna > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -- Jason Stajich jason.stajich at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... 
From tfallon at mit.edu Tue Jun 13 11:35:28 2017
From: tfallon at mit.edu (Tim Fallon)
Date: Tue, 13 Jun 2017 13:35:28 -0400
Subject: [maker-devel] Maker protein match & tandem similar genes
Message-ID: <6AE52D93-3F7D-4389-BE7B-BD3033F3F316@mit.edu>

Hi there,

I am aligning reference proteins to an insect genome through Maker, in preparation for using the gene models from the protein alignments as evidence to train SNAP (alongside de-novo assembled RNA-Seq). I also plan on passing the protein alignments to a future Maker run as hints for SNAP / Augustus.

I've noticed that the maker blastx "protein_match" feature, which I presume is a result of Maker trying to make the blastx HSPs contiguous to format as a reference for exonerate (this Maker run did have protein2genome turned on), tends to fuse tandem genes from the same gene family.

See attached image. The red regions highlight two de novo assembled transcripts which I aligned manually, from two genes that are homologous. The top track is the blastx "match_part" features, the bottom track is the blastx "protein_match" features. You can see that the protein_match fuses the two genes, using ~1000 bp in an intervening region that doesn't have blastx HSP support in the blastx "match_part" track.

The trick seems to be that a single reference protein has blastx matches on both the left and right gene. Clearly this isn't a good gene model to train SNAP with, but would this misannotation screw up the hints passed to pretrained SNAP / Augustus?

Is there any way to prevent this protein_match fusing of adjacent similar genes from happening? For species that are closer, I've set the "eval_blastx" to be a lot more stringent (1e-50), and in that case the genes don't get fused (but, with that level of stringent search, it is more like an orthology search, rather than just annotating general protein similarity). I do have (rare) introns ~1000 bp, so I wouldn't want to change the Maker "split_hit" parameter to be too low.
All the best,
-Tim

Timothy R. Fallon
PhD candidate
Laboratory of Jing-Ke Weng
Department of Biology
MIT
tfallon at mit.edu

-------------- next part --------------
A non-text attachment was scrubbed...
Name: protein_match_example.png
Type: image/png
Size: 142379 bytes
Desc: not available

From tfallon at mit.edu Fri Jun 16 09:07:14 2017
From: tfallon at mit.edu (Tim Fallon)
Date: Fri, 16 Jun 2017 11:07:14 -0400
Subject: [maker-devel] Database disk image is malformed error
Message-ID: <41AECDEF-7E2B-4143-BDBC-F10937AEDE3B@mit.edu>

Hi there,

I've been running MAKER in a 2 stage way using MPI, to annotate a de novo insect genome. By two stage, I mean for stage 1 I have a lot of independent folders / maker runs (e.g. individual reference insect proteomes passed as FASTA with protein2genome=1), and then for stage 2 in a separate folder I am concatenating all that evidence from Stage 1 (using gff3_merge -o) and passing it as GFF parameters.

Stage 2 has been crashing. It takes a very long time to set up the SQLite DB (~24 hours, with 39 MPI CPUs), and then once it is all loaded it works for a couple seconds then crashes with things like this:

"DBD::SQLite::db selectcol_arrayref failed: database disk image is malformed at /lab/solexa_weng/testtube/maker_3.00_beta/bin/../lib/GFFDB.pm line 525."

I am passing a lot of evidence to Stage 2, probably more than people typically pass (the GFFs together are 44GB, whereas the resulting *.db file is 95G).

Have you seen this error before? I'm thinking it could be a couple possibilities:

1) Running up against SQLite size / concurrency constraints where the .db ends up being malformed due to MPI / passing too much evidence.
Solution -> Load GFFs without MPI, or load less evidence.

2) GFFs are malformed (although they pass validation with GT). Solution -> Remove the malformed GFF evidence, although I haven't been able to track any malformed GFFs down.

3) Identifiers in the GFF that are unique within a single file become non-unique once the files are merged. Solution -> Manually rename IDs in passed GFF files to be unique.

Thoughts?

All the best,
-Tim

Timothy R. Fallon
PhD candidate
Laboratory of Jing-Ke Weng
Department of Biology
MIT
tfallon at mit.edu

From aravindp at imcb.a-star.edu.sg Thu Jun 22 00:39:28 2017
From: aravindp at imcb.a-star.edu.sg (Aravind PRASAD)
Date: Thu, 22 Jun 2017 06:39:28 +0000
Subject: [maker-devel] Maker annotation of large scaffolds
Message-ID:

Hi All,

I'm trying to annotate a fish genome using the Maker pipeline. It could finish the annotation for most scaffolds except 5 of them, which are around 100M base pairs in size. The current clusters in our institute have a time limit of 24hrs for a job and these scaffolds could not be annotated within that time.

Can you please suggest any other way of finishing the annotation for large scaffolds?

I thought of chunking up the scaffolds to run, but I'm afraid that would split a gene into two.

Thanks for your time.

Regards,
Aravind PRASAD :: Research Officer :: Comparative and Medical Genomics Lab :: Institute of Molecular and Cell Biology (IMCB) :: Agency for Science, Technology and Research (A*STAR)
61 Biopolis Drive :: #5-04 Proteos :: Singapore 138673 :: DID (+65) 6586 9573 :: Fax (+65) 6779 1117 :: http://www.imcb.a-star.edu.sg/

Note: This message may contain confidential information. If this Email/Fax has been sent to you by mistake, please notify the sender and delete it immediately. Thank you.
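The chunking idea could be sketched roughly as follows: cut each scaffold into fixed-size windows that overlap, so that any gene no longer than the overlap lies wholly inside at least one chunk. This is only an illustration and assumes nothing about MAKER itself; the function names, the chunk naming scheme and the sizes in the usage note are hypothetical, and reconciling annotations that fall in the overlapping regions is not handled.

```python
def overlapping_windows(seq_len, window, overlap):
    """Return 0-based, half-open (start, end) windows covering seq_len bases.

    Consecutive windows share `overlap` bases; the last one may be shorter.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must be >= 0 and smaller than window")
    if seq_len <= 0:
        return []
    step = window - overlap
    windows = []
    start = 0
    while True:
        end = min(start + window, seq_len)
        windows.append((start, end))
        if end == seq_len:
            break
        start += step
    return windows


def split_record(name, seq, window, overlap):
    """Yield (chunk_name, offset, chunk_seq) for one scaffold.

    chunk_name uses 1-based inclusive coordinates; offset is the 0-based
    start, i.e. the amount to add to chunk coordinates to map them back.
    """
    for start, end in overlapping_windows(len(seq), window, overlap):
        yield f"{name}:{start + 1}-{end}", start, seq[start:end]


if __name__ == "__main__":
    # Toy example: a 20 bp "scaffold" cut into 8 bp chunks overlapping by 4 bp.
    for chunk_name, offset, chunk_seq in split_record("scaf1", "ACGT" * 5, 8, 4):
        print(chunk_name, offset, chunk_seq)
```

For a real scaffold one might call `split_record(name, seq, 10_000_000, 100_000)`, with the overlap chosen to be at least as long as the longest gene expected; both numbers here are placeholders.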

From munholl at uwindsor.ca Thu Jun 22 09:43:22 2017
From: munholl at uwindsor.ca (Seth Munholland)
Date: Thu, 22 Jun 2017 11:43:22 -0400
Subject: [maker-devel] Maker annotation of large scaffolds

I would think splitting could work if you generate a sufficient overlap, i.e. 1-100k, 50-150k, etc. Reassembling the annotations for the overlap regions may be tricky if you get conflicting annotations, though.

Seth Munholland, B.Sc.
Department of Biological Sciences
Rm. 304 Biology Building
University of Windsor
401 Sunset Ave. N9B 3P4
T: (519) 253-3000 Ext: 4755

From carsonhh at gmail.com Thu Jun 22 22:06:00 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 22 Jun 2017 22:06:00 -0600
Subject: [maker-devel] Maker annotation of large scaffolds

If running under MPI, the only step that should take a long time would be a final clustering step (the clustering is not parallelized). It should run in well under 24 hours though, so perhaps it is a memory issue or a feature depth issue. You can try running the contig by itself and setting all the blast_depth parameters in maker_bopts.ctl to 10 to help with both.

Otherwise, making a large overlap for subdivided contigs (50-100 kb) should be enough. Alternatively, look for stretches of NNNNNN's in the contig and split on those.

--Carson

From carsonhh at gmail.com Thu Jun 22 22:15:09 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 22 Jun 2017 22:15:09 -0600
Subject: [maker-devel] Database disk image is malformed error
In-Reply-To: <41AECDEF-7E2B-4143-BDBC-F10937AEDE3B@mit.edu>
References: <41AECDEF-7E2B-4143-BDBC-F10937AEDE3B@mit.edu>
Message-ID: <925025C6-C48C-4E29-9A9F-A7CD50ECA211@gmail.com>

Don't use the GFF3 as input to the second stage. Use the original work directory, and just modify any parameters in the control file. MAKER will reuse old results and only delete things that require a rerun.
Using the GFF3 as input is just a way to reuse MAKER data when the work directory is no longer available, and in most cases you will only pass in the genes (and not the evidence in the GFF3).

Thanks,
Carson

From carsonhh at gmail.com Thu Jun 22 22:27:02 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 22 Jun 2017 22:27:02 -0600
Subject: [maker-devel] Maker protein match & tandem similar genes
In-Reply-To: <6AE52D93-3F7D-4389-BE7B-BD3033F3F316@mit.edu>
References: <6AE52D93-3F7D-4389-BE7B-BD3033F3F316@mit.edu>
Message-ID: <6DBD3F6E-C2DA-47AB-AF6B-DEA396789C79@gmail.com>

The protein_match features are the direct BLASTX results. Because of how BLAST works, if you have neighboring paralogs, it can place HSPs in both, so the final hit ends up being too large. The protein2genome feature is then the result of exonerate polishing these BLAST alignments (this will usually remove false merging and bad exon order). The protein2genome=1 option, on the other hand, just tells MAKER that you want to try to convert the exonerate hits directly into gene models (only do this for training and not for final annotation).

One way to drop the BLASTX results may be to filter the GFF3 results to keep only protein2genome features, pass those into protein_gff, and then turn off protein= for the next run. This forces the blastx results to be dropped. You may want to set the blast_depth parameters to something like 10 in maker_bopts.ctl before doing this, to trim the per-locus evidence depth to 10 if you are using too much input data.

--Carson

> On Jun 13, 2017, at 11:35 AM, Tim Fallon wrote:
>
> Hi there,
>
> I am aligning reference proteins to an insect genome through Maker, in preparation for using the gene models from the protein alignments as evidence to train SNAP (alongside de novo assembled RNA-Seq). I also plan on passing the protein alignments to a future Maker run as hints for SNAP / Augustus.
>
> I've noticed that the Maker blastx "protein_match" feature, which I presume is a result of Maker trying to make the blastx HSPs contiguous to format as a reference for exonerate (this Maker run did have protein2genome turned on), tends to fuse tandem genes from the same gene family. See attached image.
>
> The red regions highlight two de novo assembled transcripts which I aligned manually, from two genes that are homologous. The top track is the blastx "match_part" features, the bottom track is the blastx "protein_match" features. You can see that the protein_match fuses the two genes, using ~1000 bp in an intervening region that doesn't have blastx HSP support in the blastx "match_part" track. The trick seems to be that a single reference protein has blastx matches on both the left and the right gene.
>
> Clearly this isn't a good gene model to train SNAP with, but would this misannotation screw up the hints passed to pretrained SNAP / Augustus?
>
> Is there any way to prevent this protein_match fusing of adjacent similar genes from happening? For species that are closer, I've set "eval_blastx" to be a lot more stringent (1e-50), and in that case the genes don't get fused (but, with that level of stringent search, it is more like an orthology search, rather than just annotating general protein similarity). I do have (rare) introns of ~1000 bp, so I wouldn't want to set the Maker "split_hit" parameter too low.
>
> All the best,
> -Tim
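
The filtering step Carson describes (keep only the protein2genome features from the GFF3 so they can be passed in through protein_gff) could be sketched roughly as follows. This is only an illustration, not part of MAKER: the function name is hypothetical, and it assumes that the exonerate-polished alignments carry "protein2genome" in the GFF3 source column (column 2), which should be checked against your own MAKER output.

```python
def keep_protein2genome(gff_lines):
    """Yield GFF3 lines whose source column (column 2) is 'protein2genome'.

    Assumption: MAKER labels exonerate-polished protein alignments with
    'protein2genome' in column 2. Comment/directive lines are passed through,
    and reading stops at '##FASTA' so embedded sequence is not copied.
    """
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break
        if line.startswith("#"):
            yield line
            continue
        cols = line.rstrip("\n").split("\t")
        # A GFF3 feature line has exactly nine tab-separated columns.
        if len(cols) == 9 and cols[1] == "protein2genome":
            yield line


if __name__ == "__main__":
    import sys

    # Usage (file names are placeholders):
    #   python filter_p2g.py < merged.gff > protein2genome_only.gff
    sys.stdout.writelines(keep_protein2genome(sys.stdin))
```

The filtered file would then be given to protein_gff in maker_opts.ctl with protein= left empty, as described in the reply.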

From carson.holt at genetics.utah.edu Thu Jun 22 22:31:58 2017
From: carson.holt at genetics.utah.edu (Carson Holt)
Date: Fri, 23 Jun 2017 04:31:58 +0000
Subject: [maker-devel] Request a favor regarding MAKER
Message-ID: <02EDBAC4-5338-4FE3-99AA-D98CF346A753@genetics.utah.edu>

Sorry for the slow reply. This message somehow got overlooked.

The lock refers to a file lock. It usually means there is another active MAKER process trying to run in the same directory as your current MAKER process. This may mean you have problems with the MPI setup if you are using MAKER under MPI. Or, if you started MAKER multiple times simultaneously, you got a collision when both processes tried to work with the same data. Just kill all active MAKER processes and restart if that is the case. If it's an MPI issue, run maker with the -h flag added to the current MPI command you are using to run MAKER. If it prints the help message more than once, then the MPI communication ring is having an issue. This could be a problem with how you installed MAKER or how you installed MPI.

--Carson

> On Jun 9, 2017, at 2:39 AM, shaf wrote:
>
> Greetings,
> My name is Shaf and currently I'm using MAKER for my data. I did manage to get some results using MAKER, but I have a problem with my storage.
>
> So I tried to run maker in another directory with more space.
> As far as I know, I have already set up my maker so that it can be run anywhere.
>
> I installed my maker on /
> Then when I tried to run it on /media/nklee/2TB data/example$ maker, I got an error:
>
> ERROR: The directory is locked. Perhaps by an instance of MAKER.
>
> --> rank=NA, hostname=Lee-Server
>
> I checked it using nklee at Lee-Server:/media/nklee/2TB data$ maker
> and my maker is there.
>
> May I know how to solve this problem? Thank you in advance.
>
> Regards,
> Shaf

From tfallon at mit.edu Thu Jun 22 22:33:59 2017
From: tfallon at mit.edu (Tim Fallon)
Date: Fri, 23 Jun 2017 00:33:59 -0400
Subject: [maker-devel] Database disk image is malformed error
In-Reply-To: <925025C6-C48C-4E29-9A9F-A7CD50ECA211@gmail.com>
References: <41AECDEF-7E2B-4143-BDBC-F10937AEDE3B@mit.edu> <925025C6-C48C-4E29-9A9F-A7CD50ECA211@gmail.com>

Hi Carson,

Thanks for the tip! The issue turned out to be that I needed to use the "-l" parameter for gff3_merge, to automatically rename the IDs when merging them, and also to pass the appropriate evidence in the merged GFF using the "Re-annotation Using MAKER Derived GFF3" parameters. I was using the more general parameters further down (protein_gff, est_gff, etc.).

It seems to be working now, though I am still getting the hang of how to fix up misbehaving gene models.

All the best,
-Tim
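
Since the problem in this thread came down to IDs that were unique within each GFF3 file but collided once the files were merged, a quick pre-merge check along the following lines may save a long database load. It is only a sketch under stated assumptions: the function and label names are hypothetical, it is not part of MAKER, and it only inspects the ID attribute in column 9 of each feature line.

```python
import re
from collections import defaultdict

# Matches the ID attribute in a GFF3 attribute column, e.g. "ID=abc;Name=x".
ID_RE = re.compile(r"(?:^|;)ID=([^;]+)")


def find_shared_ids(named_gffs):
    """Return {feature_id: sorted labels} for IDs seen in more than one input.

    named_gffs maps a label (e.g. a file name) to an iterable of GFF3 lines.
    """
    seen = defaultdict(set)
    for label, lines in named_gffs.items():
        for line in lines:
            if line.startswith("##FASTA"):
                break  # embedded sequence follows; no more features
            if line.startswith("#"):
                continue
            cols = line.rstrip("\n").split("\t")
            if len(cols) != 9:
                continue
            match = ID_RE.search(cols[8])
            if match:
                seen[match.group(1)].add(label)
    return {fid: sorted(labels) for fid, labels in seen.items() if len(labels) > 1}


if __name__ == "__main__":
    import sys

    # Usage (file names are placeholders): python shared_ids.py run1.gff run2.gff
    # All handles are opened up front, so this is meant for a modest number of files.
    handles = {path: open(path) for path in sys.argv[1:]}
    for fid, labels in sorted(find_shared_ids(handles).items()):
        print(fid, ",".join(labels), sep="\t")
    for handle in handles.values():
        handle.close()
```

Any ID reported here appears in at least two of the inputs and would need renaming before the files are combined.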

From tfallon at mit.edu Thu Jun 22 22:59:10 2017
From: tfallon at mit.edu (Tim Fallon)
Date: Fri, 23 Jun 2017 00:59:10 -0400
Subject: [maker-devel] Maker protein match & tandem similar genes
In-Reply-To: <6DBD3F6E-C2DA-47AB-AF6B-DEA396789C79@gmail.com>
References: <6AE52D93-3F7D-4389-BE7B-BD3033F3F316@mit.edu> <6DBD3F6E-C2DA-47AB-AF6B-DEA396789C79@gmail.com>
Message-ID: <7164F2B4-FD8A-466F-B7CA-A1BEEF98223F@mit.edu>

Hi Carson,

Thanks for the response! After sending my initial email, I did notice that this particular issue was warned about in the Campbell et al. 2014 Maker protocols paper. Perhaps future versions of the pipeline might have a workaround or warning for this presumably common issue. At least in my case, the genome I'm annotating has large introns and also tandem clusters of homologous genes, so I've been unable to solve this issue entirely by changing existing parameters (e.g. split_hit), though perhaps exonerate / protein2genome direct gene annotation does handle it correctly.

Regarding protein2genome only being an intermediate stage: as I've been working towards a final annotation, I've actually been mostly relying on the protein2genome direct gene annotation. Although I have a trained Augustus that is presumably getting the hints from the evidence, my main target genes have been producing subtly wrong gene models (Augustus produced splice sites off by a handful of nucleotides, leading to unintended and unsupported amino acids in the protein). I also trained SNAP, but those predictions were worse than the Augustus predictions.

Do you have any tips for using the Evidence Modeler integration of the Maker 3.0.0 beta? That seems to be the best way to have the final gene models rely more on extrinsic evidence than on my mildly incorrect ab initio predictions. Or perhaps PASA is more appropriate for gene models that would strictly adhere to extrinsic de novo assembled transcript / predicted ORF evidence?

All the best,
-Tim

From aravindp at imcb.a-star.edu.sg Fri Jun 23 02:25:18 2017
From: aravindp at imcb.a-star.edu.sg (Aravind PRASAD)
Date: Fri, 23 Jun 2017 08:25:18 +0000
Subject: [maker-devel] Maker annotation of large scaffolds

Hi All,

Thank you for your inputs. Currently, I'm not using the MPI version but running Maker in multiple instances. Previously, I tried to run the MPI version but it failed, even though the MPI-Maker installation itself reported no issues.

Carson, can you please explain what exactly the blast_depth option does while running Maker?

Thank you all for your time!

Regards,
Aravind.
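
The two splitting strategies suggested earlier in this thread (cutting a scaffold at stretches of N's, or cutting it into windows with a generous overlap) can be sketched as coordinate calculations. This is a minimal illustration only: the function names and the min_run threshold are hypothetical choices, and the overlap should be at least as long as the longest gene you expect (Carson suggested 50-100 kb) so that every gene lies entirely inside at least one piece.

```python
import re


def split_on_n_runs(seq, min_run=100):
    """Return 0-based half-open (start, end) pieces of seq lying between
    runs of at least min_run consecutive N/n characters."""
    pieces = []
    pos = 0
    for gap in re.finditer(r"[Nn]{%d,}" % min_run, seq):
        if gap.start() > pos:
            pieces.append((pos, gap.start()))
        pos = gap.end()
    if pos < len(seq):
        pieces.append((pos, len(seq)))
    return pieces


def overlapping_windows(length, size, overlap):
    """Return 0-based half-open windows of at most `size` bases that cover
    [0, length), with consecutive windows sharing `overlap` bases."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    windows = []
    start = 0
    while True:
        end = min(start + size, length)
        windows.append((start, end))
        if end == length:
            break
        start += step
    return windows


if __name__ == "__main__":
    # Toy example: a 100 Mb scaffold cut into 10 Mb windows with 100 kb overlap.
    for start, end in overlapping_windows(100_000_000, 10_000_000, 100_000):
        print(start, end)
```

Splitting at N runs avoids the reconciliation problem Seth mentions, because gaps in the assembly cannot contain a gene model; overlapping windows still require choosing one annotation for each gene that falls in a shared region, and the coordinates of each piece have to be shifted back by its start offset when the annotations are combined.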

From patrick.tranvan at unil.ch Mon Jun 26 03:48:23 2017
From: patrick.tranvan at unil.ch (Patrick Tran Van)
Date: Mon, 26 Jun 2017 09:48:23 +0000
Subject: [maker-devel] Advice on my pipeline
References: <6b029690bace4d3fbae77c0bb1bddce8@prdexch02.ad.unil.ch>
Message-ID: <1498470630221.84642@unil.ch>

Thanks for your answer.

1) Do you think that adding an Augustus training round in addition to SNAP at steps 3 and 5 will add more confidence (instead of adding Augustus only for the final round)? I ask because I am using autoAug for this and it takes a while to compute.

2) I don't see this option: 'avoid_est_fusion=1'. I tried to add it, but I got this error:

WARNING: Invalid option 'avoid_est_fusion' in control file maker_opts.ctl

(I am using v2.31.8.)

Patrick Tran Van

________________________________
From: Carson Holt
Sent: Monday, June 5, 2017 8:29 PM
To: Patrick Tran Van
Cc: maker-devel at yandell-lab.org
Subject: Re: [maker-devel] Advice on my pipeline

Your plan sounds good. A couple of related notes.

Insect genomes tend to have high gene density, so gene merging will be the primary difficulty. You can avoid merging of mRNA-seq evidence by using options like jaccard_clip in Trinity. Then use avoid_est_fusion=1 inside of MAKER.

Also, it is more convenient to do each run in the same directory rather than supplying the previous run as GFF3 input. MAKER will automatically recycle previous results archived in the run directory when you do this. Using the maker_gff option is really more for getting data into the run from jobs performed a long time ago (so they can't be run in the same directory).

-Carson

From carsonhh at gmail.com Mon Jun 26 15:38:19 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 26 Jun 2017 15:38:19 -0600
Subject: [maker-devel] Advice on my pipeline
In-Reply-To: <1498470630221.84642@unil.ch>
References: <6b029690bace4d3fbae77c0bb1bddce8@prdexch02.ad.unil.ch> <1498470630221.84642@unil.ch>
Message-ID: <696C51C6-5606-4ECB-A8B8-9C077182FFFA@gmail.com>

Sorry, the option is correct_est_fusion. It is in the maker_opts.ctl file.

I would use both SNAP and Augustus on a few large contigs and then review the results manually. If one of them is not behaving well, then drop it. If both behave well (i.e. correlate well with the evidence alignments), then keep them both.

--Carson

From carsonhh at gmail.com Mon Jun 26 15:48:03 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 26 Jun 2017 15:48:03 -0600
Subject: [maker-devel] Maker annotation of large scaffolds
Message-ID: <77736B2B-643C-417F-AB5E-5C709735E3AF@gmail.com>

All results are kept and placed into the final GFF3 unless you set blast_depth. Basically, most alignments are redundant and can be thrown away early in the process, but MAKER does not do this by default because most users tend to want to see all the evidence. MAKER only uses 10 alignments to build its calculations anyway. The rest are just kept for reference. But the cost of keeping the other alignments around can be substantial if you are in a region with deep evidence depth (I've seen regions with 5,000 - 10,000 evidence alignments for some datasets). So if you set blast_depth, it tells MAKER you are OK with throwing out the extra depth early (MAKER still parses all alignments; it just throws the extra ones away as it determines they are not useful or are redundant).
This saves a lot of time and RAM downstream at the cost of losing the alignments in the report. A depth of 10 means that no more than 10 alignments per data source will be kept per locus.

--Carson

> On Jun 23, 2017, at 2:25 AM, Aravind PRASAD wrote:
>
> Hi All,
>
> Thank you for your inputs. Currently, I'm not using the MPI version but running Maker in multiple instances. Previously, I tried to run the MPI version but it failed, though the installation of MPI-Maker had no issues.
>
> Carson, can you please explain what exactly the blast_depth option does while running Maker?
>
> Thank you all for your time!
>
> Regards,
> Aravind.
>
> From: Carson Holt [mailto:carsonhh at gmail.com]
> Sent: Friday, 23 June, 2017 12:06 PM
> To: Seth Munholland
> Cc: Aravind PRASAD; maker-devel at yandell-lab.org
> Subject: Re: [maker-devel] Maker annotation of large scaffolds
>
> If running under MPI, the only step that should take a long time would be a final clustering step (the clustering is not parallelized). It should run in well under 24 hours though, so perhaps it is a memory issue or a feature depth issue. You can try running the contig by itself and setting all the blast_depth parameters in maker_bopts.ctl to 10 to help both.
>
> Otherwise making a large overlap for subdivided contigs (50-100kb) should be enough. Alternatively look for stretches of NNNNNN's in the contig and split on those.
>
> --Carson
>
> On Jun 22, 2017, at 9:43 AM, Seth Munholland wrote:
>
> I would think splitting could work if you generate a sufficient overlap, e.g. 1-100k, 50-150k, etc. Reassembling the annotations for the overlap regions may be tricky if you get conflicting annotations though.
>
> Seth Munholland, B.Sc.
> Department of Biological Sciences
> Rm. 304 Biology Building
> University of Windsor
> 401 Sunset Ave. N9B 3P4
> T: (519) 253-3000 Ext: 4755
>
> On Thu, Jun 22, 2017 at 2:39 AM, Aravind PRASAD wrote:
> Hi All,
>
> I'm trying to annotate a fish genome using the Maker pipeline. It could finish the annotation for most scaffolds except 5 of them, which are around 100M base pairs in size. The cluster in our institute has a time limit of 24hrs for a job and these scaffolds could not be annotated within that time.
> Can you please suggest any other way of finishing the annotation for large scaffolds?
>
> I thought of chunking up the scaffolds to run, but I'm afraid that would split a gene into two.
> Thanks for your time.
>
> Regards,
> Aravind PRASAD :: Research Officer :: Comparative and Medical Genomics Lab :: Institute of Molecular and Cell Biology (IMCB) :: Agency for Science, Technology and Research (A*STAR)
> 61 Biopolis Drive :: #5-04 Proteos :: Singapore 138673 :: DID (+65) 6586 9573 :: Fax (+65) 6779 1117 :: http://www.imcb.a-star.edu.sg/
>
> Note: This message may contain confidential information. If this Email/Fax has been sent to you by mistake, please notify the sender and delete it immediately. Thank you.
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From carsonhh at gmail.com Mon Jun 26 15:48:46 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 26 Jun 2017 15:48:46 -0600
Subject: [maker-devel] Maker annotation of large scaffolds
In-Reply-To:
References:
Message-ID:

Also you can run MPI within a single node and not across nodes.
This will still give a performance bonus equal to the MPI process count.

--Carson

From carsonhh at gmail.com Mon Jun 26 16:00:24 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 26 Jun 2017 16:00:24 -0600
Subject: [maker-devel] Maker protein match & tandem similar genes
In-Reply-To: <7164F2B4-FD8A-466F-B7CA-A1BEEF98223F@mit.edu>
References: <6AE52D93-3F7D-4389-BE7B-BD3033F3F316@mit.edu> <6DBD3F6E-C2DA-47AB-AF6B-DEA396789C79@gmail.com> <7164F2B4-FD8A-466F-B7CA-A1BEEF98223F@mit.edu>
Message-ID:

Augustus uses an HMM with scoring bonuses for evidence match.
If a difference in the assembly breaks the ORF anywhere in the transcript relative to the evidence, or removes high-scoring transcript start/stop sequences, then Augustus will add/skip exons or trim/extend transcripts to capture what scoring bonuses it can as best it can. So wherever you see Augustus behaving weirdly, you likely have something off in the assembly (a small stretch of NN's, or single-basepair duplications/deletions that affect the ORF and scoring model). So what Augustus produces is the best-fit gene model to hop around assembly anomalies while still producing a canonical model. In areas like the ones I describe above, EVM refuses to produce any model. So you can experiment with the EVM options in MAKER3, but what you may find is that problem regions tend to get no models with EVM.

I believe using the pred_gff trick I mentioned previously may be the easiest workaround. Also make sure to prefilter mRNA-seq evidence to avoid transcript joining (Trinity has a jaccard_clip option which can help), because if you are getting transcript joining in proteins, you are almost certain to get it in transcript evidence as well.

--Carson

> On Jun 22, 2017, at 10:59 PM, Tim Fallon wrote:
>
> Hi Carson,
>
> Thanks for the response! After sending my initial email, I did notice this particular issue was warned about in the Campbell et al. 2014 Maker protocols paper. Perhaps future versions of the pipeline might have a workaround or warning for this presumably common issue. At least in my case, the genome I'm annotating has large introns, and also tandem gene clusters of homologous genes, so I've been unable to solve this issue entirely by changing existing parameters (e.g. split_hit), though perhaps exonerate / protein2genome direct gene annotation does handle it correctly.
>
> Regarding the protein2genome only being an intermediate stage: as I've been working towards a final annotation, I've actually been mostly relying on the protein2genome direct gene annotation. Although I have a trained Augustus that is presumably getting the hints from the evidence, my main target genes have been producing subtly wrong gene models (Augustus produced splice sites off by a handful of nucleotides, leading to unintended & unsupported amino acids in the protein). I also trained SNAP, but those predictions were worse than the Augustus predictions.
>
> Do you have any tips for using the Evidence Modeler integration of the Maker 3.0.0 beta? That seems to be the best way to have the final gene models rely more on extrinsic evidence over my mildly incorrect ab-initio predictions. Or perhaps PASA is more appropriate for gene models that would strictly adhere to extrinsic de novo assembled transcript / predicted ORF evidence?
>
> All the best,
> -Tim
>
>> On Jun 23, 2017, at 12:27 AM, Carson Holt wrote:
>>
>> The protein_match features are the direct BLASTX results. Because of how BLAST works, if you have neighboring paralogs, it can place HSPs in both, so the final hit ends up being too large. The protein2genome feature is then the result of exonerate polishing these blast alignments (this will usually remove false merging and bad exon order). The protein2genome=1 option, on the other hand, just tells MAKER that you want to try and convert the exonerate hits directly into gene models (only do this for training and not final annotation). One way to drop the BLASTX results may be to filter the GFF3 results to keep only protein2genome features, pass those into protein_gff, and then turn off protein= for the next run. This forces the blastx results to be dropped. You may want to set blast_depth parameters to something like 10 in maker_bopts.ctl before doing this to trim per-locus evidence depth to 10 if you are using too much input data.
>>
>> --Carson
>>
>>> On Jun 13, 2017, at 11:35 AM, Tim Fallon wrote:
>>>
>>> Hi there,
>>>
>>> I am aligning reference proteins to an insect genome through Maker, in preparation for using the gene models from the protein alignments as evidence to train SNAP (alongside de novo assembled RNA-Seq). I also plan on passing the protein alignments to a future Maker run as hints for SNAP / Augustus.
>>>
>>> I've noticed that the maker blastx "protein_match" feature, which I presume is a result of Maker trying to make the blastx HSPs contiguous to format as a reference for exonerate (this Maker run did have protein2genome turned on), tends to fuse tandem genes from the same gene family. See attached image.
>>>
>>> The red regions highlight two de novo assembled transcripts which I aligned manually, from two genes that are homologous. The top track is the blastx "match_part" features, the bottom track is the blastx "protein_match" features. You can see that the protein_match fuses the two genes, using ~1000 bp in an intervening region that doesn't have blastx HSP support in the blastx "match_part" track. The trick seems to be that a single reference protein has blastx matches on both the left and right gene.
>>>
>>> Clearly this isn't a good gene model to train SNAP with, but would this misannotation screw up the hints passed to pretrained SNAP / Augustus?
>>>
>>> Is there any way to prevent this protein_match fusing of adjacent similar genes from happening? For species that are closer, I've set the "eval_blastx" to be a lot more stringent (1e-50), and in that case the genes don't get fused (but, with that level of stringent search, it is more like an orthology search, rather than just annotating general protein similarity). I do have (rare) introns ~1000 bp, so I wouldn't want to change the Maker "split_hit" parameter to be too low.
>>>
>>> All the best,
>>> -Tim
>>>
>>> Timothy R. Fallon
>>> PhD candidate
>>> Laboratory of Jing-Ke Weng
>>> Department of Biology
>>> MIT
>>>
>>> tfallon at mit.edu
>
> Timothy R. Fallon
> PhD candidate
> Laboratory of Jing-Ke Weng
> Department of Biology
> MIT
>
> tfallon at mit.edu

From aravindp at imcb.a-star.edu.sg Tue Jun 27 01:07:50 2017
From: aravindp at imcb.a-star.edu.sg (Aravind PRASAD)
Date: Tue, 27 Jun 2017 07:07:50 +0000
Subject: [maker-devel] Maker annotation of large scaffolds
In-Reply-To: <77736B2B-643C-417F-AB5E-5C709735E3AF@gmail.com>
References: <77736B2B-643C-417F-AB5E-5C709735E3AF@gmail.com>
Message-ID:

Thank you Carson for the explanation. The issue with annotating the large scaffolds is now resolved by using MPI Maker and changing the blast_depth option.

Aravind Prasad.
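
The two splitting strategies suggested in this thread (overlapping windows, or cutting at stretches of N's) can be sketched roughly as below. This is only an illustration, not part of MAKER: the function names, the window and overlap sizes, and the minimum N-run length of 6 are all assumptions chosen for the example, and the returned coordinates are 0-based, half-open spans on the original scaffold.

```python
import re


def overlapping_windows(length, window, overlap):
    """Return 0-based half-open (start, end) windows covering [0, length).

    Hypothetical helper: consecutive windows share `overlap` bases so that a
    gene cut by one window boundary lies whole inside a neighbouring window.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must be >= 0 and smaller than window")
    step = window - overlap
    windows = []
    start = 0
    while True:
        end = min(start + window, length)
        windows.append((start, end))
        if end >= length:
            break
        start += step
    return windows


def split_on_n_runs(seq, min_run=6):
    """Return (start, end) spans of sequence lying between runs of N's.

    Only runs of at least `min_run` N's (an assumed threshold) count as
    split points; shorter runs stay inside a piece.
    """
    pieces = []
    pos = 0
    for gap in re.finditer(r"[Nn]{%d,}" % min_run, seq):
        if gap.start() > pos:
            pieces.append((pos, gap.start()))
        pos = gap.end()
    if pos < len(seq):
        pieces.append((pos, len(seq)))
    return pieces
```

With either approach, annotations produced on a piece have to be shifted back by that piece's start coordinate before merging, and, as Seth notes, conflicting models inside overlap regions still need to be reconciled by hand or by a separate rule.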
From john at hovedpuden.dk Wed Jun 28 02:54:40 2017
From: john at hovedpuden.dk (John Damm Sørensen)
Date: Wed, 28 Jun 2017 10:54:40 +0200
Subject: [maker-devel] maker with MPI and perl using threads
Message-ID: <9218f472-d9e4-179c-c712-ddf09e97d8fc@hovedpuden.dk>

Hello,

Recently I assisted one of my customers in solving problems with maker using MPI.

It seems that the main reason for the trouble was maker not initializing the MPI environment for thread-safe execution.

In the MPI.pm module you call MPI_Init, whereas for a threaded environment you should call MPI_Init_thread.

I think it would be a good idea to detect whether the user's Perl has thread support and init MPI accordingly, or clearly state that maker is for unthreaded Perl only.

During the debugging we also found that it was beneficial to have the latest mxm.c installed:

https://community.mellanox.com/thread/3439

Best Regards

John Damm Sørensen

IT consultant

From carsonhh at gmail.com Thu Jun 29 14:43:21 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 29 Jun 2017 14:43:21 -0600
Subject: [maker-devel] maker with MPI and perl using threads
In-Reply-To: <9218f472-d9e4-179c-c712-ddf09e97d8fc@hovedpuden.dk>
References: <9218f472-d9e4-179c-c712-ddf09e97d8fc@hovedpuden.dk>
Message-ID:

MAKER doesn't use threads. It uses forks. There are several reasons for this, including that Perl already performs hidden fork operations every time you call system() or a piped open(). So trying to get forks, threads, and MPI working together is just a mess, and we stuck with just forks and MPI.

With MPI flavors that have direct InfiniBand support, you can get weird errors because of how forks affect registered memory when the MPI flavor uses OpenIB libraries. There is also another issue with how Perl wraps malloc calls that affects registered memory. The best way around both these issues is to disable direct InfiniBand support, and then set the MPI flavor to use -tcp over the IP-over-InfiniBand virtual adapter (usually ib0) or use the ethernet adapter.

Sometime last year after a system update we also got an OpenMPI error (only on CentOS 6) that referred to a file used for Perl threads (which should not even be in use since we are using forks). We worked around that issue by compiling Perl without thread support so that the library couldn't be called.

If you were using OpenMPI and were seeing a reference to Perl threads, then the error you saw may have been related to the latter one I mentioned.

I have tried the MPI_Init_thread option before and ran into issues because of it, without it helping any of the previously mentioned fork-related issues. But that was some time ago, so I could try it as a solution to the last issue I mentioned rather than installing a no-thread version of Perl (if I can ever replicate the error, because it went away when we updated from CentOS 6 to CentOS 7).

Thanks,
Carson

From carsonhh at gmail.com Thu Jun 29 14:56:46 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 29 Jun 2017 14:56:46 -0600
Subject: [maker-devel] maker with MPI and perl using threads
In-Reply-To:
References: <9218f472-d9e4-179c-c712-ddf09e97d8fc@hovedpuden.dk>
Message-ID:

Also, when you see calls like threads->new() in the maker script, it is creating a fork and not a thread. It's a convenience feature activated by the "use forks" line at the top of the script. It allows you to use thread-like syntax when working with forks.

--Carson

From qlian003 at ucr.edu Fri Jun 30 13:30:19 2017
From: qlian003 at ucr.edu (Qihua Liang)
Date: Fri, 30 Jun 2017 12:30:19 -0700
Subject: [maker-devel] Possible ways to improve annotated gene numbers
Message-ID: <82A335D3-085B-414E-802C-8E2918EA7EA0@ucr.edu>

Dear Maker Development Team,

Hi, I am using Maker for annotation and BUSCO to evaluate the completeness. For de novo predictions, I am using Augustus, GeneMark, and SNAP, and the annotated proteins have completeness of ~80%, ~50%, and ~50% respectively. When I concatenate all de novo annotated proteins from these three tools, the completeness is much higher, at ~92%. But for all.maker.proteins.fasta, the completeness is only ~80%.

1. Does this mean that some proteins annotated by Augustus/GeneMark/SNAP are not included in the file all.maker.proteins.fasta? Is it because such excluded proteins do not have hits with the EST evidence?

2. To achieve a higher BUSCO completeness, what possible approaches can be used? Including more EST evidence from other species?

Thank you
Qihua
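
One way to look at question 1 directly is to compare which BUSCO ids are complete in each protein set. The sketch below is only an illustration under stated assumptions: it assumes each BUSCO run produced a tab-separated full table whose first two columns are the BUSCO id and a status such as Complete, Duplicated, Fragmented or Missing, with comment lines starting with '#'. The function names are hypothetical, and the tables are passed in as text so the example stays self-contained.

```python
def complete_buscos(table_text):
    """Collect BUSCO ids whose status is Complete or Duplicated.

    Assumes a tab-separated table: column 1 = BUSCO id, column 2 = status.
    Lines starting with '#' and blank lines are skipped.
    """
    found = set()
    for line in table_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) >= 2 and fields[1] in ("Complete", "Duplicated"):
            found.add(fields[0])
    return found


def lost_in_final_set(predictor_tables, maker_table):
    """Return sorted BUSCO ids complete in any predictor set but not in the MAKER set."""
    union = set()
    for text in predictor_tables:
        union |= complete_buscos(text)
    return sorted(union - complete_buscos(maker_table))
```

The ids returned by lost_in_final_set are the ones worth inspecting by hand in a genome browser, since they show where an ab initio prediction covered a BUSCO that the final MAKER protein set does not. The comparison only locates those cases; it does not by itself say why the predictions were left out.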