From patrick.tranvan at unil.ch Sat Jul 1 06:21:37 2017
From: patrick.tranvan at unil.ch (Patrick Tran Van)
Date: Sat, 1 Jul 2017 11:21:37 +0000
Subject: [maker-devel] Advice on my pipeline
In-Reply-To: <696C51C6-5606-4ECB-A8B8-9C077182FFFA@gmail.com>
References: <6b029690bace4d3fbae77c0bb1bddce8@prdexch02.ad.unil.ch> <1498470630221.84642@unil.ch>, <696C51C6-5606-4ECB-A8B8-9C077182FFFA@gmail.com>
Message-ID: <1498908228256.16549@unil.ch>

So I have assembled my transcriptome with Trinity using the jaccard_clip option, and I have run MAKER with and without correct_est_fusion.

I have then used SNAP to train/filter it with:

maker2zff specie.all.gff

Here are my results:

Number of genes after MAKER -> number of genes after maker2zff

- Without correct_est_fusion: 21621 -> 13875
- With correct_est_fusion: 16850 -> 9098

1) If I understand correctly how correct_est_fusion works, shouldn't it be the other way around, since it prevents gene merging? Normally I should find more genes with correct_est_fusion, right?

2) I think I should find something like 13000-14000 genes for my species. Should I go with the "without correct_est_fusion" run for the 2nd iteration of MAKER?

Thanks for your help

Patrick Tran Van

Groups Chapuisat, Robinson-Rechavi & Schwander
Department of Ecology and Evolution
University of Lausanne
Le Biophore
CH-1015 Lausanne
Switzerland
Office 3206

________________________________
From: Carson Holt
Sent: Monday, June 26, 2017 11:38 PM
To: Patrick Tran Van
Cc: maker-devel at yandell-lab.org
Subject: Re: [maker-devel] Advice on my pipeline

Sorry, the option is -> correct_est_fusion

It is in the maker_opts.ctl file.

I would use both SNAP and Augustus on a few large contigs, then review the results manually. If one of them is not behaving well, then drop it. If both behave well (i.e. correlate well with evidence alignments), then keep them both.

--Carson

On Jun 26, 2017, at 3:48 AM, Patrick Tran Van wrote:

Thanks for your answer.

1) Do you think that adding Augustus training in addition to SNAP at steps 3 and 5 will add more confidence (instead of adding Augustus only for the final round)? I ask because I am using autoAug for this and it takes a while to compute.

2) I don't see this option: 'avoid_est_fusion=1'. I tried to add it but I got this error:

WARNING: Invalid option 'avoid_est_fusion' in control file maker_opts.ctl

(I am using v 2.31.8)

Patrick Tran Van

________________________________
From: Carson Holt
Sent: Monday, June 5, 2017 8:29 PM
To: Patrick Tran Van
Cc: maker-devel at yandell-lab.org
Subject: Re: [maker-devel] Advice on my pipeline

Your plan sounds good. A couple of related notes.

Insect genomes tend to have high gene density, so gene merging will be the primary difficulty. You can avoid merging of mRNA-seq evidence by using options like jaccard_clip in Trinity. Then use avoid_est_fusion=1 inside of MAKER.

Also, it is more convenient to do each run in the same directory rather than supplying the previous run as GFF3 input. MAKER will automatically recycle previous results archived in the run directory when you do this. Using the maker_gff option is really more for getting data into the run from jobs performed a long time ago (so they can't be run in the same directory).

--Carson

On Jun 2, 2017, at 3:56 AM, Patrick Tran Van wrote:

Hello,

This is my first time running Maker for an insect genome annotation.

I have found various resources and tried to build a consensus from them. I am looking for your thoughts and advice about my pipeline: whether I can improve something, or whether I am doing useless things.

What I have:
- RNA evidence: transcriptome
- Protein evidence: swissprot/uniprot + busco protein set of insect
- Cegma and busco results of my genome

1) Train SNAP with CEGMA.

2) Run maker (run A) with repeat masking, with transcript and protein evidence, the new SNAP file (from step 1) and the augustus file (from busco).

3) Create SNAP model from run A.

4) Run maker (run B) with the new SNAP model (from step 3), with est2genome=0 and protein2genome=0, provide the gff file (maker_gff=run_A.gff), turn off repeat masking (rm_pass=1), and use previous mapping results (altest_pass=1 and protein_pass=1).

5) Create SNAP model from run B.

6) Run maker (run C) with the new SNAP model (from step 5), with est2genome=0 and protein2genome=0, provide the gff file (maker_gff=run_B.gff), turn off repeat masking (rm_pass=1), and use previous mapping results (altest_pass=1 and protein_pass=1).

7) Create SNAP model from run C AND create Augustus gene model from run C.

8) Run maker (run D) with the new SNAP model (from step 7) + the AUGUSTUS file (from step 7), with est2genome=0 and protein2genome=0, provide the gff file (maker_gff=run_C.gff), turn off repeat masking (rm_pass=1), and use previous mapping results (altest_pass=1 and protein_pass=1). Also use keep_preds=1.

Does this seem coherent?

Cheers,

Patrick Tran Van

_______________________________________________
maker-devel mailing list
maker-devel at box290.bluehost.com
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
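
The control-file changes listed in the plan above (est2genome=0, protein2genome=0, and the correct_est_fusion switch named later in the thread) can be scripted between runs. What follows is only a minimal sketch, assuming maker_opts.ctl holds one `key=value #comment` entry per line. The helper name `set_ctl_options` and the comment text in the sample are hypothetical; only the option names come from this thread. Unknown keys raise an error rather than being appended, since MAKER rejects option names it does not recognise (as with 'avoid_est_fusion' above).

```python
import re


def set_ctl_options(ctl_text, updates):
    """Return ctl_text with each `key=value` line rewritten from `updates`.

    Any trailing `#comment` on a rewritten line is preserved. Keys that do
    not occur in the text raise KeyError instead of being appended.
    """
    remaining = dict(updates)
    out = []
    for line in ctl_text.splitlines():
        m = re.match(r"^(\w+)=([^#]*)(#.*)?$", line)
        if m and m.group(1) in remaining:
            key = m.group(1)
            comment = m.group(3) or ""
            line = f"{key}={remaining.pop(key)} {comment}".rstrip()
        out.append(line)
    if remaining:
        raise KeyError(f"options not found in control file: {sorted(remaining)}")
    return "\n".join(out) + "\n"


# Hypothetical excerpt standing in for a real maker_opts.ctl.
example = (
    "est2genome=1 #infer gene predictions directly from ESTs\n"
    "protein2genome=1 #infer predictions from protein homology\n"
    "correct_est_fusion=0 #limits use of ESTs to avoid fusion genes\n"
)
updated = set_ctl_options(
    example, {"est2genome": 0, "protein2genome": 0, "correct_est_fusion": 1}
)
print(updated, end="")
```

In practice the text would be read from and written back to the maker_opts.ctl of the run directory; keeping every run in the same directory, as advised above, means only these few lines change between iterations.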
From carsonhh at gmail.com Sat Jul 1 12:41:28 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Sat, 1 Jul 2017 11:41:28 -0600
Subject: [maker-devel] maker with MPI and perl using threads
References: <9218f472-d9e4-179c-c712-ddf09e97d8fc@hovedpuden.dk>

FindBin is necessary for library control and is safe to load before forks.

--Carson

From john at hovedpuden.dk Sat Jul 1 12:38:14 2017
From: john at hovedpuden.dk (John Damm Sørensen)
Date: Sat, 1 Jul 2017 19:38:14 +0200
Subject: [maker-devel] maker with MPI and perl using threads
References: <9218f472-d9e4-179c-c712-ddf09e97d8fc@hovedpuden.dk>

Thanks Carson,

One thing bothers me. It is this passage from the Perl forks documentation:

    module load order: forks first

    Since forks overrides core Perl functions, you are *strongly* encouraged to load the forks module before any other Perl modules. This will insure the most consistent and stable system behavior. This can be easily done without affecting existing code, like:

    perl -Mforks script.pl

But in the maker Perl script, the module FindBin, which in turn loads a bunch of other modules, is loaded before forks.

Is that intentional?

Best

John

On 29-06-2017 at 22:56, Carson Holt wrote:
> Also, when you see calls like threads->new() in the maker script, it is creating a fork and not a thread. It's a convenience feature activated by the 'use forks' line at the top of the script. It allows you to use thread-like syntax when working with forks.
>
> --Carson
>
>> On Jun 29, 2017, at 2:43 PM, Carson Holt wrote:
>>
>> MAKER doesn't use threads. It uses forks. There are several reasons for this, including that Perl already requires you to use hidden fork operations every time you call system() or open(). So trying to get forks, threads, and MPI together is just a mess. So we stuck with just forks and MPI. With MPI flavors with direct infiniband support you can get weird errors because of how forks affect registered memory when the MPI flavor uses OpenIB libraries. There is also another issue with how perl wraps malloc calls that affects registered memory. The best way around both these issues is to disable direct infiniband support, and then set the MPI flavor to use -tcp over the ip-over-infiniband virtual adaptor (usually ib0) or use the ethernet adapter. Sometime last year, after a system update, we also got an OpenMPI error (only on CentOS6) that referred to a file used for Perl threads (which should not even be in use since we are using forks). We worked around that issue by compiling Perl without thread support so that the library couldn't be called.
>>
>> If you were using OpenMPI and were seeing a reference to Perl threads, then the error you saw may have been related to the latter one I mentioned.
>>
>> I have tried the MPI_Init_thread option before and ran into issues because of it, without it helping any of the previously mentioned fork-related issues. But that was some time ago, so I could try it as a solution to the last issue I mentioned rather than installing a no-thread version of perl (if I can ever replicate the error, because it went away when we updated from CentOS kernel 6 to kernel 7).
>>
>> Thanks,
>> Carson
>>
>>> On Jun 28, 2017, at 2:54 AM, John Damm Sørensen wrote:
>>>
>>> Hello,
>>>
>>> Recently I assisted one of my customers in solving problems with maker using MPI.
>>>
>>> It seems that the main reason for the trouble was maker not initializing the MPI environment for thread-safe execution.
>>>
>>> In the MPI.pm module you call MPI_Init, whereas for a threaded environment you should call MPI_Init_thread.
>>>
>>> I think it would be a good idea to detect whether the user's Perl has thread support and init MPI accordingly, or to clearly state that maker is for unthreaded Perl only.
>>>
>>> During the debugging we also found that it was beneficial to have the latest mxm.c installed:
>>>
>>> https://community.mellanox.com/thread/3439
>>>
>>> Best Regards
>>>
>>> John Damm Sørensen
>>>
>>> IT consultant

From carsonhh at gmail.com Mon Jul 3 15:50:21 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 3 Jul 2017 14:50:21 -0600
Subject: [maker-devel] Advice on my pipeline
In-Reply-To: <1498908228256.16549@unil.ch>
References: <6b029690bace4d3fbae77c0bb1bddce8@prdexch02.ad.unil.ch> <1498470630221.84642@unil.ch> <696C51C6-5606-4ECB-A8B8-9C077182FFFA@gmail.com> <1498908228256.16549@unil.ch>
Message-ID: <58E904BF-9AB8-4AC7-B10B-C902F414E03D@gmail.com>

maker2zff is just for SNAP training and not for gene filtering (please do not use it for filtering, it does not do what you think). So the final annotation set after maker with correct_est_fusion is 16,850. To decide which set is better, look at them in a browser (gene counts are not useful for gauging results). A well annotated genome will have evidence clusters that closely match the final models. A poorly annotated genome will have evidence clusters that are split or merged by the models.

The correct_est_fusion option does two things. It trims long overlapping UTR fragments, and it stops evidence clusters from being merged on BLASTP evidence alone (so gene predictors will get unmerged hint regions if clusters are split).

You may also find that using jaccard_clip with Trinity has reduced sensitivity for the transcript data (you may lose things that were there before, but now have better specificity, i.e. fewer false positives). Make sure you provided protein data from at least two related species to help maintain the sensitivity lost from the transcript data.
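
For reference, a figure like the 16,850 above can be taken straight from the merged GFF3 rather than from maker2zff output. This is a minimal sketch, assuming final annotations carry the source `maker` and the type `gene` in columns 2 and 3; the function name and the sample rows are made up for illustration. As noted above, the count by itself says little about annotation quality.

```python
def count_maker_genes(gff3_lines):
    """Count feature rows whose source is 'maker' and whose type is 'gene'."""
    count = 0
    for line in gff3_lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence follows; no more feature rows
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 3 and cols[1] == "maker" and cols[2] == "gene":
            count += 1
    return count


# Made-up rows: two final gene models plus evidence and child features.
sample = [
    "##gff-version 3\n",
    "contig1\tmaker\tgene\t100\t900\t.\t+\t.\tID=g1\n",
    "contig1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=g1-mRNA-1;Parent=g1\n",
    "contig1\tsnap_masked\tmatch\t100\t900\t.\t+\t.\tID=m1\n",
    "contig2\tmaker\tgene\t50\t400\t.\t-\t.\tID=g2\n",
    "##FASTA\n",
    ">contig1\n",
]
print(count_maker_genes(sample))
```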
You can also add rejected gene models back in after the fact by using iprscan to identify unsupported models with identifiable protein domains -> https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4286374/

Thanks,
Carson

From carsonhh at gmail.com Mon Jul 3 16:04:40 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 3 Jul 2017 15:04:40 -0600
Subject: [maker-devel] Possible ways to improve annotated gene numbers
In-Reply-To: <82A335D3-085B-414E-802C-8E2918EA7EA0@ucr.edu>
References: <82A335D3-085B-414E-802C-8E2918EA7EA0@ucr.edu>
Message-ID: <903B12C5-CC57-46F3-B3E6-1322C9155F2F@gmail.com>

MAKER excludes models without evidence support (this is because gene predictors can overcall by as much as a factor of 10, i.e. lots of false positives). So you may be lacking in either protein or transcript evidence (you should always supply a minimum of 2 related proteomes for any MAKER analysis; transcript evidence by itself is insufficient). You can also try to rescue models based on protein domain content using iprscan. Details are in this protocol paper -> https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4286374/

--Carson

> On Jun 30, 2017, at 1:30 PM, Qihua Liang wrote:
>
> Dear Maker Development Team,
>
> Hi, I am using Maker for annotation and BUSCO to evaluate the completeness.
>
> For de novo predictions, I am using Augustus, GeneMark, and SNAP, and the annotated proteins have completeness of ~80%, ~50%, and ~50% respectively. When I cat all de novo annotated proteins of these three tools, the completeness is much higher, at ~92%.
>
> But for all.maker.proteins.fasta, the completeness is only ~80%.
>
> 1. Does this mean that some proteins annotated by Augustus/GeneMark/SNAP are not included in the file all.maker.proteins.fasta? Is it because such excluded proteins do not have hits with the EST evidence?
>
> 2. To achieve a higher BUSCO completeness, what possible ways can be used? Including more EST evidence from other species?
>
> Thank you
> Qihua

From xvazquezc at gmail.com Tue Jul 4 23:05:10 2017
From: xvazquezc at gmail.com (Xabier Vázquez-Campos)
Date: Wed, 5 Jul 2017 14:05:10 +1000
Subject: [maker-devel] advanced repeat libraries

Hi,

I'm dealing with a fungal genome with at least 40% repeats, so I'm trying to follow the advanced repeat construction protocol.

So far, so good, but I have doubts about how to build the protein database as explained at the end of the page.

In summary:

1. Get SwissProt and RefSeq fungal proteins.
2. tblastn (from 1) against the EST-NCBI database and keep the matches.
3. blastp the output from 2 against the transposase protein db. Remove matches.

But from here on I'm a bit lost...

"Finally, the rice protein sequences were compared with verified transposons (such as Pack-MULEs) in the rice genome. If the protein sequence matched a transposon perfectly and was the only perfect match in the genome, the relevant protein sequence was excluded. Although elements such as Pack-MULEs contain true gene sequences, the annotation (the protein sequence in the database) often extends to non-gene sequences such as terminal inverted repeat or sub-terminal repeat, which are not true plant proteins and would cause great complications. As a result, it is essential to exclude them."

Are the proteins kept at the end of step 3 the 'protein database'?
Could you provide a bit more detail on how to tackle this?

Thank you in advance,

Xabi

--
Xabier Vázquez-Campos, PhD
Research Associate
NSW Systems Biology Initiative
School of Biotechnology and Biomolecular Sciences
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

From tfallon at mit.edu Thu Jul 6 07:45:20 2017
From: tfallon at mit.edu (Tim Fallon)
Date: Thu, 6 Jul 2017 08:45:20 -0400
Subject: [maker-devel] Maker protein match & tandem similar genes
References: <6AE52D93-3F7D-4389-BE7B-BD3033F3F316@mit.edu> <6DBD3F6E-C2DA-47AB-AF6B-DEA396789C79@gmail.com> <7164F2B4-FD8A-466F-B7CA-A1BEEF98223F@mit.edu>

Hi Carson,

This region is definitely entirely correct at the genomic nucleotide level, with no misassemblies. Would you have any strong reservations about ditching the ab-initio prediction and sticking entirely with the est2genome and protein2genome predictions? Right now this is what I'm thinking, as troubleshooting the ab-initio training seems like it could be a long road.

All the best,
-Tim

> On Jun 26, 2017, at 6:00 PM, Carson Holt wrote:
>
> Augustus uses an HMM with scoring bonuses for evidence match. If a difference in the assembly breaks the ORF anywhere in the transcript relative to the evidence, or removes high scoring transcript start/stop sequences, then Augustus will add/skip exons or trim/extend transcripts to capture what scoring bonuses it can as best it can. So wherever you see Augustus behaving weirdly, you likely have something off in the assembly (a small stretch of NN's, or single basepair duplications/deletions that affect the ORF and scoring model). So what Augustus produces is the best-fit gene model to hop around assembly anomalies while still producing a canonical model.
>
> In areas like the ones I describe above, EVM refuses to produce any model. So you can experiment with the EVM options in MAKER3, but what you may find is that problem regions tend to get no models with EVM. I believe using the pred_gff trick I mentioned previously may be the easiest workaround. Also make sure to prefilter mRNA-seq evidence to avoid transcript joining (trinity has a jaccard_clip option which can help). Because if you are getting transcript joining in proteins, you are almost certain to get it in transcript evidence as well.
>
> --Carson
>
>> On Jun 22, 2017, at 10:59 PM, Tim Fallon wrote:
>>
>> Hi Carson,
>>
>> Thanks for the response! After sending my initial email, I did notice this particular issue was warned about in the Campbell et al. 2014 Maker protocols paper. Perhaps future versions of the pipeline might have a workaround or warning for this presumably common issue. At least in my case, the genome I'm annotating has large introns, and also tandem gene clusters of homologous genes, so I've been unable to solve this issue entirely by changing existing parameters (e.g. split_hit), though perhaps exonerate / protein2genome direct gene annotation does handle it correctly.
>>
>> Regarding protein2genome only being an intermediate stage: as I've been working towards a final annotation, I've actually been mostly relying on the protein2genome direct gene annotation. Although I have a trained Augustus that is presumably getting the hints from the evidence, my main target genes have been producing subtly wrong gene models (Augustus produced a splice site off by a handful of nucleotides, leading to unintended & unsupported amino acids in the protein). I also trained SNAP, but those predictions were worse than the Augustus predictions.
>>
>> Do you have any tips for using the Evidence Modeler integration of the Maker 3.0.0 beta? That seems to be the best way to have the final gene models rely more on extrinsic evidence than on my mildly incorrect ab-initio predictions. Or perhaps PASA is more appropriate for gene models that would strictly adhere to extrinsic de novo assembled transcript / predicted ORF evidence?
>>
>> All the best,
>> -Tim
>>
>>> On Jun 23, 2017, at 12:27 AM, Carson Holt wrote:
>>>
>>> The protein_match features are the direct BLASTX results. Because of how BLAST works, if you have neighboring paralogs, it can place HSPs in both. So the final hit ends up being too large. The protein2genome feature is then the result of exonerate polishing these blast alignments (this will usually remove false merging and bad exon order). The protein2genome=1 option, on the other hand, just tells maker that you want to try to convert the exonerate hits directly into gene models (only do this for training and not final annotation). One way to drop the BLASTX results may be to filter the GFF3 results to keep only protein2genome features, pass those into protein_gff, and then turn off protein= for the next run. This forces the blastx results to be dropped. You may want to set blast_depth parameters to something like 10 in maker_bopts.ctl before doing this, to trim per-locus evidence depth to 10 if you are using too much input data.
>>>
>>> --Carson
>>>
>>>> On Jun 13, 2017, at 11:35 AM, Tim Fallon wrote:
>>>>
>>>> Hi there,
>>>>
>>>> I am aligning reference proteins to an insect genome through Maker, in preparation for using the gene models from the protein alignments as evidence to train SNAP (alongside de novo assembled RNA-Seq). I also plan on passing the protein alignments to a future Maker run as hints for SNAP / Augustus.
>>>>
>>>> I've noticed that the maker blastx "protein_match" feature, which I presume is a result of Maker trying to make the blastx HSPs contiguous to format as a reference for exonerate (this Maker run did have protein2genome turned on), tends to fuse tandem genes from the same gene family. See attached image.
>>>>
>>>> The red regions highlight two de novo assembled transcripts which I aligned manually, from two genes that are homologous. The top track is the blastx "match_part" features, the bottom track is the blastx "protein_match" features. You can see that the protein_match fuses the two genes, using ~1000 bp in an intervening region that doesn't have blastx HSP support in the blastx "match_part" track. The trick seems to be that a single reference protein has blastx matches on both the left and right gene.
>>>>
>>>> Clearly this isn't a good gene model to train SNAP with, but would this misannotation screw up the hints passed to pretrained SNAP / Augustus?
>>>>
>>>> Is there any way to prevent this protein_match fusing of adjacent similar genes from happening? For species that are closer, I've set "eval_blastx" to be a lot more stringent (1e-50), and in that case the genes don't get fused (but, with that level of stringent search, it is more like an orthology search, rather than just annotating general protein similarity). I do have (rare) introns ~1000 bp, so I wouldn't want to change the Maker "split_hit" parameter to be too low.
>>>>
>>>> All the best,
>>>> -Tim
>>>>
>>>> Timothy R. Fallon
>>>> PhD candidate
>>>> Laboratory of Jing-Ke Weng
>>>> Department of Biology
>>>> MIT
>>>>
>>>> tfallon at mit.edu

Timothy R. Fallon
PhD candidate
Laboratory of Jing-Ke Weng
Department of Biology
MIT
tfallon at mit.edu
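
The GFF3 filtering step suggested in the quoted reply above (keep only the protein2genome features, then hand them to the next run through protein_gff) might look roughly like this. It is a minimal sketch, assuming the MAKER GFF3 puts the evidence source (`blastx`, `protein2genome`) in column 2; the function name `keep_source` and the sample rows are hypothetical.

```python
def keep_source(gff3_lines, source="protein2genome"):
    """Return a GFF3 header plus the feature rows whose source matches."""
    kept = ["##gff-version 3\n"]
    for line in gff3_lines:
        if line.startswith("##FASTA"):
            break  # stop before any embedded sequence section
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments, and blank lines
        cols = line.split("\t")
        if len(cols) == 9 and cols[1] == source:
            kept.append(line)
    return kept


# Made-up rows: one fused blastx hit and one exonerate-polished hit.
sample = [
    "##gff-version 3\n",
    "contig1\tblastx\tprotein_match\t100\t5000\t.\t+\t.\tID=b1;Name=P1\n",
    "contig1\tprotein2genome\tprotein_match\t100\t2000\t.\t+\t.\tID=p1;Name=P1\n",
    "contig1\tprotein2genome\tmatch_part\t100\t400\t.\t+\t.\tID=p1:e1;Parent=p1\n",
    "##FASTA\n",
    ">contig1\n",
]
filtered = keep_source(sample)
print("".join(filtered), end="")
```

The filtered lines would then be written to a file and named in protein_gff, with protein= left empty for the next run, as described in the quoted advice.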
Name: smime.p7s
Type: application/pkcs7-signature
Size: 1849 bytes
Desc: not available
URL: 

From labovolenta at gmail.com  Mon Jul 10 13:57:21 2017
From: labovolenta at gmail.com (Luiz Augusto Bovolenta)
Date: Mon, 10 Jul 2017 15:57:21 -0300
Subject: [maker-devel] Error "Assertion ((sv)->sv_flags &" failed: file "mg.c"
Message-ID: 

Hi colleagues.

I recently installed MAKER, following the manual steps for its dependencies. However, when I try to execute the maker command I receive this error:

Assertion ((sv)->sv_flags & (0x00200000|0x00400000|0x00800000)) failed: file "mg.c", line 88 at /usr/lib/perl5/site_perl/5.10.0/Sys/SigAction.pm line 145.
Compilation failed in require at ./maker line 45.
BEGIN failed--compilation aborted at ./maker line 45.

Does anyone have an idea what causes this error?

Best regards
Luiz
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From carsonhh at gmail.com  Mon Jul 10 14:10:51 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 10 Jul 2017 13:10:51 -0600
Subject: [maker-devel] Error "Assertion ((sv)->sv_flags &" failed: file "mg.c"
In-Reply-To: 
References: 
Message-ID: 

If you are installing without MPI support, then something is wrong with your perl installation or with one of the modules installed with your perl. You may want to reinstall perl, or try to reinstall the modules listed in the error one at a time using CPAN (use 'force install' to force a reinstall). Modules to try (some were given by name and others by line in your error):

forks
forks::shared
Sys::SigAction

Alternatively, if this is an MPI install, make sure you have added the required environment variables (e.g. LD_PRELOAD for OpenMPI) and command line flags (e.g. -mca btl ^openib) listed in the .../maker/INSTALL file, and that you are not running an incompatible MPI flavor such as MVAPICH2 (also explained in the .../maker/INSTALL file).

--Carson

From carsonhh at gmail.com  Mon Jul 10 14:20:15 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 10 Jul 2017 13:20:15 -0600
Subject: [maker-devel] Maker protein match & tandem similar genes
In-Reply-To: 
References: <6AE52D93-3F7D-4389-BE7B-BD3033F3F316@mit.edu> <6DBD3F6E-C2DA-47AB-AF6B-DEA396789C79@gmail.com> <7164F2B4-FD8A-466F-B7CA-A1BEEF98223F@mit.edu>

Message-ID: <4D5E9712-E95B-4687-8706-2AB445191C89@gmail.com>

Models from est2genome and protein2genome will almost always be partial. Also, the error rate on draft assemblies is much higher than most people realize. Beyond the issues already mentioned in the previous e-mail, organisms are diploid but the assembly is haploid, so variation gets squashed, which also breaks ORFs (there are several examples of this in both the mature human and mouse genome assemblies). For many draft assemblies, you can expect ORF-affecting errors in as many as 10-15% of your annotations.

Try opening the problem cases and manually editing them in Apollo. Possible sources of the sequence guiding the annotation may become more apparent (look at mismatches in the mRNA-seq alignments relative to the assembly, for example). If not, and the region is just too complex for the predictor, then you can force the model with Apollo.

--Carson

> On Jul 6, 2017, at 6:45 AM, Tim Fallon wrote:
>
> Hi Carson,
>
> This region is definitely entirely correct at the genomic nucleotide level, no misassemblies. Would you have any strong reservations about ditching the ab-initio prediction and sticking entirely with the est2genome predictions and protein2genome predictions? Right now this is what I'm thinking, as troubleshooting the ab-initio training seems like it could be a long road.
>
> All the best,
> -Tim
>
>> On Jun 26, 2017, at 6:00 PM, Carson Holt wrote:
>>
>> Augustus uses an HMM with scoring bonuses for evidence match. If a difference in the assembly breaks the ORF anywhere in the transcript relative to the evidence or removes high scoring transcript start/stop sequences, then Augustus will add/skip exons or trim/extend transcripts to capture what scoring bonuses it can as best it can. So wherever you see Augustus behaving weirdly, you likely have something off in the assembly (small stretches of NNs or single base pair duplications/deletions that affect the ORF and scoring model). So what Augustus produces is the best-fit gene model to hop around assembly anomalies while still producing a canonical model.
>>
>> In areas like the ones I describe above, EVM refuses to produce any model. So you can experiment with the EVM options in MAKER3, but what you may find is that problem regions tend to get no models with EVM. I believe using the pred_gff trick I mentioned previously may be the easiest workaround. Also make sure to prefilter mRNA-seq evidence to avoid transcript joining (Trinity has a jaccard_clip option which can help). Because if you are getting transcript joining in proteins, you are almost certain to get it in transcript evidence as well.
>>
>> --Carson
>>
>>> On Jun 22, 2017, at 10:59 PM, Tim Fallon wrote:
>>>
>>> Hi Carson,
>>>
>>> Thanks for the response! After sending my initial email, I did notice this particular issue was warned about in the Campbell et al. 2014 Maker protocols paper. Perhaps future versions of the pipeline might have a workaround or warning for this presumably common issue. At least in my case, the genome I'm annotating has large introns, and also tandem gene clusters of homologous genes, so I've been unable to solve this issue entirely by changing existing parameters (e.g. split_hit), though perhaps exonerate / protein2genome direct gene annotation does handle it correctly.
>>>
>>> Regarding the protein2genome only being an intermediate stage, as I've been working towards a final annotation, I've actually been mostly relying on the protein2genome direct gene annotation, as although I have a trained Augustus that is presumably getting the hints from the evidence, my main target genes have been producing subtly wrong gene models (Augustus produced splice sites off by a handful of nucleotides, leading to unintended & unsupported amino acids in the protein). I also trained SNAP, but those predictions were worse than the Augustus predictions.
>>>
>>> Do you have any tips for using the Evidence Modeler integration of the Maker 3.0.0 beta? That seems to be the best way to have the final gene models rely more on extrinsic evidence over my mildly incorrect ab-initio predictions. Or perhaps PASA is more appropriate for gene models that would strictly adhere to extrinsic de novo assembled transcript / predicted ORF evidence?
>>>
>>> All the best,
>>> -Tim
>>>
>>>> On Jun 23, 2017, at 12:27 AM, Carson Holt wrote:
>>>>
>>>> The protein_match features are the direct BLASTX results. Because of how BLAST works, if you have neighboring paralogs, it can place HSPs in both. So the final hit ends up being too large. The protein2genome feature is then the result of exonerate polishing these blast alignments (this will usually remove false merging and bad exon order). The protein2genome=1 option, on the other hand, just tells maker that you want to try to convert the exonerate hits directly into gene models (only do this for training and not final annotation). One way to drop the BLASTX results may be to filter the GFF3 results to keep only protein2genome features, pass those into protein_gff, and then turn off protein= for the next run. This forces the blastx results to be dropped. You may want to set blast_depth parameters to something like 10 in maker_bopts.ctl before doing this to trim per locus evidence depth to 10 if you are using too much input data.
>>>>
>>>> --Carson
>>>>
>>>>
>>>>
>>>>> On Jun 13, 2017, at 11:35 AM, Tim Fallon wrote:
>>>>>
>>>>> Hi there,
>>>>>
>>>>> I am aligning reference proteins to an insect genome through Maker, in preparation for using the gene models from the protein alignments as evidence to train SNAP (alongside de-novo assembled RNA-Seq). I also plan on passing the protein alignments to a future Maker run as hints for SNAP / Augustus.
>>>>>
>>>>> I've noticed that the maker blastx "protein_match" feature, which I presume is a result of Maker trying to make the blastx HSPs contiguous to format as a reference for exonerate (this Maker run did have protein2genome turned on), tends to fuse tandem genes from the same gene family. See attached image.
>>>>>
>>>>> The red regions highlight two de novo assembled transcripts which I aligned manually, from two genes that are homologous. The top track is the blastx "match_part" features, the bottom track is the blastx "protein_match" features. You can see that the protein_match fuses the two genes, using ~1000 bp in an intervening region that doesn't have blastx HSP support in the blastx "match_part" track. The trick seems to be that a single reference protein has blastx matches on both the left and right gene.
>>>>>
>>>>> Clearly this isn't a good gene model to train SNAP with, but would this misannotation screw up the hints passed to pretrained SNAP / Augustus?
>>>>>
>>>>> Is there any way to prevent this protein_match fusing of adjacent similar genes from happening? For species that are closer, I've set the "eval_blastx" to be a lot more stringent (1e-50), and in that case the genes don't get fused (but, with that level of stringent search, it is more like an orthology search, rather than just annotating general protein similarity). I do have (rare) introns ~1000 bp, so I wouldn't want to change the Maker "split_hit" parameter to be too low.
>>>>>
>>>>> All the best,
>>>>> -Tim
>>>>>
>>>>> Timothy R. Fallon
>>>>> PhD candidate
>>>>> Laboratory of Jing-Ke Weng
>>>>> Department of Biology
>>>>> MIT
>>>>>
>>>>> tfallon at mit.edu
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> maker-devel mailing list
>>>>> maker-devel at box290.bluehost.com
>>>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>>>
>>>
>>> Timothy R. Fallon
>>> PhD candidate
>>> Laboratory of Jing-Ke Weng
>>> Department of Biology
>>> MIT
>>>
>>> tfallon at mit.edu
>>
>
> Timothy R. Fallon
> PhD candidate
> Laboratory of Jing-Ke Weng
> Department of Biology
> MIT
>
> tfallon at mit.edu
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From carson.holt at genetics.utah.edu  Thu Jul 13 12:00:18 2017
From: carson.holt at genetics.utah.edu (Carson Holt)
Date: Thu, 13 Jul 2017 17:00:18 +0000
Subject: [maker-devel] Question regarding MAKER
In-Reply-To: 
References: 
Message-ID: 

est2genome and protein2genome take BLAST hits, polish them with exonerate around splice sites, and then turn the alignment directly into a gene model. So if the alignment is partial, because the EST or mRNA-seq data do not cross the entire transcript or the protein homology does not cross the entire CDS, then the resulting model will be partial. But hundreds of models, even partial ones, are sufficient to train SNAP. Then I usually do just one round of bootstrap training (more than that and you get into the overtraining paradox).

So you can use just est2genome, just protein2genome, or both. You just need something to train SNAP with.

--Carson

On Jul 11, 2017, at 3:37 PM, Ghosh, Arnab wrote:

Hi Carson,

My name is Arnab and I am from Texas Tech University. I am using MAKER for gene annotation in a new genome assembly for a non-model organism. I have mostly figured out everything about this amazing piece of software but have two questions.

1. Is it okay to use only est2genome=1 and leave protein2genome=0 in the first round of running MAKER? Will it hurt my prediction and eventual annotation of genes if I don't use the protein2genome option ALONGSIDE est2genome in the first round? I have a protein fasta file for the same organism, but using the transcript fasta file (same organism) AND the protein fasta file for the whole genome (~2.2 GB in size) is just taking too long to finish.

2. I will of course run SNAP in the second round, which leads me to my second question: what, in your view, is an acceptable number of iterations for bootstrapping SNAP with MAKER?

Thanks and regards
Arnab
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From zhushilin at frasergen.com  Wed Jul 12 01:19:12 2017
From: zhushilin at frasergen.com (zhushilin at frasergen.com)
Date: Wed, 12 Jul 2017 14:19:12 +0800
Subject: [maker-devel] some suggestion
Message-ID: <2017071213575801507119@frasergen.com>

Dear developer,

It seems that MAKER can only run on an ordinary hard disk that supports structured data, because SQLite is used. When running on a Lustre filesystem, we got an I/O error and nothing was written to the .db file that stores the GFF information.

Perhaps the best approach would be to detect the filesystem automatically and use a different strategy for storing the GFF information.

Best wishes

Shilin Zhu
R&D director
DEPT. of Bioinformatics
Wuhan Frasergen Bioinformatics Co., Ltd
B8 building, Biolake, 666 Gaoxin Road, Wuhan East Lake High-tech Zone, Wuhan 430075, China
T: 027-87224705
M: +86 18502745140
F: 027-87224785
E: service at frasergen.com
W: http://www.frasergen.com

Disclaimer
This e-mail is intended to be used only by persons entitled to receive such information and may contain information that is confidential, proprietary, and/or legally privileged. If you are not the intended recipient, you are hereby notified that any use, retention, disclosure, dissemination, copying, or taking any other action in reliance on contents of this e-mail is prohibited. If you have received this e-mail in error, please immediately contact the sender and delete the e-mail from your mailbox or any other storage mechanism. Thank you!
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From carsonhh at gmail.com  Tue Jul 18 00:06:10 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 17 Jul 2017 23:06:10 -0600
Subject: [maker-devel] some suggestion
In-Reply-To: <2017071213575801507119@frasergen.com>
References: <2017071213575801507119@frasergen.com>
Message-ID: <7A0B500C-B473-4587-9302-E076E42E7734@gmail.com>

The system we commonly use with MAKER is Lustre and we get no issues. We also commonly run MAKER on TACC, which uses Lustre for all its file systems. So there is no Lustre limitation.

MAKER does require that each node also have a local temporary directory for some operations that can generate high IOPS or that require traditional flock support (which cannot be guaranteed on some NFS systems). These operations occur in the location specified by TMP= in the control file. Perhaps you are attempting to set your TMP value in the control files to a Lustre space, which can overload the MDS (metadata server) used by Lustre. Make sure you do not set TMP to a shared location. Your working directory can be a shared Lustre space and result files will be stored there, but IO operations that are not safe for shared spaces will occur in TMP, and TMP must be set to a local storage location (usually /tmp).

--Carson
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From cjfields at illinois.edu  Tue Jul 18 14:18:25 2017
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Tue, 18 Jul 2017 19:18:25 +0000
Subject: [maker-devel] some suggestion
In-Reply-To: <7A0B500C-B473-4587-9302-E076E42E7734@gmail.com>
References: <2017071213575801507119@frasergen.com> <7A0B500C-B473-4587-9302-E076E42E7734@gmail.com>
Message-ID: <56D57BD0-7102-4B7A-BFC4-3A72C2678CE4@illinois.edu>

We've worked with MAKER on Lustre and GPFS without significant issues (we're also in the process of setting up MAKER on a cluster using Ceph). I've always found the trickiest parts of initial MAKER setup and testing to be making sure the proper tempfile space is set (pointing to local disk or /dev/shm) and wrangling MPI issues.
chris
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From patrick.tranvan at unil.ch  Wed Jul 19 08:11:58 2017
From: patrick.tranvan at unil.ch (Patrick Tran Van)
Date: Wed, 19 Jul 2017 13:11:58 +0000
Subject: [maker-devel] MAKER annotation post processing
Message-ID: 

Hi,

I have successfully annotated my genome with MAKER. Now I have a GFF file that I want to post-process / filter.

In particular, I would like to discard genes that are below a certain AED score.

1) Is there an AED threshold beyond which a gene is not strongly supported? If so, do you have a reference for it?

2) Is there a script or software to process a GFF file?

Thanks

Patrick Tran Van

Groups Chapuisat, Robinson-Rechavi & Schwander
Department of Ecology and Evolution
University of Lausanne
Le Biophore
CH-1015 Lausanne
Switzerland
Office 3206
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From michael.s.campbell1 at gmail.com  Wed Jul 19 12:20:07 2017
From: michael.s.campbell1 at gmail.com (Michael Campbell)
Date: Wed, 19 Jul 2017 13:20:07 -0400
Subject: [maker-devel] MAKER annotation post processing
In-Reply-To: 
References: 
Message-ID: 

Hi Patrick,

For point 1, the best AED cutoff to use is quite arbitrary. For one of the last genomes that I annotated, we had a set of high quality genes identified based on synteny with genes in closely related genomes. We plotted the distribution of AEDs for those genes and found that a cutoff of 0.28 captured 98% of the high quality genes. This value would vary based on the evidence provided. I've used 0.5 in the past as a more permissive filter.

For point 2, there is an accessory script in the MAKER bin called quality_filter.pl. It has an option (-a) that allows you to put in an AED cutoff, and it will filter the gff3 file based on that cutoff. For general processing of GFF3 files, there is a perl library called GAL that is useful if you write code in perl.

Take care,
Mike
-------------- next part --------------
An HTML attachment was scrubbed...
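For readers who want to see what AED-based filtering amounts to, here is a minimal Python sketch. It is not the quality_filter.pl script mentioned above and does not reproduce its options. It only assumes that mRNA features in the MAKER GFF3 carry an `_AED` attribute in column 9, and that a lower AED means better agreement with the evidence (0 is best, 1 is no support), so a transcript is kept when its AED is at or below the cutoff. The function names and the example records are made up for illustration.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict of key -> value."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = value
    return attrs


def transcripts_passing_aed(gff_lines, max_aed=0.5):
    """Return IDs of mRNA features whose _AED is at or below max_aed."""
    kept = []
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence section: no more features follow
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        attrs = parse_attributes(cols[8])
        # Assumption: MAKER writes the score as "_AED=<float>" on mRNA lines.
        if "_AED" in attrs and float(attrs["_AED"]) <= max_aed:
            kept.append(attrs.get("ID", ""))
    return kept


if __name__ == "__main__":
    # Hypothetical records, tab-separated as GFF3 requires.
    example = [
        "##gff-version 3",
        "contig1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=gene1-mRNA-1;Parent=gene1;_AED=0.12;_eAED=0.12",
        "contig1\tmaker\tmRNA\t2000\t2900\t.\t-\t.\tID=gene2-mRNA-1;Parent=gene2;_AED=0.75;_eAED=0.80",
    ]
    print(transcripts_passing_aed(example, max_aed=0.5))
```

The sketch only reports which transcripts pass. Writing a filtered GFF3 would also require keeping the parent gene and the child exon/CDS records of each retained mRNA, which is the bookkeeping a dedicated script handles for you.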
URL: 

From qwzhang0601 at gmail.com  Tue Jul 25 16:48:45 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Tue, 25 Jul 2017 17:48:45 -0400
Subject: [maker-devel] Repeat annotation by Maker2
Message-ID: 

Hello:

We want to summarize the repeat statistics for a genome annotated by Maker2, but we are not clear on what the annotation means. Would you explain? Many thanks!

Let me take this example:

CasCan_contig_16053 repeatmasker match 35887 35996 423 + . ID=CasCan_contig_16053:hit:51261:1.3.0.0;Name=species:Charlie4z|genus:DNA%2FhAT-Charlie;Target=species:Charlie4z|genus:DNA%2FhAT-Charlie 48 161 +

(1) "35887" and "35996" are the start and end positions of the "match" in this contig, so this repeat element covers 35996-35887+1 (i.e., 110 bp) of the contig. Right?

(2) What do "Name=species" and "Target=species" mean?

(3) "genus" shows the type of repeat element, right? Then what does "%" mean in "DNA%2FhAT-Charlie"?

(4) What do "48" and "161" mean? Are they the coordinates of the "match" in the repeat element?

Examples:

CasCan_contig_16053 repeatmasker match 35887 35996 423 + . ID=CasCan_contig_16053:hit:51261:1.3.0.0;Name=species:Charlie4z|genus:DNA%2FhAT-Charlie;Target=species:Charlie4z|genus:DNA%2FhAT-Charlie 48 161 +
CasCan_contig_16053 repeatmasker match_part 35887 35996 423 + . ID=CasCan_contig_16053:hsp:120045:1.3.0.0;Parent=CasCan_contig_16053:hit:51261:1.3.0.0;Target=species:Charlie4z|genus:DNA%252FhAT-Charlie 48 161 +
CasCan_contig_16053 repeatmasker match 36842 37881 2546 + . ID=CasCan_contig_16053:hit:51262:1.3.0.0;Name=species:L1MC1_EC|genus:LINE%2FL1;Target=species:L1MC1_EC|genus:LINE%2FL1 5384 6062 +
CasCan_contig_16053 repeatmasker match_part 36842 37881 2546 + . ID=CasCan_contig_16053:hsp:120046:1.3.0.0;Parent=CasCan_contig_16053:hit:51262:1.3.0.0;Target=species:L1MC1_EC|genus:LINE%252FL1 5384 6062 +

Best
Quanwei
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From qwzhang0601 at gmail.com  Thu Jul 27 15:25:19 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Thu, 27 Jul 2017 16:25:19 -0400
Subject: [maker-devel] update the repeatMasker
Message-ID: 

Hello:

We have installed maker2 on our server. The existing RepeatMasker version is too old, so I plan to install the latest version. What do I need to do so that maker2 calls the latest version of RepeatMasker rather than the existing old one?

Many thanks

Best
Quanwei
-------------- next part --------------
An HTML attachment was scrubbed...
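The repeatmasker records quoted in the repeat annotation question above can be unpacked with a few lines of Python. This is a rough sketch, not an official description of MAKER's output. Two things are standard GFF3 rather than guesses: coordinates are 1-based and inclusive (so 35887..35996 is 110 bp), and column 9 values are percent-encoded, so `%2F` stands for "/" and `%252F` is the same character encoded twice (`%25` is "%"). The `Target` attribute follows the GFF3 layout "id start end [strand]". Reading `species:` as the repeat name and `genus:` as its class/family is an assumption based on the values shown, and the function names are hypothetical.

```python
from urllib.parse import unquote


def fully_unquote(text):
    """Percent-decode repeatedly, so double-encoded '%252F' also becomes '/'."""
    decoded = unquote(text)
    while decoded != text:
        text, decoded = decoded, unquote(decoded)
    return decoded


def describe_repeat(gff_line):
    """Summarise one tab-separated repeatmasker match/match_part record."""
    cols = gff_line.rstrip("\n").split("\t")
    start, end = int(cols[3]), int(cols[4])
    attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
    # GFF3 Target attribute: "<target_id> <start> <end> [strand]"
    target_id, t_start, t_end = attrs["Target"].split(" ")[:3]
    # Assumed layout of the id: "species:<name>|genus:<class/family>"
    labels = dict(part.split(":", 1) for part in fully_unquote(target_id).split("|"))
    return {
        "genome_span_bp": end - start + 1,                 # inclusive coordinates
        "repeat_span_bp": int(t_end) - int(t_start) + 1,   # span on the repeat model
        "labels": labels,
    }


if __name__ == "__main__":
    record = (
        "CasCan_contig_16053\trepeatmasker\tmatch_part\t35887\t35996\t423\t+\t.\t"
        "ID=CasCan_contig_16053:hsp:120045:1.3.0.0;"
        "Parent=CasCan_contig_16053:hit:51261:1.3.0.0;"
        "Target=species:Charlie4z|genus:DNA%252FhAT-Charlie 48 161 +"
    )
    print(describe_repeat(record))
```

For repeat statistics, summing `genome_span_bp` over `match` features (not `match_part`, which would count the same bases again) and grouping by the decoded class label is one plausible starting point.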
URL: From qwzhang0601 at gmail.com Thu Jul 27 15:25:19 2017 From: qwzhang0601 at gmail.com (Quanwei Zhang) Date: Thu, 27 Jul 2017 16:25:19 -0400 Subject: [maker-devel] update the repeatMasker Message-ID: Hello: We had installed the maker2 on our server. We found the existing repeatMasker version is too old, so I plan to install the latest version repeatMasker. I wonder what I need to do to let the maker2 call the latest version of repeatMasker rather than the existing old version? Many thanks Best Quanwei -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Sat Jul 29 17:20:10 2017 From: carsonhh at gmail.com (Carson Holt) Date: Sat, 29 Jul 2017 16:20:10 -0600 Subject: [maker-devel] update the repeatMasker In-Reply-To: References: Message-ID: <47CF01D4-8146-4A48-9FEE-E9BE75A03C82@gmail.com> The location of executables used is found in the maker_exe.ctl file. Default values in that file are drawn from your PATH environmental variable. You need to add the new RepeatMasker installation to your PATH. Also if maker_exe.ctl predates the change to the path, you will need to manually set the location in that file. Thanks, Carson > On Jul 27, 2017, at 2:25 PM, Quanwei Zhang wrote: > > Hello: > > We had installed the maker2 on our server. We found the existing repeatMasker version is too old, so I plan to install the latest version repeatMasker. I wonder what I need to do to let the maker2 call the latest version of repeatMasker rather than the existing old version? 
> > Many thanks > > Best > Quanwei > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org From qwzhang0601 at gmail.com Mon Jul 31 11:42:04 2017 From: qwzhang0601 at gmail.com (Quanwei Zhang) Date: Mon, 31 Jul 2017 12:42:04 -0400 Subject: [maker-devel] repeats masking Message-ID: Hello: We are using the Maker2 pipeline to annotating a new genome. We just read something about the repeat masking from repeatMasker's documents. It suggests to leave low complexity region unmasked and to do gene annotation using both masked and unmasked genome. I wonder what your opinion and suggestions on this? Many thanks The paragraph below is from http://www.binfo.ncku.edu.tw/RM/webrepeatmaskerhelp.html Use in association with gene prediction programs Predicting genes from a masked sequence faces several problems. First, one should not mask low complexity regions, e.g. to avoid masking trinucleotide repeats in coding regions. But even with only interspersed repeats masked, gene prediction programs may fail to identify exons correctly. As mentioned above, sometimes tail ends of coding regions may have originated from transposable elements. Even if no coding regions have been masked, splice sites may be compromised; e.g. the polypyrimidine region that is part of the acceptor splice site may be contained within a repeat. Thus, I generally recommend to run a gene prediction program on unmasked DNA (as well) and compare the predicted genes and exons with the RepeatMasker output. Some gene prediction program allow you to force certain exons out of the predictions (e.g. often the old ORFs of LINE1 elements and endogenous retroviruses are included in genes). Work is also in progress at several sites to incorporate RepeatMasker into gene prediction programs, in which cases matches to repeats are weighted in along with the other parameters used. 
Best

Quanwei
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From carsonhh at gmail.com  Mon Jul 31 11:48:44 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 31 Jul 2017 10:48:44 -0600
Subject: [maker-devel] repeats masking
In-Reply-To: 
References: 
Message-ID: 

MAKER uses the masking primarily for the evidence alignment step. Low-complexity regions are soft masked, which means alignments can extend through them but must seed outside of the masked region first. Successful BLAST alignments are then polished using exonerate on the unmasked region.

Also, for the gene predictors, the first run is done with hard masking of the transposons only, so they can still predict in low-complexity regions. The second round of hint-based prediction is done on the unmasked assembly. So MAKER already handles all the issues you are mentioning.

--Carson
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From leo1985.arnab at gmail.com  Wed Jul 19 15:24:13 2017
From: leo1985.arnab at gmail.com (Arnab Ghosh)
Date: Wed, 19 Jul 2017 15:24:13 -0500
Subject: [maker-devel] BUSCO trained Augustus parameter for MAKER
Message-ID: 

Hello,

I am trying to annotate genes in a non-model organism. I am using MAKER for this purpose, have trained SNAP, and plan to use GeneMark and Augustus as well for the predictions.

After searching the maker-devel Google group, I understand BUSCO is a good choice for training Augustus and decided to give it a shot. From one of the earlier posts on BUSCO, I read that it is supposed to generate a species-specific folder in the config directory of Augustus. However, I did not get one. The BUSCO run finished fine, with the run_species folder and all the necessary tables, hmm folder and other files generated under the run_species folder.

This is the command I used to run BUSCO:

python run_BUSCO.py --in mySpecies.fasta --out my_species --long --lineage_path /usr/home/aves_odb9 --mode genome

So there was no "my_species" folder generated under the config directory of Augustus. Did I miss something?
My question is-- what should I pass to the " *augustus_species"* variable now in the maker control file for the next run of maker ? Thanks so much for your time !! -------------- next part -------------- An HTML attachment was scrubbed... URL: From dandence at gmail.com Mon Jul 31 11:53:11 2017 From: dandence at gmail.com (Daniel Ence) Date: Mon, 31 Jul 2017 12:53:11 -0400 Subject: [maker-devel] repeats masking In-Reply-To: References: Message-ID: <6FCA8DEC-EA5E-489F-A83C-2BA792E1F77D@gmail.com> Hi Quanwei, Running maker on the unmasked genome will probably give you more genes, but won?t be helpful in the end. Maker soft-masks repeats, which prevents blast alignments from being seeded in the masked regions, but still allows them to extend into those regions. This solves the problem missing exons mentioned in the text you sent. There?s an option in the control file to run the ab-inition programs on the unmasked sequence (?unmask?) which is set to false (0) by default. Hope this helps, Daniel Ence > On Jul 31, 2017, at 12:42 PM, Quanwei Zhang wrote: > > Hello: > > We are using the Maker2 pipeline to annotating a new genome. We just read something about the repeat masking from repeatMasker's documents. It suggests to leave low complexity region unmasked and to do gene annotation using both masked and unmasked genome. I wonder what your opinion and suggestions on this? Many thanks > > > The paragraph below is from http://www.binfo.ncku.edu.tw/RM/webrepeatmaskerhelp.html > Use in association with gene prediction programs > > Predicting genes from a masked sequence faces several problems. First, one should not mask low complexity regions, e.g. to avoid masking trinucleotide repeats in coding regions. But even with only interspersed repeats masked, gene prediction programs may fail to identify exons correctly. As mentioned above, sometimes tail ends of coding regions may have originated from transposable elements. 
> Even if no coding regions have been masked, splice sites may be compromised; e.g. the polypyrimidine region that is part of the acceptor splice site may be contained within a repeat.
>
> Thus, I generally recommend to run a gene prediction program on unmasked DNA (as well) and compare the predicted genes and exons with the RepeatMasker output. Some gene prediction program allow you to force certain exons out of the predictions (e.g. often the old ORFs of LINE1 elements and endogenous retroviruses are included in genes). Work is also in progress at several sites to incorporate RepeatMasker into gene prediction programs, in which cases matches to repeats are weighted in along with the other parameters used.
>
> Best
>
> Quanwei

From qwzhang0601 at gmail.com Mon Jul 31 11:59:02 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Mon, 31 Jul 2017 12:59:02 -0400
Subject: [maker-devel] repeats masking
In-Reply-To:
References:

Message-ID: Hi Carson: I see. Thank you for your explanation! Best Quanwei 2017-07-31 12:48 GMT-04:00 Carson Holt : > MAKER uses the masking primarily for the evidence alignment step. Low > complexity regions are soft masked which means alignments can extend > through them but must seed outside of the masked region first. Successful > BLAST alignments are then polished using exonerate on the unmasked region. > > Also for the gene predictor, the first run is done with hard masking of > the transposons only. So they can still predict in low complexity regions. > The second round of hint based prediction is done on the unmasked assembly. > So MAKER already handles all the issues you are mentioning. > > --Carson > > > > > > On Jul 31, 2017, at 10:42 AM, Quanwei Zhang wrote: > > Hello: > > We are using the Maker2 pipeline to annotating a new genome. We just read > something about the repeat masking from repeatMasker's documents. It > suggests to leave low complexity region unmasked and to do gene annotation > using both masked and unmasked genome. I wonder what your opinion and > suggestions on this? Many thanks > > > The paragraph below is from http://www.binfo.ncku.edu.tw/ > RM/webrepeatmaskerhelp.html > Use in association with gene prediction programs > > Predicting genes from a masked sequence faces several problems. First, > one should not mask low complexity regions, e.g. to avoid masking > trinucleotide repeats in coding regions. But even with only interspersed > repeats masked, gene prediction programs may fail to identify exons > correctly. As mentioned above, sometimes tail ends of coding regions may > have originated from transposable elements. Even if no coding regions have > been masked, splice sites may be compromised; e.g. the polypyrimidine > region that is part of the acceptor splice site may be contained within a > repeat. 
>
> Thus, I generally recommend to run a gene prediction program on unmasked
> DNA (as well) and compare the predicted genes and exons with the
> RepeatMasker output. Some gene prediction program allow you to force
> certain exons out of the predictions (e.g. often the old ORFs of LINE1
> elements and endogenous retroviruses are included in genes). Work is also
> in progress at several sites to incorporate RepeatMasker into gene
> prediction programs, in which cases matches to repeats are weighted in
> along with the other parameters used.
>
> Best
>
> Quanwei

From carsonhh at gmail.com Mon Jul 31 12:02:21 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 31 Jul 2017 11:02:21 -0600
Subject: [maker-devel] repeats masking
In-Reply-To: <6FCA8DEC-EA5E-489F-A83C-2BA792E1F77D@gmail.com>
References: <6FCA8DEC-EA5E-489F-A83C-2BA792E1F77D@gmail.com>
Message-ID: <256BF853-3B43-4975-9D5D-9D4F30A27AC3@gmail.com>

Please note that the unmask option Dan is talking about is a feature to run both masked and unmasked raw predictions in the first round of prediction (it does not affect alignment or the second round of prediction). It tends to increase the false positive rate, but it can be a quick test when you believe you are missing a gene because of overmasking from a user-created library and protein/EST evidence is overly sparse (so the gene cannot be recovered through evidence alignment and the second round of unmasked prediction).

--Carson

> On Jul 31, 2017, at 10:53 AM, Daniel Ence wrote:
>
> Hi Quanwei, Running maker on the unmasked genome will probably give you more genes, but won't be helpful in the end.
Maker soft-masks repeats, which prevents blast alignments from being seeded in the masked regions, but still allows them to extend into those regions. This solves the problem missing exons mentioned in the text you sent. There?s an option in the control file to run the ab-inition programs on the unmasked sequence (?unmask?) which is set to false (0) by default. > > Hope this helps, > Daniel Ence > > >> On Jul 31, 2017, at 12:42 PM, Quanwei Zhang > wrote: >> >> Hello: >> >> We are using the Maker2 pipeline to annotating a new genome. We just read something about the repeat masking from repeatMasker's documents. It suggests to leave low complexity region unmasked and to do gene annotation using both masked and unmasked genome. I wonder what your opinion and suggestions on this? Many thanks >> >> >> The paragraph below is from http://www.binfo.ncku.edu.tw/RM/webrepeatmaskerhelp.html >> Use in association with gene prediction programs >> >> Predicting genes from a masked sequence faces several problems. First, one should not mask low complexity regions, e.g. to avoid masking trinucleotide repeats in coding regions. But even with only interspersed repeats masked, gene prediction programs may fail to identify exons correctly. As mentioned above, sometimes tail ends of coding regions may have originated from transposable elements. Even if no coding regions have been masked, splice sites may be compromised; e.g. the polypyrimidine region that is part of the acceptor splice site may be contained within a repeat. >> >> Thus, I generally recommend to run a gene prediction program on unmasked DNA (as well) and compare the predicted genes and exons with the RepeatMasker output. Some gene prediction program allow you to force certain exons out of the predictions (e.g. often the old ORFs of LINE1 elements and endogenous retroviruses are included in genes). 
Work is also in progress at several sites to incorporate RepeatMasker into gene prediction programs, in which cases matches to repeats are weighted in along with the other parameters used.
>>
>> Best
>>
>> Quanwei

From carsonhh at gmail.com Mon Jul 31 12:09:58 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 31 Jul 2017 11:09:58 -0600
Subject: [maker-devel] BUSCO trained Augustus parameter for MAKER
In-Reply-To:
References:
Message-ID: <453DB44C-2277-4939-A0BA-0DE083F3E380@gmail.com>

Hello Arnab,

Perhaps someone on the list will chime in with an answer for you, but you may also want to post directly to BUSCO, since your question is entirely related to BUSCO and you are more likely to get a quick response there.

--> https://gitlab.com/ezlab/busco/issues

Thanks,
Carson

> On Jul 19, 2017, at 2:24 PM, Arnab Ghosh wrote:
>
> Hello,
>
> I am trying to annotate genes in a non-model organism. I am using MAKER for this purpose and have trained SNAP and plan on Genemark and Augustus as well for the predictions.
>
> After searching the google group on MAKER-devel I understand BUSCO is a good choice for training Augustus and decided to give it a shot. From one of the earlier posts on BUSCO, I read that it is supposed to generate a species name specific folder in the config directory of Augustus. However I did not get any. The BUSCO run finished fine with the run_species folder and all necessary table, hmm folder and other files etc. generated under the run_species folder.
>
> Following was the command i used to run BUSCO:
>
> python run_BUSCO.py --in mySpecies.fasta --out my_species --long --lineage_path /usr/home/aves_odb9 --mode genome
>
> So there was no "my_species" folder generated under the config directory of augustus.
>
> Did I miss something ? My question is-- what should I pass to the "augustus_species" variable now in the maker control file for the next run of maker ?
>
> Thanks so much for your time !!

From qwzhang0601 at gmail.com Mon Jul 31 18:02:29 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Mon, 31 Jul 2017 19:02:29 -0400
Subject: [maker-devel] Pseudogene identification
Message-ID:

Hello:

We used Maker2 to annotate a new rodent genome. Using the annotated genes, we ran a gene family expansion analysis and found several gene families that are expanded in the new rodent genome. We want to check whether some of the annotated genes are pseudogenes, which could account for the apparent expansion. Do you have any suggestions on this?

We found that MAKER-P can annotate pseudogenes, but we are not sure whether it is worth repeating our annotation with MAKER-P. We are also not sure whether the default parameters of MAKER-P are suitable for a rodent species. Moreover, as I understand it, MAKER-P identifies pseudogenes in intergenic space, so I think the already-annotated coding genes would not be tested.

Do you have any suggestions to solve our problem? We do not want to identify pseudogenes genome-wide; we only want to check the genes that show expansion (to make sure all of those gene copies are really functional).

Many thanks

Best
Quanwei

From xvazquezc at gmail.com Mon Jul 31 18:46:52 2017
From: xvazquezc at gmail.com (Xabier Vázquez-Campos)
Date: Tue, 1 Aug 2017 09:46:52 +1000
Subject: [maker-devel] BUSCO trained Augustus parameter for MAKER
In-Reply-To: <453DB44C-2277-4939-A0BA-0DE083F3E380@gmail.com>
References: <453DB44C-2277-4939-A0BA-0DE083F3E380@gmail.com>
Message-ID:

AFAIK, the profile is in the folder run_my_species/augustus_output/retraining_parameters, and the files inside look like BUSCO_my_species_[a number]_*.*

Create a folder in the Augustus species directory and copy these files into it. Note that the folder and the files in it must share the same name prefix, e.g. BUSCO_my_species_[a number]; otherwise, whenever you call the profile, Augustus is going to freak out. The prefix is the value to use for augustus_species.

On 1 August 2017 at 03:09, Carson Holt wrote:

> Hello Arnab,
>
> Perhaps someone on the list will chime in with an answer for you, but you
> may also want to post to direct to BUSCO since your question is entirely
> related to BUSCO and you are more likely to get a quick response there.
>
> --> https://gitlab.com/ezlab/busco/issues
>
> Thanks,
> Carson
>
> On Jul 19, 2017, at 2:24 PM, Arnab Ghosh wrote:
>
> Hello,
>
> I am trying to annotate genes in a non-model organism. I am using MAKER
> for this purpose and have trained SNAP and plan on Genemark and Augustus as
> well for the predictions.
>
> After searching the google group on MAKER-devel I understand BUSCO is a
> good choice for training Augustus and decided to give it a shot. From one
> of the earlier posts on BUSCO, I read that it is supposed to generate a
> species name specific folder in the config directory of Augustus. However I
> did not get any. The BUSCO run finished fine with the run_species folder
> and all necessary table, hmm folder and other files etc. generated under
> the run_species folder.
>
> Following was the command i used to run BUSCO:
>
> python run_BUSCO.py --in mySpecies.fasta --out my_species --long
> --lineage_path /usr/home/aves_odb9 --mode genome
>
> So there was no "my_species" folder generated under the config directory
> of augustus.
>
> Did I miss something ? My question is-- what should I pass to the
> "augustus_species" variable now in the maker control file for the next
> run of maker ?
>
> Thanks so much for your time !!

--
Xabier Vázquez-Campos, PhD
Research Associate
NSW Systems Biology Initiative
School of Biotechnology and Biomolecular Sciences
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

From carsonhh at gmail.com Mon Jul 31 18:54:12 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 31 Jul 2017 17:54:12 -0600
Subject: [maker-devel] Pseudogene identification
In-Reply-To:
References:
Message-ID:

The MAKER-P fork was merged back into standard MAKER with version 2.29 (roughly 3 years ago; a separate download no longer exists). This is because MAKER-P's functionality is almost entirely in accessory scripts and written protocols. The .../maker/bin/maker called by both MAKER2 and MAKER-P is actually the same script. So there is no need to rerun: if you are using version 2.29 or later, you already ran it.
Pseudogene calling is therefore handled by accessory scripts and protocols you can find here --> http://shiulab.plantbiology.msu.edu/wiki/index.php/Protocol:Pseudogene

The other MAKER-P protocols can be found here --> http://www.yandell-lab.org/software/maker-p.html

--Carson

> On Jul 31, 2017, at 5:02 PM, Quanwei Zhang wrote:
>
> Hello:
>
> We used Maker2 to annotate a new rodent genome. By using the annotated genes we did gene family expansion analysis, and found several gene families under expansion in the new rodent genome. But we want to check whether some annotated genes are Pseudogenes, which lead to the expansion. Do you have any suggestions on this?
>
> We found the Maker-P can annotate Pseudogene, but we are not sure whether it is worth to repeat our annotation with Maker-P. Besides, we are not sure whether the default parameters of Maker-P are good for a rodent species. What's more, in my understanding the Maker-P will identify Pseudogenes in the intergenic spaces (which I think the annotated coding genes will be not be tested and checked).
>
> Do you have any suggestions to solve our problem? We do not want to identify Pseudogene on the genome wide, but only want to check those genes showing expansion (to make sure all those gene copies really function).
>
> Many thanks
>
> Best
> Quanwei

From patrick.tranvan at unil.ch Sat Jul 1 05:21:37 2017
From: patrick.tranvan at unil.ch (Patrick Tran Van)
Date: Sat, 1 Jul 2017 11:21:37 +0000
Subject: [maker-devel] Advice on my pipeline
In-Reply-To: <696C51C6-5606-4ECB-A8B8-9C077182FFFA@gmail.com>
References: <6b029690bace4d3fbae77c0bb1bddce8@prdexch02.ad.unil.ch> <1498470630221.84642@unil.ch>, <696C51C6-5606-4ECB-A8B8-9C077182FFFA@gmail.com>
Message-ID: <1498908228256.16549@unil.ch>

So I have assembled my transcriptome with Trinity using the jaccard_clip option, and I have run MAKER with and without correct_est_fusion.

I have then used SNAP to train/filter it with:

maker2zff specie.all.gff

Here are my results (number of genes after MAKER -> number of genes after maker2zff):

- Without correct_est_fusion: 21621 -> 13875
- With correct_est_fusion: 16850 -> 9098

1) If I understand correctly how correct_est_fusion works, it prevents gene merging, so shouldn't it be the other way around? Shouldn't I normally find more genes with correct_est_fusion?

2) I think I should find something like 13000-14000 genes for my species. Should I go with the "without correct_est_fusion" run for the 2nd iteration of MAKER?

Thanks for your help

Patrick Tran Van
Groups Chapuisat, Robinson-Rechavi & Schwander
Department of Ecology and Evolution
University of Lausanne
Le Biophore
CH-1015 Lausanne
Switzerland
Office 3206
________________________________
From: Carson Holt
Sent: Monday, June 26, 2017 11:38 PM
To: Patrick Tran Van
Cc: maker-devel at yandell-lab.org
Subject: Re: [maker-devel] Advice on my pipeline

Sorry the option is --> correct_est_fusion

It is in the maker_opts.ctl file.

I would use both SNAP and Augustus on a few large contigs then review the results manually. If one of them is not behaving well, then drop it. If both behave well (i.e. correlate well with evidence alignments) then keep them both.

--Carson

On Jun 26, 2017, at 3:48 AM, Patrick Tran Van wrote:

Thanks for your answer.
1) Do you think that adding a Augustus training in addition to SNAP at the step 3 and 5 will add more confidence (instead of adding Augustus only for the final round) ? Because I am using autoAug for this and it tooks a while to compute ..

2) I don't see this option : 'avoid_est_fusion=1' . I have tried to add it but I got this error:

WARNING: Invalid option 'avoid_est_fusion' in control file maker_opts.ctl

(I am using v 2.31.8 )

Patrick Tran Van
Groups Chapuisat, Robinson-Rechavi & Schwander
Department of Ecology and Evolution
University of Lausanne
Le Biophore
CH-1015 Lausanne
Switzerland
Office 3206
________________________________
From: Carson Holt
Sent: Monday, June 5, 2017 8:29 PM
To: Patrick Tran Van
Cc: maker-devel at yandell-lab.org
Subject: Re: [maker-devel] Advice on my pipeline

Your plan sounds good. A couple of related notes.

Insect genomes tend to have high gene density, so gene merging will be the primary difficulty. You can avoid merging of mRNA-seq evidence by using options like jaccard_clip in Trinity. Then use avoid_est_fusion=1 inside of MAKER.

Also it is more convenient to do each run in the same directory rather than supplying the previous run as GFF3 input. MAKER will automatically recycle previous results archived in the run directory when you do this. Using the maker_gff option is really more for getting data into the run from jobs performed a long time ago (so they can't be run in the same directory).

--Carson

On Jun 2, 2017, at 3:56 AM, Patrick Tran Van wrote:

Hello,

This is my first time running Maker for an insect genome annotation.
I have found various resources and tried to make a consensus, I am looking for your thoughts and advices about my pipeline, if I can improve something or doing useless things:

What I have:
- RNA evidence: transcriptome
- Proteine evidence: swissprot/uniprot + busco protein set of insect
- Cegma and busco results of my genome

1) Train SNAP with CEGMA

2) Run (run A) maker with repeat masking with transcript, protein, the new SNAP file (from step 1) and augustus file (from busco).

3) Create SNAP model from run A.

4) Run (run B) with the new SNAP (done at step 3) with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (maker_gff=run_A.gff), turn off repeat masking (rm_pass=1), and use previous mapping results (altest_pass=1 and protein_pass=1).

5) Create SNAP model from run B.

6) Run (run C) with the new SNAP (done at step 5) with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (maker_gff=run_B.gff), turn off repeat masking (rm_pass=1), and use previous mapping results (altest_pass=1 and protein_pass=1).

7) Create SNAP model from run C AND Create Augustus gene model from run C

8) Run (run D) with the new SNAP (done at step 7) + AUGUSTUS file (step 7) with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (maker_gff=run_C.gff), turn off repeat masking (rm_pass=1), and use previous mapping results (altest_pass=1 and protein_pass=1). + Use keep_preds=1

Does it seems coherent ?

Cheers,

Patrick Tran Van
Groups Chapuisat, Robinson-Rechavi & Schwander
Department of Ecology and Evolution
University of Lausanne
Le Biophore
CH-1015 Lausanne
Switzerland
Office 3206

From carsonhh at gmail.com Sat Jul 1 11:41:28 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Sat, 1 Jul 2017 11:41:28 -0600
Subject: [maker-devel] maker with MPI and perl using threads
In-Reply-To:
References: <9218f472-d9e4-179c-c712-ddf09e97d8fc@hovedpuden.dk>

Message-ID:

FindBin is necessary for library control and is safe to load before forks.

--Carson

> On Jul 1, 2017, at 11:38 AM, John Damm Sørensen wrote:
>
> Thanks Carson,
>
> One thing bothers me. That's this from Perl forks documentation:
>
> module load order: forks first
>
> Since forks overrides core Perl functions, you are *strongly* encouraged to load the forks module before any other Perl modules. This will insure the most consistent and stable system behavior. This can be easily done without affecting existing code, like:
>
> perl -Mforks script.pl
>
> But in the maker perlscript the module FindBin that in turn loads a bunch of other modules is loaded before forks.
>
> Is that intentional?
>
> Best
>
> John
>
> On 29-06-2017 at 22:56, Carson Holt wrote:
>> Also when you see calls like threads->new() in the maker script. It is creating a fork and not a thread. It's a convenience feature activated by the "use forks" line at the top of the script. It allows you to use thread like syntax when working with forks.
>>
>> --Carson
>>
>>> On Jun 29, 2017, at 2:43 PM, Carson Holt wrote:
>>>
>>> MAKER doesn't use threads. It uses forks. There are several reasons for this, including that Perl already requires you to use hidden fork operations every time you call system() or open(). So trying to get forks, threads, and MPI together is just a mess. So we stuck with just forks and MPI. With MPI flavors with direct infiniband support you can get weird errors because of how forks affect registered memory when the MPI flavor uses OpenIB libraries. There is also another issue with how perl wraps malloc calls that affects registered memory. The best way around both these issues is to disable direct infiniband support, and then set the MPI flavor to use -tcp over the ip-over-infiniband virtual adaptor (usually ib0) or use the ethernet adapter.
Sometime last year after a system update we also got an OpenMPI error (only on CentOS6) that referred to a file used for Perl threads (which should not even be in use since we are using forks). We worked around that issue by compile Perl without thread support so that the library couldn't be called. >>> >>> If you were using OpenMPI and were seeing a reference to Perl threads, then the error you saw may have been related to the latter one I mentioned. >>> >>> I have tried the MPI_Init_thread option before and had run into issues because of it without it helping any of the previously mentioned fork related issues. But that was some time ago, so I could try it as a solution to the last issue I mentioned rather than installing a no-thread version of perl (if I can ever replicate the error because it went away when we updated from CentOS kernel 6 to kernel 7). >>> >>> Thanks, >>> Carson >>> >>> >>>> On Jun 28, 2017, at 2:54 AM, John Damm S?rensen wrote: >>>> >>>> Hello, >>>> >>>> Recently I assisted one of my customers with problems solving maker using MPI. >>>> >>>> It seems that the main reason for the trouble was maker not initializing the MPI environment for thread save execution. >>>> >>>> In the MPI.pm module you call MPI_Init whereas for a threaded environment you should call MPI_Init_thread. >>>> >>>> I think it would be a good idea to detect whether the users Perl is with thread support and init MPI accordingly or clearly state that maker is for unthreaded Perl only. 
>>>>
>>>> During the debugging we also found that it was beneficial to have the latest mxm.c installed:
>>>>
>>>> https://community.mellanox.com/thread/3439
>>>>
>>>> Best Regards
>>>>
>>>> John Damm Sørensen
>>>>
>>>> IT consultant

From john at hovedpuden.dk Sat Jul 1 11:38:14 2017
From: john at hovedpuden.dk (John Damm Sørensen)
Date: Sat, 1 Jul 2017 19:38:14 +0200
Subject: [maker-devel] maker with MPI and perl using threads
In-Reply-To:
References: <9218f472-d9e4-179c-c712-ddf09e97d8fc@hovedpuden.dk>

Message-ID:

Thanks Carson,

One thing bothers me. This is from the Perl forks documentation:

module load order: forks first

Since forks overrides core Perl functions, you are *strongly* encouraged to load the forks module before any other Perl modules. This will insure the most consistent and stable system behavior. This can be easily done without affecting existing code, like:

perl -Mforks script.pl

But in the maker Perl script, the module FindBin, which in turn loads a bunch of other modules, is loaded before forks.

Is that intentional?

Best

John

On 29-06-2017 at 22:56, Carson Holt wrote:
> Also when you see calls like threads->new() in the maker script. It is creating a fork and not a thread. It's a convenience feature activated by the "use forks" line at the top of the script. It allows you to use thread like syntax when working with forks.
>
> --Carson
>
>> On Jun 29, 2017, at 2:43 PM, Carson Holt wrote:
>>
>> MAKER doesn't use threads. It uses forks. There are several reasons for this, including that Perl already requires you to use hidden fork operations every time you call system() or open(). So trying to get forks, threads, and MPI together is just a mess. So we stuck with just forks and MPI. With MPI flavors with direct infiniband support you can get weird errors because of how forks affect registered memory when the MPI flavor uses OpenIB libraries. There is also another issue with how perl wraps malloc calls that affects registered memory. The best way around both these issues is to disable direct infiniband support, and then set the MPI flavor to use -tcp over the ip-over-infiniband virtual adaptor (usually ib0) or use the ethernet adapter. Sometime last year after a system update we also got an OpenMPI error (only on CentOS6) that referred to a file used for Perl threads (which should not even be in use since we are using forks). We worked around that issue by compiling Perl without thread support so that the library couldn't be called.
>> >> If you were using OpenMPI and were seeing a reference to Perl threads, then the error you saw may have been related to the latter one I mentioned. >> >> I have tried the MPI_Init_thread option before and had run into issues because of it without it helping any of the previously mentioned fork related issues. But that was some time ago, so I could try it as a solution to the last issue I mentioned rather than installing a no-thread version of perl (if I can ever replicate the error because it went away when we updated from CentOS kernel 6 to kernel 7). >> >> Thanks, >> Carson >> >> >>> On Jun 28, 2017, at 2:54 AM, John Damm S?rensen wrote: >>> >>> Hello, >>> >>> Recently I assisted one of my customers with problems solving maker using MPI. >>> >>> It seems that the main reason for the trouble was maker not initializing the MPI environment for thread save execution. >>> >>> In the MPI.pm module you call MPI_Init whereas for a threaded environment you should call MPI_Init_thread. >>> >>> I think it would be a good idea to detect whether the users Perl is with thread support and init MPI accordingly or clearly state that maker is for unthreaded Perl only. 
>>>
>>> During the debugging we also found that it was beneficial to have the latest mxm.c installed:
>>>
>>> https://community.mellanox.com/thread/3439
>>>
>>> Best Regards
>>>
>>> John Damm Sørensen
>>>
>>> IT consultant

From carsonhh at gmail.com Mon Jul 3 14:50:21 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 3 Jul 2017 14:50:21 -0600
Subject: [maker-devel] Advice on my pipeline
In-Reply-To: <1498908228256.16549@unil.ch>
References: <6b029690bace4d3fbae77c0bb1bddce8@prdexch02.ad.unil.ch> <1498470630221.84642@unil.ch> <696C51C6-5606-4ECB-A8B8-9C077182FFFA@gmail.com> <1498908228256.16549@unil.ch>
Message-ID: <58E904BF-9AB8-4AC7-B10B-C902F414E03D@gmail.com>

maker2zff is just for SNAP training and not for gene filtering (please do not use it for filtering; it does not do what you think). So the final annotation set after MAKER with correct_est_fusion is 16,850. To decide which set is better, look at them in a browser (gene counts are not useful for gauging the result). A well-annotated genome will have evidence clusters that closely match the final models. A poorly annotated genome will have evidence clusters that are split or merged by the models.

The correct_est_fusion option does two things. It trims long overlapping UTR fragments, and it stops evidence clusters from being merged on BLASTP evidence alone (so gene predictors will get unmerged hint regions if clusters are split). You may also find that using jaccard_clip with Trinity has reduced sensitivity for the transcript data (you may lose things that were there before, but you now have better specificity, i.e. fewer false positives). Make sure you provide protein data from at least two related species to help recover the sensitivity lost from the transcript data.
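
[Editor's sketch, not part of the original message.] Since maker2zff output should not be read as the annotation set, one way to get the size of the final set is to count gene features directly in the merged MAKER GFF3. The snippet below is a minimal sketch under stated assumptions: the function name count_gene_features is hypothetical, and it assumes the usual GFF3 layout (nine tab-separated columns, source in column 2, feature type in column 3, final MAKER models carrying the source "maker", and an optional ##FASTA section at the end). As noted above, the count alone says nothing about annotation quality.

```python
from collections import Counter


def count_gene_features(gff_lines):
    """Count 'gene' features per source column in an iterable of GFF3 lines.

    Assumption: standard 9-column, tab-separated GFF3. Parsing stops at the
    '##FASTA' directive so embedded sequence is never read as annotation.
    """
    counts = Counter()
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # everything after this directive is sequence data
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments, and other directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 9 and cols[2] == "gene":
            counts[cols[1]] += 1  # key by source, e.g. "maker"
    return counts
```

For a real run, the lines would come from the merged GFF3 file (for example, `with open("specie.all.gff") as fh: count_gene_features(fh)`), and the "maker" entry would be the number to compare between the runs with and without correct_est_fusion.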
You can also add rejected gene models back in after the fact by using iprscan to identify unsupported models with identifiable protein domains --> https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4286374/

Thanks,
Carson

> On Jul 1, 2017, at 5:21 AM, Patrick Tran Van wrote:
>
> So I have assembled my transcriptome with Trinity using the jaccard clip option and I have run maker with and without corrected_est_fusion.
>
> I have then use SNAP to train/filter it with:
>
> maker2zff specie.all.gff
>
> Here are my results:
>
> Number of gene after maker -> Number of gene after maker2zff
>
> - Without corrected_est_fusion: 21621 -> 13875
> - With corrected_est_fusion: 16850 -> 9098
>
> 1 )If I understand well how works corrected_est_fusion, because it prevents gene merging, shouldn't be the invert ?
> Normally I should find more genes with corrected_est_fusion right ?
>
> 2) I think I should find something like 13000-14000 genes for my specie. SHould I go with the "Without corrected_est_fusion" for the 2nd iteration of maker ?
>
> Thanks for your help
>
> Patrick Tran Van
>
> Groups Chapuisat, Robinson-Rechavi & Schwander
> Department of Ecology and Evolution
> University of Lausanne
> Le Biophore
> CH-1015 Lausanne
> Switzerland
> Office 3206
>
> From: Carson Holt
> Sent: Monday, June 26, 2017 11:38 PM
> To: Patrick Tran Van
> Cc: maker-devel at yandell-lab.org
> Subject: Re: [maker-devel] Advice on my pipeline
>
> Sorry the option is --> correct_est_fusion
>
> It is in the maker_opts.ctl file.
>
> I would use both SNAP and Augustus on a few large contigs then review the results manually. If one of them is not behaving well, then drop it. If both behave well (i.e. correlate well with evidence alignments) then keep them both.
>
> --Carson
>
>> On Jun 26, 2017, at 3:48 AM, Patrick Tran Van wrote:
>>
>> Thanks for your answer.
>> >> 1) Do you think that adding a Augustus training in addition to SNAP at the step 3 and 5 will add more confidence (instead of adding Augustus only for the final round) ? >> Because I am using autoAug for this and it tooks a while to compute .. >> >> 2) I don't see this option : 'avoid_est_fusion=1' . I have tried to add it but I got this error: >> >> WARNING: Invalid option 'avoid_est_fusion' in control file maker_opts.ctl >> >> (I am using v 2.31.8 ) >> >> >> Patrick Tran Van >> >> Groups Chapuisat, Robinson-Rechavi & Schwander >> Department of Ecology and Evolution >> University of Lausanne >> Le Biophore >> CH-1015 Lausanne >> Switzerland >> Office 3206 >> >> From: Carson Holt > >> Sent: Monday, June 5, 2017 8:29 PM >> To: Patrick Tran Van >> Cc: maker-devel at yandell-lab.org >> Subject: Re: [maker-devel] Advice on my pipeline >> >> Your plan sounds good. A couple of related notes. >> >> Insect genomes tend to have high gene density, so gene merging will be the primary difficulty. You can avoid merging of mRNA-seq evidence by using options like jaccard_clip in Trinity. Then use avoid_est_fusion=1 inside of MAKER. >> >> Also it is more convenient to do each run in the same directory rather than supplying the previous run as GFF3 input. MAKER will automatically recycle previous results archived in the run directory when you do this. Using the maker_gff option is really more for getting data into the run from jobs performed a long time ago (so they can?t be run in the same directory). >> >> ?Carson >> >> >>> On Jun 2, 2017, at 3:56 AM, Patrick Tran Van > wrote: >>> >>> Hello, >>> >>> This is my first time running Maker for an insect genome annotation. 
>>> >>> I have found various resources and tried to make a consensus, I am looking for your thoughts and advices about my pipeline, if I can improve something or doing useless things: >>> >>> >>> What I have: >>> - RNA evidence: transcriptome >>> - Proteine evidence: swissprot/uniprot + busco protein set of insect >>> - Cegma and busco results of my genome >>> >>> >>> 1) Train SNAP with CEGMA >>> >>> 2) Run (run A) maker with repeat masking with transcript, protein, the new SNAP file (from step 1) and augustus file (from busco). >>> >>> 3) Create SNAP model from run A. >>> >>> 4) Run (run B ) with the new SNAP (done at step 3) with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (maker_gff=run_A.gff), turn off repeat masking (rm_pass=1), and use previous mapping results (altest_pass=1 and protein_pass=1). >>> >>> 5) Create SNAP model from run B. >>> >>> 6) Run (run C) with the new SNAP (done at step 5) with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (maker_gff=run_B.gff), turn off repeat masking (rm_pass=1), and use previous mapping results (altest_pass=1 and protein_pass=1). >>> >>> 7) Create SNAP model from run C AND Create Augustus gene model from run C >>> >>> 8) Run (run D) with the new SNAP (done at step 7) + AUGUSTUS file (step 7) with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (maker_gff=run_C.gff), turn off repeat masking (rm_pass=1), and use previous mapping results (altest_pass=1 and protein_pass=1). + Use keep_preds=1 >>> >>> >>> >>> Does it seems coherent ? 
>>> >>> Cheers, >>> >>> Patrick Tran Van >>> >>> Groups Chapuisat, Robinson-Rechavi & Schwander >>> Department of Ecology and Evolution >>> University of Lausanne >>> Le Biophore >>> CH-1015 Lausanne >>> Switzerland >>> Office 3206 >>> >>> _______________________________________________ >>> maker-devel mailing list >>> maker-devel at box290.bluehost.com >>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org >> >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Mon Jul 3 15:04:40 2017 From: carsonhh at gmail.com (Carson Holt) Date: Mon, 3 Jul 2017 15:04:40 -0600 Subject: [maker-devel] Possible ways to improve annotated gene numbers In-Reply-To: <82A335D3-085B-414E-802C-8E2918EA7EA0@ucr.edu> References: <82A335D3-085B-414E-802C-8E2918EA7EA0@ucr.edu> Message-ID: <903B12C5-CC57-46F3-B3E6-1322C9155F2F@gmail.com> MAKER excludes models without evidence support (this is because gene predictors can overcall by as much as a factor of 10, i.e. lots of false positives). So you may be lacking in either protein or transcript evidence (you should alway supply a minimum of 2 related proteomes for any MAKER analysis - transcript evidence by itself is insufficient). You can also try and rescue models based on protein domain content using iprscan. Details in this protocol paper ?> https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4286374/ ?Carson > On Jun 30, 2017, at 1:30 PM, Qihua Liang wrote: > > Dear Maker Development Team, > > Hi, I am using Maker for annotation and BUSCO to evaluate the completeness. > > For de novo perditions, I am using Augustus, GeneMark, and SNAP, and the annotated proteins have completeness of ~80%, ~50%, ~50% correspondingly. When I cat all de novo annotated proteins of these three tools, the completeness is much higher as ~92%. > > But for all.maker.proteins.fasta, the completeness is only ~80%. > > 1. 
> Does this mean that some proteins annotated by Augustus/GeneMark/SNAP are not included in the file all.maker.proteins.fasta? Is it because such excluded proteins do not have hits with the EST evidence?
>
> 2. To achieve a higher BUSCO completeness, what possible ways can be used? Including more EST evidence from other species?
>
> Thank you
> Qihua

From xvazquezc at gmail.com Tue Jul 4 22:05:10 2017
From: xvazquezc at gmail.com (Xabier Vázquez-Campos)
Date: Wed, 5 Jul 2017 14:05:10 +1000
Subject: [maker-devel] advanced repeat libraries

Hi,

I'm dealing with a fungal genome with at least 40% repeats, so I'm trying to follow the advanced repeat construction protocol. So far, so good, but I have doubts about how to build the protein database as explained at the end of the page.

In summary:
1. Get SwissProt and RefSeq fungal proteins.
2. tblastn (from 1) against the EST-NCBI database and keep the matches.
3. blastp the output from 2 against the transposase protein db. Remove matches.

But from here on I'm a bit lost...

"Finally, the rice protein sequences were compared with verified transposons (such as Pack-MULEs) in the rice genome. If the protein sequence matched a transposon perfectly and was the only perfect match in the genome, the relevant protein sequence was excluded. Although elements such as Pack-MULEs contain true gene sequences, the annotation (the protein sequence in the database) often extends to non-gene sequences such as terminal inverted repeat or sub-terminal repeat, which are not true plant proteins and would cause great complications. As a result, it is essential to exclude them."

Are the proteins kept at the end of step 3 the 'protein database'?
Could you provide a bit more detail on how to tackle this?

Thank you in advance,
Xabi

--
Xabier Vázquez-Campos, *PhD*
*Research Associate*
NSW Systems Biology Initiative
School of Biotechnology and Biomolecular Sciences
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

From tfallon at mit.edu Thu Jul 6 06:45:20 2017
From: tfallon at mit.edu (Tim Fallon)
Date: Thu, 6 Jul 2017 08:45:20 -0400
Subject: [maker-devel] Maker protein match & tandem similar genes
References: <6AE52D93-3F7D-4389-BE7B-BD3033F3F316@mit.edu>
 <6DBD3F6E-C2DA-47AB-AF6B-DEA396789C79@gmail.com>
 <7164F2B4-FD8A-466F-B7CA-A1BEEF98223F@mit.edu>

Hi Carson,

This region is definitely entirely correct at the genomic nucleotide level, no misassemblies. Would you have any strong reservations about ditching the ab-initio prediction and sticking entirely with the est2genome predictions and protein2genome predictions? Right now this is what I'm thinking, as troubleshooting the ab-initio training seems like it could be a long road.

All the best,
-Tim

> On Jun 26, 2017, at 6:00 PM, Carson Holt wrote:
>
> Augustus uses an HMM with scoring bonuses for evidence match. If a difference in the assembly breaks the ORF anywhere in the transcript relative to the evidence or removes high scoring transcript start/stop sequences, then Augustus will add/skip exons or trim/extend transcripts to capture what scoring bonuses it can as best it can. So wherever you see Augustus behaving weirdly, you likely have something off in the assembly (small stretch of NN's or single basepair duplications/deletions that affect the ORF and scoring model). So what Augustus produces is the best fit gene model to hop around assembly anomalies while still producing a canonical model.
>
> In areas like the ones I describe above, EVM refuses to produce any model.
> So you can experiment with the EVM options in MAKER3, but what you may find is that problem regions tend to get no models with EVM. I believe using the pred_gff trick I mentioned previously may be the easiest workaround. Also make sure to prefilter mRNA-seq evidence to avoid transcript joining (Trinity has a jaccard_clip option which can help). Because if you are getting transcript joining in proteins, you are almost certain to get it in transcript evidence as well.
>
> --Carson
>
>> On Jun 22, 2017, at 10:59 PM, Tim Fallon wrote:
>>
>> Hi Carson,
>>
>> Thanks for the response! After sending my initial email, I did notice this particular issue was warned about in the Campbell et al. 2014 MAKER protocols paper. Perhaps future versions of the pipeline might have a workaround or warning for this presumably common issue. At least in my case, the genome I'm annotating has large introns, and also tandem gene clusters of homologous genes, so I've been unable to solve this issue entirely by changing existing parameters (e.g. split_hit), though perhaps exonerate / protein2genome direct gene annotation does handle it correctly.
>>
>> Regarding protein2genome only being an intermediate stage: as I've been working towards a final annotation, I've actually been mostly relying on the protein2genome direct gene annotation, because although I have a trained Augustus that is presumably getting the hints from the evidence, my main target genes have been producing subtly wrong gene models (Augustus produced splice sites off by a handful of nucleotides, leading to unintended and unsupported amino acids in the protein). I also trained SNAP, but those predictions were worse than the Augustus predictions.
>>
>> Do you have any tips for using the Evidence Modeler integration of the Maker 3.0.0 beta? That seems to be the best way to have the final gene models rely more on extrinsic evidence over my mildly incorrect ab-initio predictions.
>> Or perhaps PASA is more appropriate for gene models that would strictly adhere to extrinsic de novo assembled transcript / predicted ORF evidence?
>>
>> All the best,
>> -Tim
>>
>>> On Jun 23, 2017, at 12:27 AM, Carson Holt wrote:
>>>
>>> The protein_match features are the direct BLASTX results. Because of how BLAST works, if you have neighboring paralogs, it can place HSPs in both. So the final hit ends up being too large. The protein2genome feature is then the result of exonerate polishing these blast alignments (this will usually remove false merging and bad exon order). The protein2genome=1 option, on the other hand, just tells MAKER that you want to try to convert the exonerate hits directly into gene models (only do this for training and not final annotation). One way to drop the BLASTX results may be to filter the GFF3 results to keep only protein2genome features, pass those into protein_gff, and then turn off protein= for the next run. This forces the blastx results to be dropped. You may want to set the blast_depth parameters to something like 10 in maker_bopts.ctl before doing this to trim per-locus evidence depth to 10 if you are using too much input data.
>>>
>>> --Carson
>>>
>>>> On Jun 13, 2017, at 11:35 AM, Tim Fallon wrote:
>>>>
>>>> Hi there,
>>>>
>>>> I am aligning reference proteins to an insect genome through Maker, in preparation for using the gene models from the protein alignments as evidence to train SNAP (alongside de novo assembled RNA-Seq). I also plan on passing the protein alignments to a future Maker run as hints for SNAP / Augustus.
>>>>
>>>> I've noticed that the maker blastx "protein_match" feature, which I presume is a result of Maker trying to make the blastx HSPs contiguous to format as a reference for exonerate (this Maker run did have protein2genome turned on), tends to fuse tandem genes from the same gene family. See attached image.
>>>>
>>>> The red regions highlight two de novo assembled transcripts which I aligned manually, from two genes that are homologous. The top track is the blastx "match_part" features, the bottom track is the blastx "protein_match" features. You can see that the protein_match fuses the two genes, using ~1000 bp in an intervening region that doesn't have blastx HSP support in the blastx "match_part" track. The trick seems to be that a single reference protein has blastx matches on both the left and right gene.
>>>>
>>>> Clearly this isn't a good gene model to train SNAP with, but would this misannotation screw up the hints passed to pretrained SNAP / Augustus?
>>>>
>>>> Is there any way to prevent this protein_match fusing of adjacent similar genes from happening? For species that are closer, I've set "eval_blastx" to be a lot more stringent (1e-50), and in that case the genes don't get fused (but with that level of stringency it is more like an orthology search than a general annotation of protein similarity). I do have (rare) introns of ~1000 bp, so I wouldn't want to set the Maker "split_hit" parameter too low.
>>>>
>>>> All the best,
>>>> -Tim
>>>>
>>>> Timothy R. Fallon
>>>> PhD candidate
>>>> Laboratory of Jing-Ke Weng
>>>> Department of Biology
>>>> MIT
>>>>
>>>> tfallon at mit.edu
>>
>> Timothy R. Fallon
>> PhD candidate
>> Laboratory of Jing-Ke Weng
>> Department of Biology
>> MIT
>>
>> tfallon at mit.edu

Timothy R. Fallon
PhD candidate
Laboratory of Jing-Ke Weng
Department of Biology
MIT

tfallon at mit.edu
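
The filtering step suggested in the quoted reply above (keep only protein2genome features from the GFF3 so they can be passed to protein_gff) could be sketched roughly as follows. This is only an illustration, not a MAKER tool: the function name and the example records are invented here, and it assumes the exonerate-polished alignments carry `protein2genome` in the GFF3 source column (column 2), as in the features discussed in this thread.

```python
def keep_protein2genome(gff_lines):
    """Yield GFF3 feature lines whose source column (column 2) is protein2genome."""
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # stop at an embedded FASTA section, if the file has one
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives, and blank lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) == 9 and columns[1] == "protein2genome":
            yield line


if __name__ == "__main__":
    # Invented example records; real input would be the GFF3 from the earlier run.
    example = [
        "##gff-version 3\n",
        "scaffold_1\tblastx\tprotein_match\t100\t2900\t.\t+\t.\tID=hit1\n",
        "scaffold_1\tprotein2genome\tprotein_match\t100\t900\t.\t+\t.\tID=hit2\n",
        "scaffold_1\tprotein2genome\tmatch_part\t100\t900\t.\t+\t.\tID=hit2:p1;Parent=hit2\n",
    ]
    print("##gff-version 3")
    for kept in keep_protein2genome(example):
        print(kept, end="")
```

The output would then be the file named in protein_gff for the next run, with protein= left empty, as the reply describes.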
From labovolenta at gmail.com Mon Jul 10 12:57:21 2017
From: labovolenta at gmail.com (Luiz Augusto Bovolenta)
Date: Mon, 10 Jul 2017 15:57:21 -0300
Subject: [maker-devel] Error "Assertion ((sv)->sv_flags &" failed: file "mg.c"

Hi colleagues.

I recently installed MAKER using the manual steps for dependencies. However, when I try to execute the maker command I receive this error:

Assertion ((sv)->sv_flags & (0x00200000|0x00400000|0x00800000)) failed: file "mg.c", line 88 at /usr/lib/perl5/site_perl/5.10.0/Sys/SigAction.pm line 145.
Compilation failed in require at ./maker line 45.
BEGIN failed--compilation aborted at ./maker line 45.

Does anyone have any idea about this error?

Best regards
Luiz

From carsonhh at gmail.com Mon Jul 10 13:10:51 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 10 Jul 2017 13:10:51 -0600
Subject: [maker-devel] Error "Assertion ((sv)->sv_flags &" failed: file "mg.c"

If you are installing without MPI support, then something is wrong with your perl installation or one of the modules installed with your perl. You may want to reinstall perl, or try to reinstall the modules listed in the error one at a time using CPAN (use 'force install' to force a reinstall). Modules to try (some were given by name and others by line in your error):

forks
forks::shared
Sys::SigAction

Alternatively, if this is an MPI install, make sure you have added the required environment variables (i.e. LD_PRELOAD for OpenMPI) and command line flags (i.e. -mca btl ^openib) listed in the .../maker/INSTALL file, and that you are not running an incompatible MPI flavor such as MVAPICH2 (also explained in the .../maker/INSTALL file).

--Carson

From carsonhh at gmail.com Mon Jul 10 13:20:15 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 10 Jul 2017 13:20:15 -0600
Subject: [maker-devel] Maker protein match & tandem similar genes
References: <6AE52D93-3F7D-4389-BE7B-BD3033F3F316@mit.edu>
 <6DBD3F6E-C2DA-47AB-AF6B-DEA396789C79@gmail.com>
 <7164F2B4-FD8A-466F-B7CA-A1BEEF98223F@mit.edu>
Message-ID: <4D5E9712-E95B-4687-8706-2AB445191C89@gmail.com>

est2genome and protein2genome will almost always be partial. Also, the error rate on draft assemblies is much higher than most people realize. Beyond the issues already mentioned in the previous e-mail, there is also the issue that organisms are diploid but the assembly is haploid, so variation gets squashed, which also breaks ORFs (there are several examples of this in both the mature human and mouse genome assemblies). For many draft assemblies, you can expect ORF-affecting errors in as much as 10-15% of your annotations.

Try opening the cases with issues and manually editing them in Apollo. Possible sources of sequence guiding the annotation may become more apparent (look at mismatches in the mRNA-seq alignments relative to the assembly, for example). And if not, and the region is just too complex for the predictor, then you can force the model with Apollo.

--Carson

From carson.holt at genetics.utah.edu Thu Jul 13 11:00:18 2017
From: carson.holt at genetics.utah.edu (Carson Holt)
Date: Thu, 13 Jul 2017 17:00:18 +0000
Subject: [maker-devel] Question regarding MAKER

est2genome and protein2genome take BLAST hits, polish them with exonerate around splice sites, and then turn the alignment directly into a gene model. So if the alignment is partial because the EST or mRNA-seq does not cross the entire transcript, or the protein homology does not cross the entire CDS, then the resulting model will be partial. But hundreds of models, even partial ones, are sufficient to train SNAP. Then I usually do just one round of bootstrap training (more than that and you get into the overtraining paradox). So you can use just est2genome, just protein2genome, or both. You just need something to train SNAP with.

--Carson

On Jul 11, 2017, at 3:37 PM, Ghosh, Arnab wrote:

Hi Carson,

My name is Arnab and I am from Texas Tech University. I am using MAKER for gene annotation in a new genome assembly for a non-model organism. I have mostly figured out everything about this amazing piece of software but had two questions.

1. Is it okay to use only est2genome=1 and leave protein2genome=0 in the first round of running MAKER? Will it hurt my prediction and eventual annotation of genes if I don't use the protein2genome option ALONGSIDE est2genome in the first round? I have a protein fasta file for the same organism, but using the transcript fasta file (same organism) AND the protein fasta file for the whole genome (~2.2 GB in size) is just taking too long to finish.

2.
I will of course run SNAP in the second round, which also leads me to my second question: what, according to you, is an acceptable number of iterations to run bootstrapping of SNAP with MAKER?

Thanks and regards
Arnab

From zhushilin at frasergen.com Wed Jul 12 00:19:12 2017
From: zhushilin at frasergen.com (zhushilin at frasergen.com)
Date: Wed, 12 Jul 2017 14:19:12 +0800
Subject: [maker-devel] some suggestion
Message-ID: <2017071213575801507119@frasergen.com>

Dear developer,

It seems that MAKER can only run on an ordinary hard disk that supports structured data, as SQLite is used. When running on a Lustre filesystem, we got an I/O error and nothing was written to the .db file which saves the GFF information.

Maybe the best way is to check the filesystem automatically and use a different strategy to store the information in GFF files.

Best wishes

Shilin Zhu
R&D director
DEPT. of Bioinformatics
Wuhan Frasergen Bioinformatics Co., Ltd
B8 building, Biolake, 666 Gaoxin Road, Wuhan East Lake High-tech Zone, Wuhan 430075, China
T: 027-87224705 M: +86 18502745140
F: 027-87224785 E: service at frasergen.com
W: http://www.frasergen.com

Disclaimer
This e-mail is intended to be used only by persons entitled to receive such information and may contain information that is confidential, proprietary, and/or legally privileged. If you are not the intended recipient, you are hereby notified that any use, retention, disclosure, dissemination, copying, or taking any other action in reliance on contents of this e-mail is prohibited. If you have received this e-mail in error, please immediately contact the sender and delete the e-mail from your mailbox or any other storage mechanism. Thank you!
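
One way to narrow down an I/O error like the one described above is to check whether SQLite can create and write a small database in a given directory at all. The sketch below is only a generic diagnostic written with Python's standard sqlite3 module; it is not part of MAKER, the function name is invented for illustration, and passing this probe does not prove that a directory is suitable for MAKER.

```python
import os
import sqlite3
import tempfile


def sqlite_write_works(directory):
    """Return True if a small SQLite database can be created, written, and read in directory."""
    path = os.path.join(directory, "sqlite_probe_%d.db" % os.getpid())
    try:
        conn = sqlite3.connect(path)
        try:
            conn.execute("CREATE TABLE probe (x INTEGER)")
            conn.execute("INSERT INTO probe VALUES (1)")
            conn.commit()
            row = conn.execute("SELECT x FROM probe").fetchone()
        finally:
            conn.close()
        return row == (1,)
    except sqlite3.Error:
        # Covers failures to open the file as well as disk I/O and locking errors.
        return False
    finally:
        if os.path.exists(path):
            os.remove(path)


if __name__ == "__main__":
    # Probe the system temporary directory; substitute any directory of interest.
    print(sqlite_write_works(tempfile.gettempdir()))
```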
From carsonhh at gmail.com Mon Jul 17 23:06:10 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 17 Jul 2017 23:06:10 -0600
Subject: [maker-devel] some suggestion
In-Reply-To: <2017071213575801507119@frasergen.com>
References: <2017071213575801507119@frasergen.com>
Message-ID: <7A0B500C-B473-4587-9302-E076E42E7734@gmail.com>

The system we commonly use with MAKER is Lustre and we get no issues. We also commonly run MAKER on TACC, which uses Lustre for all its file systems. So there is no Lustre limitation.

MAKER does require that each node also have a local temporary directory for some operations which can generate high IOPS or require traditional flock support (which cannot be guaranteed on some NFS systems). These operations occur in the location specified by TMP= in the control file. Perhaps you are attempting to set your TMP value in the control files to a Lustre space, which can overload the MDS (metadata server) used by Lustre. Make sure you do not set TMP to a shared location. Your working directory can be a shared Lustre space and result files will be stored there, but IO operations that are not safe for shared spaces will occur in TMP, and TMP must be set to a local storage location (usually /tmp).

--Carson

From cjfields at illinois.edu Tue Jul 18 13:18:25 2017
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Tue, 18 Jul 2017 19:18:25 +0000
Subject: [maker-devel] some suggestion
In-Reply-To: <7A0B500C-B473-4587-9302-E076E42E7734@gmail.com>
References: <2017071213575801507119@frasergen.com>
 <7A0B500C-B473-4587-9302-E076E42E7734@gmail.com>
Message-ID: <56D57BD0-7102-4B7A-BFC4-3A72C2678CE4@illinois.edu>

We've worked with MAKER on Lustre and GPFS w/o significant issues (we're also in the process of setting up MAKER on a cluster using Ceph). I've always found the trickiest part of initial MAKER setup and testing is making sure the proper tempfile space is set (point to local disk or /dev/shm) and wrangling MPI issues.
chris From: maker-devel on behalf of Carson Holt Date: Tuesday, July 18, 2017 at 12:06 AM To: "zhushilin at frasergen.com" Cc: maker-devel Subject: Re: [maker-devel] some suggestion The system we commonly use with MAKER is Lustre and we get no issues. We also commonly run MAKER on TACC which uses Lustre for all it?s file systems. So there is no Lustre limitation. MAKER does require that each node also have a local temporary directory for some operations which can generate high IOPS or require that traditional flock support (cannot be guaranteed on some NFS systems). These operations occur in the location specified by TMP= in the control file. Perhaps you are attempting to set your TMP value in the control files to a Lustre space which can overload the MDS (metadata server) used by Lustre. Make sure you do not set TMP to a shared location. Your working directory can be a shared Lustre space and result files will be stored there, but IO operations that are not safe for shared spaces will occur in TMP, and TMP must be set to a local storage location (usually /tmp). --Carson On Jul 12, 2017, at 12:19 AM, zhushilin at frasergen.com wrote: Dear developer, It seems that MAKER can only run in the general hard disk which support structrued data, as SQLite was used. When running in lustre filesystem, we got I/O error and nothing was written to .db file which saves the gff information. Maybe the best way is to check the filesystem automatically and give the different strategy to store the information in gff files. Best wishes ________________________________ Shilin Zhu R&D director DEPT. 
of Bioinformatics Wuhan Frasergen Bioinformatics Co., Ltd B8 building?Biolake?666 Gaoxin Road?Wuhan East Lake High-tech Zone?Wuhan 430075?China T: 027-87224705?M: +86 18502745140 F: 027-87224785?E: service at frasergen.com W: http://www.frasergen.com ________________________________ Disclaimer This e-mail is intended to be used only by persons entitled to receive such information and may contain information that is confidential, proprietary, and/or legally privileged. If you are not the intended recipient, you are hereby notified that any use, retention, disclosure, dissemination, copying, or taking any other action in reliance on contents of this e-mail is prohibited. If you have received this e-mail in error, please immediately contact the sender and delete the e-mail from your mailbox or any other storage mechanism. Thank you! _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.com http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From patrick.tranvan at unil.ch Wed Jul 19 07:11:58 2017 From: patrick.tranvan at unil.ch (Patrick Tran Van) Date: Wed, 19 Jul 2017 13:11:58 +0000 Subject: [maker-devel] MAKER annotation post processing Message-ID: Hi, I have successfully annotated my genome with MAKER. Now I have a gff file that I want to post process /filter. In particular, I would like to discard genes that are below to a certain AED score. 1) Is there an AED treshold from where a gene is not strongly supported ? if yes, do you have some reference about this ? 2) Is there a script/software to process a gff file ? Thanks Patrick Tran Van Groups Chapuisat, Robinson-Rechavi & Schwander Department of Ecology and Evolution University of Lausanne Le Biophore CH-1015 Lausanne Switzerland Office 3206 -------------- next part -------------- An HTML attachment was scrubbed... 
From michael.s.campbell1 at gmail.com Wed Jul 19 11:20:07 2017
From: michael.s.campbell1 at gmail.com (Michael Campbell)
Date: Wed, 19 Jul 2017 13:20:07 -0400
Subject: [maker-devel] MAKER annotation post processing
In-Reply-To:
References:
Message-ID:

Hi Patrick,

For point 1, the best AED cutoff to use is quite arbitrary. For one of the last genomes that I annotated, we had a set of high-quality genes identified based on synteny with genes in closely related genomes. We plotted the distribution of AEDs for those genes and found that a cutoff of 0.28 captured 98% of the high-quality genes. This value will vary with the evidence provided. I've used 0.5 in the past as a more permissive filter.

For point 2, there is an accessory script in the MAKER bin directory called quality_filter.pl. It has an option (-a) that lets you supply an AED cutoff, and it will filter the GFF3 file based on that cutoff. For general processing of GFF3 files, there is a Perl library called GAL that is useful if you write code in Perl.

Take care,
Mike
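The AED cutoff discussed above can also be explored outside of quality_filter.pl. The following is a minimal sketch in Python, not a description of how quality_filter.pl works. It assumes that MAKER records each transcript's AED as an `_AED=` attribute on the mRNA lines of its GFF3 output (worth checking in your own file). The function names and the sample lines are hypothetical and made up for illustration.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column (key=value;key=value) into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def mrna_aed(gff_line):
    """Return the _AED value of an mRNA line as a float, or None otherwise."""
    cols = gff_line.rstrip("\n").split("\t")
    if len(cols) != 9 or cols[2] != "mRNA":
        return None
    value = parse_attributes(cols[8]).get("_AED")  # assumed attribute name
    return float(value) if value is not None else None


def ids_passing_cutoff(gff_lines, cutoff):
    """List the IDs of mRNA features whose AED is at or below the cutoff."""
    kept = []
    for line in gff_lines:
        if line.startswith("#"):
            continue  # skip GFF3 directives and comments
        aed = mrna_aed(line)
        if aed is not None and aed <= cutoff:
            attrs = parse_attributes(line.rstrip("\n").split("\t")[8])
            kept.append(attrs.get("ID"))
    return kept


# Hypothetical, made-up example lines (tab-separated).
EXAMPLE = [
    "##gff-version 3",
    "ctg1\tmaker\tgene\t100\t900\t.\t+\t.\tID=g1",
    "ctg1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=g1-RA;Parent=g1;_AED=0.12",
    "ctg1\tmaker\tmRNA\t2000\t2900\t.\t-\t.\tID=g2-RA;Parent=g2;_AED=0.75",
]

if __name__ == "__main__":
    print(ids_passing_cutoff(EXAMPLE, 0.5))
```

This only lists the transcript IDs that pass. A real filter would also have to keep or drop the parent gene and the child exon/CDS lines consistently, which is the part quality_filter.pl already takes care of.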
From qwzhang0601 at gmail.com Tue Jul 25 15:48:45 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Tue, 25 Jul 2017 17:48:45 -0400
Subject: [maker-devel] Repeat annotation by Maker2
Message-ID:

Hello:

We want to summarize the repeat statistics for a genome annotated by Maker2, but we are not clear on what the annotation means. Would you explain? Many thanks!

Let me take this example:

CasCan_contig_16053 repeatmasker match 35887 35996 423 + . ID=CasCan_contig_16053:hit:51261:1.3.0.0;Name=species:Charlie4z|genus:DNA%2FhAT-Charlie;Target=species:Charlie4z|genus:DNA%2FhAT-Charlie 48 161 +

(1) "35887" and "35996" are the start and end positions of the "match" in this contig, so this repeat element covers 35996-35887+1 (i.e., 110 bp) of the contig. Right?

(2) What do "Name=species" and "Target=species" mean?

(3) "genus" shows the type of repeat element, right? Then what does "%" mean in "DNA%2FhAT-Charlie"?

(4) What do "48" and "161" mean? Are they the coordinates of the "match" in the repeat element?

Examples:

CasCan_contig_16053 repeatmasker match 35887 35996 423 + . ID=CasCan_contig_16053:hit:51261:1.3.0.0;Name=species:Charlie4z|genus:DNA%2FhAT-Charlie;Target=species:Charlie4z|genus:DNA%2FhAT-Charlie 48 161 +
CasCan_contig_16053 repeatmasker match_part 35887 35996 423 + . ID=CasCan_contig_16053:hsp:120045:1.3.0.0;Parent=CasCan_contig_16053:hit:51261:1.3.0.0;Target=species:Charlie4z|genus:DNA%252FhAT-Charlie 48 161 +
CasCan_contig_16053 repeatmasker match 36842 37881 2546 + . ID=CasCan_contig_16053:hit:51262:1.3.0.0;Name=species:L1MC1_EC|genus:LINE%2FL1;Target=species:L1MC1_EC|genus:LINE%2FL1 5384 6062 +
CasCan_contig_16053 repeatmasker match_part 36842 37881 2546 + . ID=CasCan_contig_16053:hsp:120046:1.3.0.0;Parent=CasCan_contig_16053:hit:51262:1.3.0.0;Target=species:L1MC1_EC|genus:LINE%252FL1 5384 6062 +

Best
Quanwei
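The arithmetic and the encoding in the question above can be checked against the GFF3 format itself. This is a small sketch, not MAKER code. It relies only on what the GFF3 specification states: start and end are 1-based and inclusive, reserved characters in column 9 are percent-encoded (so %2F is "/", and %252F is that same text encoded a second time), and a Target attribute holds the target name followed by its start and end. The function name is hypothetical, and the line is split on whitespace only because the tabs were lost in this archive.

```python
from urllib.parse import unquote

# The first example line from the message, copied as it appears in the archive.
EXAMPLE = (
    "CasCan_contig_16053 repeatmasker match 35887 35996 423 + . "
    "ID=CasCan_contig_16053:hit:51261:1.3.0.0;"
    "Name=species:Charlie4z|genus:DNA%2FhAT-Charlie;"
    "Target=species:Charlie4z|genus:DNA%2FhAT-Charlie 48 161 +"
)


def describe_hit(line):
    """Summarize one repeat hit from a GFF3 line (hypothetical helper)."""
    # Nine GFF3 columns; the ninth keeps its internal spaces (Target value).
    cols = line.split(None, 8)
    start, end = int(cols[3]), int(cols[4])
    attrs = dict(
        field.split("=", 1) for field in cols[8].split(";") if "=" in field
    )
    # Target is "<name> <start> <end> [strand]" per the GFF3 specification.
    target_name, target_start, target_end = attrs["Target"].split()[:3]
    return {
        "seqid": cols[0],
        "length": end - start + 1,  # 1-based, inclusive coordinates
        "name": unquote(attrs["Name"]),  # decode %2F back to "/"
        "target_name": unquote(target_name),
        "target_start": int(target_start),
        "target_end": int(target_end),
    }


if __name__ == "__main__":
    for key, value in describe_hit(EXAMPLE).items():
        print(key, value)
```

Summing the `length` values per decoded `genus:` label over all `match` lines would give a rough per-class repeat total. Overlapping hits would be counted twice in such a sum, so treat it as an upper bound.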
From qwzhang0601 at gmail.com Thu Jul 27 14:25:19 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Thu, 27 Jul 2017 16:25:19 -0400
Subject: [maker-devel] update the repeatMasker
Message-ID:

Hello:

We installed Maker2 on our server. The existing RepeatMasker version is too old, so I plan to install the latest version. What do I need to do so that Maker2 calls the latest version of RepeatMasker rather than the existing old one?

Many thanks

Best
Quanwei

From carsonhh at gmail.com Sat Jul 29 16:20:10 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Sat, 29 Jul 2017 16:20:10 -0600
Subject: [maker-devel] update the repeatMasker
In-Reply-To:
References:
Message-ID: <47CF01D4-8146-4A48-9FEE-E9BE75A03C82@gmail.com>

The locations of the executables MAKER uses are set in the maker_exe.ctl file. Default values in that file are drawn from your PATH environment variable, so you need to add the new RepeatMasker installation to your PATH. Also, if your maker_exe.ctl predates the change to PATH, you will need to set the location in that file manually.

Thanks,
Carson

From qwzhang0601 at gmail.com Mon Jul 31 10:42:04 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Mon, 31 Jul 2017 12:42:04 -0400
Subject: [maker-devel] repeats masking
Message-ID:

Hello:

We are using the Maker2 pipeline to annotate a new genome. We just read something about repeat masking in RepeatMasker's documentation. It suggests leaving low-complexity regions unmasked and doing gene annotation on both the masked and the unmasked genome. What are your opinion and suggestions on this? Many thanks

The paragraph below is from http://www.binfo.ncku.edu.tw/RM/webrepeatmaskerhelp.html

Use in association with gene prediction programs

Predicting genes from a masked sequence faces several problems. First, one should not mask low complexity regions, e.g. to avoid masking trinucleotide repeats in coding regions. But even with only interspersed repeats masked, gene prediction programs may fail to identify exons correctly. As mentioned above, sometimes tail ends of coding regions may have originated from transposable elements. Even if no coding regions have been masked, splice sites may be compromised; e.g. the polypyrimidine region that is part of the acceptor splice site may be contained within a repeat.

Thus, I generally recommend to run a gene prediction program on unmasked DNA (as well) and compare the predicted genes and exons with the RepeatMasker output. Some gene prediction program allow you to force certain exons out of the predictions (e.g. often the old ORFs of LINE1 elements and endogenous retroviruses are included in genes). Work is also in progress at several sites to incorporate RepeatMasker into gene prediction programs, in which cases matches to repeats are weighted in along with the other parameters used.
Best

Quanwei

From carsonhh at gmail.com Mon Jul 31 10:48:44 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 31 Jul 2017 10:48:44 -0600
Subject: [maker-devel] repeats masking
In-Reply-To:
References:
Message-ID:

MAKER uses the masking primarily for the evidence alignment step. Low-complexity regions are soft masked, which means alignments can extend through them but must seed outside of the masked region first. Successful BLAST alignments are then polished using exonerate on the unmasked region.

Also, for the gene predictors, the first run is done with hard masking of the transposons only, so they can still predict in low-complexity regions. The second round of hint-based prediction is done on the unmasked assembly. So MAKER already handles all the issues you are mentioning.

--Carson

From leo1985.arnab at gmail.com Wed Jul 19 14:24:13 2017
From: leo1985.arnab at gmail.com (Arnab Ghosh)
Date: Wed, 19 Jul 2017 15:24:13 -0500
Subject: [maker-devel] BUSCO trained Augustus parameter for MAKER
Message-ID:

Hello,

I am trying to annotate genes in a non-model organism. I am using MAKER for this purpose and have trained SNAP, and I plan on using GeneMark and Augustus as well for the predictions.

After searching the MAKER-devel Google group, I understand BUSCO is a good choice for training Augustus and decided to give it a shot. From one of the earlier posts on BUSCO, I read that it is supposed to generate a species-name-specific folder in the config directory of Augustus. However, I did not get one. The BUSCO run finished fine, with the run_species folder and all the necessary tables, the hmm folder and other files generated under the run_species folder.

This is the command I used to run BUSCO:

python run_BUSCO.py --in mySpecies.fasta --out my_species --long --lineage_path /usr/home/aves_odb9 --mode genome

So there was no "my_species" folder generated under the config directory of Augustus.

Did I miss something? My question is: what should I pass to the "augustus_species" variable in the MAKER control file for the next run of MAKER?

Thanks so much for your time!

From dandence at gmail.com Mon Jul 31 10:53:11 2017
From: dandence at gmail.com (Daniel Ence)
Date: Mon, 31 Jul 2017 12:53:11 -0400
Subject: [maker-devel] repeats masking
In-Reply-To:
References:
Message-ID: <6FCA8DEC-EA5E-489F-A83C-2BA792E1F77D@gmail.com>

Hi Quanwei,

Running MAKER on the unmasked genome will probably give you more genes, but won't be helpful in the end. MAKER soft-masks repeats, which prevents BLAST alignments from being seeded in the masked regions but still allows them to extend into those regions. This solves the problem of missing exons mentioned in the text you sent. There's an option in the control file to run the ab initio programs on the unmasked sequence ("unmask"), which is set to false (0) by default.

Hope this helps,
Daniel Ence

From qwzhang0601 at gmail.com Mon Jul 31 10:59:02 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Mon, 31 Jul 2017 12:59:02 -0400
Subject: [maker-devel] repeats masking
In-Reply-To:
References:
Message-ID:

Hi Carson:

I see. Thank you for your explanation!

Best
Quanwei

From carsonhh at gmail.com Mon Jul 31 11:02:21 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 31 Jul 2017 11:02:21 -0600
Subject: [maker-devel] repeats masking
In-Reply-To: <6FCA8DEC-EA5E-489F-A83C-2BA792E1F77D@gmail.com>
References: <6FCA8DEC-EA5E-489F-A83C-2BA792E1F77D@gmail.com>
Message-ID: <256BF853-3B43-4975-9D5D-9D4F30A27AC3@gmail.com>

Please note that the unmask option Dan is talking about is a feature to run both masked and unmasked raw predictions in the first round of prediction (it does not affect alignment or the second round of prediction). It tends to increase the false positive rate, but it can be a quick test when you believe you are missing a gene because of overmasking from a user-created library and protein/EST evidence is overly sparse (so the gene cannot be recovered through evidence alignment and the second round of unmasked prediction).

--Carson

From carsonhh at gmail.com Mon Jul 31 11:09:58 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 31 Jul 2017 11:09:58 -0600
Subject: [maker-devel] BUSCO trained Augustus parameter for MAKER
In-Reply-To:
References:
Message-ID: <453DB44C-2277-4939-A0BA-0DE083F3E380@gmail.com>

Hello Arnab,

Perhaps someone on the list will chime in with an answer for you, but you may also want to post directly to BUSCO, since your question is entirely related to BUSCO and you are more likely to get a quick response there.

-> https://gitlab.com/ezlab/busco/issues

Thanks,
Carson

From qwzhang0601 at gmail.com Mon Jul 31 17:02:29 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Mon, 31 Jul 2017 19:02:29 -0400
Subject: [maker-devel] Pseudogene identification
Message-ID:

Hello:

We used Maker2 to annotate a new rodent genome. Using the annotated genes, we did a gene family expansion analysis and found several gene families under expansion in the new rodent genome. We now want to check whether some of the annotated genes are pseudogenes that account for the expansion. Do you have any suggestions on this?

We found that MAKER-P can annotate pseudogenes, but we are not sure whether it is worth repeating our annotation with MAKER-P. Besides, we are not sure whether the default parameters of MAKER-P are good for a rodent species. What's more, in my understanding MAKER-P identifies pseudogenes in the intergenic space (so I think the annotated coding genes will not be tested and checked).

Do you have any suggestions to solve our problem? We do not want to identify pseudogenes genome-wide; we only want to check the genes showing expansion (to make sure all those gene copies are really functional).

Many thanks

Best
Quanwei
From xvazquezc at gmail.com Mon Jul 31 17:46:52 2017
From: xvazquezc at gmail.com (Xabier Vázquez-Campos)
Date: Tue, 1 Aug 2017 09:46:52 +1000
Subject: [maker-devel] BUSCO trained Augustus parameter for MAKER
In-Reply-To: <453DB44C-2277-4939-A0BA-0DE083F3E380@gmail.com>
References: <453DB44C-2277-4939-A0BA-0DE083F3E380@gmail.com>
Message-ID:

AFAIK, the profile can be the folder run_my_species/augustus_output/retraining_parameters, and the files inside would look like BUSCO_my_species_[a number]_*.*

Create a folder in the Augustus species directory and copy these files into it. Note that the folder and the included files should have the same name prefix, e.g. BUSCO_my_species_[a number]; otherwise, whenever you call the profile, Augustus is going to freak out.

The prefix will be your augustus_species.

--
Xabier Vázquez-Campos, PhD
Research Associate
NSW Systems Biology Initiative
School of Biotechnology and Biomolecular Sciences
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

From carsonhh at gmail.com Mon Jul 31 17:54:12 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 31 Jul 2017 17:54:12 -0600
Subject: [maker-devel] Pseudogene identification
In-Reply-To:
References:
Message-ID:

The MAKER-P fork was merged back into standard MAKER with version 2.29 (roughly 3 years ago; a separate download no longer exists). This is because MAKER-P's functionality is almost entirely in accessory scripts and written protocols. The .../maker/bin/maker called by both MAKER2 and MAKER-P is actually the same script. So there is no need to rerun: if you are using version 2.29 or later, you already ran it.

Pseudogene calling is therefore handled by accessory scripts and protocols, which you can find here -> http://shiulab.plantbiology.msu.edu/wiki/index.php/Protocol:Pseudogene

The other MAKER-P protocols can be found here -> http://www.yandell-lab.org/software/maker-p.html

--Carson
URL: From patrick.tranvan at unil.ch Sat Jul 1 05:21:37 2017 From: patrick.tranvan at unil.ch (Patrick Tran Van) Date: Sat, 1 Jul 2017 11:21:37 +0000 Subject: [maker-devel] Advice on my pipeline In-Reply-To: <696C51C6-5606-4ECB-A8B8-9C077182FFFA@gmail.com> References: <6b029690bace4d3fbae77c0bb1bddce8@prdexch02.ad.unil.ch> <1498470630221.84642@unil.ch>, <696C51C6-5606-4ECB-A8B8-9C077182FFFA@gmail.com> Message-ID: <1498908228256.16549@unil.ch> So I have assembled my transcriptome with Trinity using the jaccard clip option and I have run maker with and without corrected_est_fusion. I have then use SNAP to train/filter it with: maker2zff specie.all.gff Here are my results: Number of gene after maker -> Number of gene after maker2zff - Without corrected_est_fusion: 21621 -> 13875 - With corrected_est_fusion: 16850 -> 9098 1 )If I understand well how works corrected_est_fusion, because it prevents gene merging, shouldn't be the invert ? Normally I should find more genes with corrected_est_fusion right ? 2) I think I should find something like 13000-14000 genes for my specie. SHould I go with the "Without corrected_est_fusion" for the 2nd iteration of maker ? Thanks for your help Patrick Tran Van Groups Chapuisat, Robinson-Rechavi & Schwander Department of Ecology and Evolution University of Lausanne Le Biophore CH-1015 Lausanne Switzerland Office 3206 ________________________________ From: Carson Holt Sent: Monday, June 26, 2017 11:38 PM To: Patrick Tran Van Cc: maker-devel at yandell-lab.org Subject: Re: [maker-devel] Advice on my pipeline Sorry the option is ?> correct_est_fusion It is in the maker_opts.ctl file. I would use both SNAP and Augustus on a few large contigs then review the results manually. If one of them is not behaving well, then drop it. If both behave well (i.e. correlate well with evidence alignemnts) then keep them both. ?Carson On Jun 26, 2017, at 3:48 AM, Patrick Tran Van > wrote: Thanks for your answer. 
1) Do you think that adding a Augustus training in addition to SNAP at the step 3 and 5 will add more confidence (instead of adding Augustus only for the final round) ? Because I am using autoAug for this and it tooks a while to compute .. 2) I don't see this option : 'avoid_est_fusion=1' . I have tried to add it but I got this error: WARNING: Invalid option 'avoid_est_fusion' in control file maker_opts.ctl (I am using v 2.31.8 ) Patrick Tran Van Groups Chapuisat, Robinson-Rechavi & Schwander Department of Ecology and Evolution University of Lausanne Le Biophore CH-1015 Lausanne Switzerland Office 3206 ________________________________ From: Carson Holt > Sent: Monday, June 5, 2017 8:29 PM To: Patrick Tran Van Cc: maker-devel at yandell-lab.org Subject: Re: [maker-devel] Advice on my pipeline Your plan sounds good. A couple of related notes. Insect genomes tend to have high gene density, so gene merging will be the primary difficulty. You can avoid merging of mRNA-seq evidence by using options like jaccard_clip in Trinity. Then use avoid_est_fusion=1 inside of MAKER. Also it is more convenient to do each run in the same directory rather than supplying the previous run as GFF3 input. MAKER will automatically recycle previous results archived in the run directory when you do this. Using the maker_gff option is really more for getting data into the run from jobs performed a long time ago (so they can?t be run in the same directory). ?Carson On Jun 2, 2017, at 3:56 AM, Patrick Tran Van > wrote: Hello, This is my first time running Maker for an insect genome annotation. 
I have found various resources and tried to make a consensus, I am looking for your thoughts and advices about my pipeline, if I can improve something or doing useless things: What I have: - RNA evidence: transcriptome - Proteine evidence: swissprot/uniprot + busco protein set of insect - Cegma and busco results of my genome 1) Train SNAP with CEGMA 2) Run (run A) maker with repeat masking with transcript, protein, the new SNAP file (from step 1) and augustus file (from busco). 3) Create SNAP model from run A. 4) Run (run B ) with the new SNAP (done at step 3) with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (maker_gff=run_A.gff), turn off repeat masking (rm_pass=1), and use previous mapping results (altest_pass=1 and protein_pass=1). 5) Create SNAP model from run B. 6) Run (run C) with the new SNAP (done at step 5) with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (maker_gff=run_B.gff), turn off repeat masking (rm_pass=1), and use previous mapping results (altest_pass=1 and protein_pass=1). 7) Create SNAP model from run C AND Create Augustus gene model from run C 8) Run (run D) with the new SNAP (done at step 7) + AUGUSTUS file (step 7) with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (maker_gff=run_C.gff), turn off repeat masking (rm_pass=1), and use previous mapping results (altest_pass=1 and protein_pass=1). + Use keep_preds=1 Does it seems coherent ? Cheers, Patrick Tran Van Groups Chapuisat, Robinson-Rechavi & Schwander Department of Ecology and Evolution University of Lausanne Le Biophore CH-1015 Lausanne Switzerland Office 3206 _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.com http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From carsonhh at gmail.com Sat Jul 1 11:41:28 2017 From: carsonhh at gmail.com (Carson Holt) Date: Sat, 1 Jul 2017 11:41:28 -0600 Subject: [maker-devel] maker with MPI and perl using threads In-Reply-To: References: <9218f472-d9e4-179c-c712-ddf09e97d8fc@hovedpuden.dk>

Message-ID:

FindBin is necessary for library control and is safe to load before forks.

--Carson

> On Jul 1, 2017, at 11:38 AM, John Damm Sørensen wrote:
>
> Thanks Carson,
>
> One thing bothers me. That's this from Perl forks documentation:
>
> module load order: forks first
>
> Since forks overrides core Perl functions, you are *strongly* encouraged to load the forks module before any other Perl modules. This will insure the most consistent and stable system behavior. This can be easily done without affecting existing code, like:
>
> perl -Mforks script.pl
>
> But in the maker perlscript the module FindBin that in turn loads a bunch of other modules is loaded before forks.
>
> Is that intentional?
>
> Best
>
> John
>
>
>
>
> On 29-06-2017 at 22:56, Carson Holt wrote:
>> When you see calls like threads->new() in the maker script, it is creating a fork and not a thread. It's a convenience feature activated by the "use forks" line at the top of the script. It allows you to use thread like syntax when working with forks.
>>
>> --Carson
>>
>>
>>
>>> On Jun 29, 2017, at 2:43 PM, Carson Holt wrote:
>>>
>>> MAKER doesn't use threads. It uses forks. There are several reasons for this, including that Perl already requires you to use hidden fork operations every time you call system() or open(). So trying to get forks, threads, and MPI together is just a mess. So we stuck with just forks and MPI. With MPI flavors with direct infiniband support you can get weird errors because of how forks affect registered memory when the MPI flavor uses OpenIB libraries. There is also another issue with how perl wraps malloc calls that affects registered memory. The best way around both these issues is to disable direct infiniband support, and then set the MPI flavor to use -tcp over the ip-over-infiniband virtual adaptor (usually ib0) or use the ethernet adapter.
Sometime last year after a system update we also got an OpenMPI error (only on CentOS6) that referred to a file used for Perl threads (which should not even be in use since we are using forks). We worked around that issue by compile Perl without thread support so that the library couldn't be called. >>> >>> If you were using OpenMPI and were seeing a reference to Perl threads, then the error you saw may have been related to the latter one I mentioned. >>> >>> I have tried the MPI_Init_thread option before and had run into issues because of it without it helping any of the previously mentioned fork related issues. But that was some time ago, so I could try it as a solution to the last issue I mentioned rather than installing a no-thread version of perl (if I can ever replicate the error because it went away when we updated from CentOS kernel 6 to kernel 7). >>> >>> Thanks, >>> Carson >>> >>> >>>> On Jun 28, 2017, at 2:54 AM, John Damm S?rensen wrote: >>>> >>>> Hello, >>>> >>>> Recently I assisted one of my customers with problems solving maker using MPI. >>>> >>>> It seems that the main reason for the trouble was maker not initializing the MPI environment for thread save execution. >>>> >>>> In the MPI.pm module you call MPI_Init whereas for a threaded environment you should call MPI_Init_thread. >>>> >>>> I think it would be a good idea to detect whether the users Perl is with thread support and init MPI accordingly or clearly state that maker is for unthreaded Perl only. 
>>>> >>>> During the debugging we also found that it was beneficial to have the latest mxm.c installed: >>>> >>>> https://community.mellanox.com/thread/3439 >>>> >>>> >>>> Best Regards >>>> >>>> John Damm S?rensen >>>> >>>> IT consultant >>>> >>>> >>>> _______________________________________________ >>>> maker-devel mailing list >>>> maker-devel at box290.bluehost.com >>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > From john at hovedpuden.dk Sat Jul 1 11:38:14 2017 From: john at hovedpuden.dk (=?UTF-8?Q?John_Damm_S=c3=b8rensen?=) Date: Sat, 1 Jul 2017 19:38:14 +0200 Subject: [maker-devel] maker with MPI and perl using threads In-Reply-To: References: <9218f472-d9e4-179c-c712-ddf09e97d8fc@hovedpuden.dk>

Message-ID:

Thanks Carson,

One thing bothers me. That's this from Perl forks documentation:

module load order: forks first

Since forks overrides core Perl functions, you are *strongly* encouraged to load the forks module before any other Perl modules. This will insure the most consistent and stable system behavior. This can be easily done without affecting existing code, like:

perl -Mforks script.pl

But in the maker perlscript the module FindBin that in turn loads a bunch of other modules is loaded before forks.

Is that intentional?

Best

John

On 29-06-2017 at 22:56, Carson Holt wrote:
> When you see calls like threads->new() in the maker script, it is creating a fork and not a thread. It's a convenience feature activated by the "use forks" line at the top of the script. It allows you to use thread like syntax when working with forks.
>
> --Carson
>
>
>
>> On Jun 29, 2017, at 2:43 PM, Carson Holt wrote:
>>
>> MAKER doesn't use threads. It uses forks. There are several reasons for this, including that Perl already requires you to use hidden fork operations every time you call system() or open(). So trying to get forks, threads, and MPI together is just a mess. So we stuck with just forks and MPI. With MPI flavors with direct infiniband support you can get weird errors because of how forks affect registered memory when the MPI flavor uses OpenIB libraries. There is also another issue with how perl wraps malloc calls that affects registered memory. The best way around both these issues is to disable direct infiniband support, and then set the MPI flavor to use -tcp over the ip-over-infiniband virtual adaptor (usually ib0) or use the ethernet adapter. Sometime last year after a system update we also got an OpenMPI error (only on CentOS6) that referred to a file used for Perl threads (which should not even be in use since we are using forks). We worked around that issue by compiling Perl without thread support so that the library couldn't be called.
>> >> If you were using OpenMPI and were seeing a reference to Perl threads, then the error you saw may have been related to the latter one I mentioned. >> >> I have tried the MPI_Init_thread option before and had run into issues because of it without it helping any of the previously mentioned fork related issues. But that was some time ago, so I could try it as a solution to the last issue I mentioned rather than installing a no-thread version of perl (if I can ever replicate the error because it went away when we updated from CentOS kernel 6 to kernel 7). >> >> Thanks, >> Carson >> >> >>> On Jun 28, 2017, at 2:54 AM, John Damm S?rensen wrote: >>> >>> Hello, >>> >>> Recently I assisted one of my customers with problems solving maker using MPI. >>> >>> It seems that the main reason for the trouble was maker not initializing the MPI environment for thread save execution. >>> >>> In the MPI.pm module you call MPI_Init whereas for a threaded environment you should call MPI_Init_thread. >>> >>> I think it would be a good idea to detect whether the users Perl is with thread support and init MPI accordingly or clearly state that maker is for unthreaded Perl only. 
>>>
>>> During the debugging we also found that it was beneficial to have the latest mxm.c installed:
>>>
>>> https://community.mellanox.com/thread/3439
>>>
>>>
>>> Best Regards
>>>
>>> John Damm Sørensen
>>>
>>> IT consultant
>>>
>>>
>>> _______________________________________________
>>> maker-devel mailing list
>>> maker-devel at box290.bluehost.com
>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From carsonhh at gmail.com Mon Jul 3 14:50:21 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 3 Jul 2017 14:50:21 -0600
Subject: [maker-devel] Advice on my pipeline
In-Reply-To: <1498908228256.16549@unil.ch>
References: <6b029690bace4d3fbae77c0bb1bddce8@prdexch02.ad.unil.ch> <1498470630221.84642@unil.ch> <696C51C6-5606-4ECB-A8B8-9C077182FFFA@gmail.com> <1498908228256.16549@unil.ch>
Message-ID: <58E904BF-9AB8-4AC7-B10B-C902F414E03D@gmail.com>

maker2zff is just for SNAP training and not for gene filtering (please do not use it for filtering, it does not do what you think). So the final annotation set after maker with correct_est_fusion is 16,850.

To decide which set is better, look at them in a browser (gene counts are not useful for gauging results). A well annotated genome will have evidence clusters that closely match the final models. A poorly annotated genome will have evidence clusters that are split or merged by the models.

The correct_est_fusion option does two things. It trims long overlapping UTR fragments, and it stops evidence clusters from being merged on BLASTP evidence alone (so gene predictors will get unmerged hint regions if clusters are split).

You may also find that using jaccard_clip with Trinity has reduced sensitivity for the transcript data (you may lose things that were there before, but now have better specificity, i.e. fewer false positives). Make sure you provided protein data from at least two related species to help maintain sensitivity lost from the transcript data.
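
To make the point about counts concrete: the size of the final annotation set can be read straight from the GFF3 that MAKER writes, without going through maker2zff. The following is only a minimal Python sketch under stated assumptions, not part of MAKER. The function name count_features is made up for this example, and it assumes the final models are the rows whose source column (column 2) is "maker", with any embedded sequence following a ##FASTA line.

```python
from collections import Counter
from typing import Iterable


def count_features(gff_lines: Iterable[str], source: str = "maker") -> Counter:
    """Count GFF3 feature types (column 3) among rows from one source (column 2).

    Hypothetical helper for illustration only; it is not a MAKER script.
    """
    counts: Counter = Counter()
    for line in gff_lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # embedded sequence follows, so no more feature rows
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a well-formed GFF3 feature row
        if cols[1] == source:
            counts[cols[2]] += 1
    return counts
```

For example, opening specie.all.gff, passing the file handle to count_features and reading the "gene" entry of the result gives the number of final gene models for that run. As noted above, such counts only describe the size of a set; deciding which set is better still means looking at the models against the evidence in a browser.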

You can also add rejected gene models back in after the fact by using iprscan to identify unsupported models with identifiable protein domains --> https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4286374/

Thanks,
Carson

> On Jul 1, 2017, at 5:21 AM, Patrick Tran Van wrote:
>
> So I have assembled my transcriptome with Trinity using the jaccard clip option and I have run maker with and without corrected_est_fusion.
>
> I have then use SNAP to train/filter it with:
>
> maker2zff specie.all.gff
>
> Here are my results:
>
> Number of gene after maker -> Number of gene after maker2zff
>
> - Without corrected_est_fusion: 21621 -> 13875
> - With corrected_est_fusion: 16850 -> 9098
>
> 1 )If I understand well how works corrected_est_fusion, because it prevents gene merging, shouldn't be the invert ?
> Normally I should find more genes with corrected_est_fusion right ?
>
> 2) I think I should find something like 13000-14000 genes for my specie. SHould I go with the "Without corrected_est_fusion" for the 2nd iteration of maker ?
>
> Thanks for your help
>
>
>
> Patrick Tran Van
>
> Groups Chapuisat, Robinson-Rechavi & Schwander
> Department of Ecology and Evolution
> University of Lausanne
> Le Biophore
> CH-1015 Lausanne
> Switzerland
> Office 3206
>
> From: Carson Holt
> Sent: Monday, June 26, 2017 11:38 PM
> To: Patrick Tran Van
> Cc: maker-devel at yandell-lab.org
> Subject: Re: [maker-devel] Advice on my pipeline
>
> Sorry the option is --> correct_est_fusion
>
> It is in the maker_opts.ctl file.
>
> I would use both SNAP and Augustus on a few large contigs then review the results manually. If one of them is not behaving well, then drop it. If both behave well (i.e. correlate well with evidence alignments) then keep them both.
>
> --Carson
>
>
>
>> On Jun 26, 2017, at 3:48 AM, Patrick Tran Van wrote:
>>
>> Thanks for your answer.
>> >> 1) Do you think that adding a Augustus training in addition to SNAP at the step 3 and 5 will add more confidence (instead of adding Augustus only for the final round) ? >> Because I am using autoAug for this and it tooks a while to compute .. >> >> 2) I don't see this option : 'avoid_est_fusion=1' . I have tried to add it but I got this error: >> >> WARNING: Invalid option 'avoid_est_fusion' in control file maker_opts.ctl >> >> (I am using v 2.31.8 ) >> >> >> Patrick Tran Van >> >> Groups Chapuisat, Robinson-Rechavi & Schwander >> Department of Ecology and Evolution >> University of Lausanne >> Le Biophore >> CH-1015 Lausanne >> Switzerland >> Office 3206 >> >> From: Carson Holt > >> Sent: Monday, June 5, 2017 8:29 PM >> To: Patrick Tran Van >> Cc: maker-devel at yandell-lab.org >> Subject: Re: [maker-devel] Advice on my pipeline >> >> Your plan sounds good. A couple of related notes. >> >> Insect genomes tend to have high gene density, so gene merging will be the primary difficulty. You can avoid merging of mRNA-seq evidence by using options like jaccard_clip in Trinity. Then use avoid_est_fusion=1 inside of MAKER. >> >> Also it is more convenient to do each run in the same directory rather than supplying the previous run as GFF3 input. MAKER will automatically recycle previous results archived in the run directory when you do this. Using the maker_gff option is really more for getting data into the run from jobs performed a long time ago (so they can?t be run in the same directory). >> >> ?Carson >> >> >>> On Jun 2, 2017, at 3:56 AM, Patrick Tran Van > wrote: >>> >>> Hello, >>> >>> This is my first time running Maker for an insect genome annotation. 
>>> >>> I have found various resources and tried to make a consensus, I am looking for your thoughts and advices about my pipeline, if I can improve something or doing useless things: >>> >>> >>> What I have: >>> - RNA evidence: transcriptome >>> - Proteine evidence: swissprot/uniprot + busco protein set of insect >>> - Cegma and busco results of my genome >>> >>> >>> 1) Train SNAP with CEGMA >>> >>> 2) Run (run A) maker with repeat masking with transcript, protein, the new SNAP file (from step 1) and augustus file (from busco). >>> >>> 3) Create SNAP model from run A. >>> >>> 4) Run (run B ) with the new SNAP (done at step 3) with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (maker_gff=run_A.gff), turn off repeat masking (rm_pass=1), and use previous mapping results (altest_pass=1 and protein_pass=1). >>> >>> 5) Create SNAP model from run B. >>> >>> 6) Run (run C) with the new SNAP (done at step 5) with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (maker_gff=run_B.gff), turn off repeat masking (rm_pass=1), and use previous mapping results (altest_pass=1 and protein_pass=1). >>> >>> 7) Create SNAP model from run C AND Create Augustus gene model from run C >>> >>> 8) Run (run D) with the new SNAP (done at step 7) + AUGUSTUS file (step 7) with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (maker_gff=run_C.gff), turn off repeat masking (rm_pass=1), and use previous mapping results (altest_pass=1 and protein_pass=1). + Use keep_preds=1 >>> >>> >>> >>> Does it seems coherent ? 
>>>
>>> Cheers,
>>>
>>> Patrick Tran Van
>>>
>>> Groups Chapuisat, Robinson-Rechavi & Schwander
>>> Department of Ecology and Evolution
>>> University of Lausanne
>>> Le Biophore
>>> CH-1015 Lausanne
>>> Switzerland
>>> Office 3206
>>>
>>> _______________________________________________
>>> maker-devel mailing list
>>> maker-devel at box290.bluehost.com
>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>
>>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From carsonhh at gmail.com Mon Jul 3 15:04:40 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 3 Jul 2017 15:04:40 -0600
Subject: [maker-devel] Possible ways to improve annotated gene numbers
In-Reply-To: <82A335D3-085B-414E-802C-8E2918EA7EA0@ucr.edu>
References: <82A335D3-085B-414E-802C-8E2918EA7EA0@ucr.edu>
Message-ID: <903B12C5-CC57-46F3-B3E6-1322C9155F2F@gmail.com>

MAKER excludes models without evidence support (this is because gene predictors can overcall by as much as a factor of 10, i.e. lots of false positives). So you may be lacking in either protein or transcript evidence (you should always supply a minimum of 2 related proteomes for any MAKER analysis - transcript evidence by itself is insufficient).

You can also try and rescue models based on protein domain content using iprscan. Details in this protocol paper --> https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4286374/

--Carson

> On Jun 30, 2017, at 1:30 PM, Qihua Liang wrote:
>
> Dear Maker Development Team,
>
> Hi, I am using Maker for annotation and BUSCO to evaluate the completeness.
>
> For de novo predictions, I am using Augustus, GeneMark, and SNAP, and the annotated proteins have completeness of ~80%, ~50%, ~50% correspondingly. When I cat all de novo annotated proteins of these three tools, the completeness is much higher as ~92%.
>
> But for all.maker.proteins.fasta, the completeness is only ~80%.
>
> 1.
Does this mean that some proteins annotated by Augustus/GeneMark/SNAP, are not included in the file all.maker.proteins.fasta? Does it because such excluded proteins do not have hits with the EST evidences? > > 2. To achieve a higher BUSCO completeness, what possible ways can be used? Including more EST evidences from other species? > > > Thank you > Qihua > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From xvazquezc at gmail.com Tue Jul 4 22:05:10 2017 From: xvazquezc at gmail.com (=?UTF-8?Q?Xabier_V=C3=A1zquez=2DCampos?=) Date: Wed, 5 Jul 2017 14:05:10 +1000 Subject: [maker-devel] advanced repeat libraries Message-ID: Hi, I'm dealing with a fungal genome with at least 40% of repeats, so I'm trying to follow the advanced repeat construction protocol. So far, so good, but I have doubts about how to build the protein database as explained at the end of the page In summary 1. get SwissProt and RefSeq fungal proteins 2. tblastn (from 1) against EST-NCBI database and keep the matches 3. blastp the output from 2 against the transposase protein db. Remove matches but from here on I'm a bit lost... "Finally, the rice protein sequences were compared with verified transposons (such as Pack-MULEs) in the rice genome. If the protein sequence matched a transposon perfectly and was the only perfect match in the genome, the relevant protein sequence was excluded. Although elements such as Pack-MULEs contain true gene sequences, the annotation (the protein sequence in the database) often extends to non-gene sequences such as terminal inverted repeat or sub-terminal repeat, which are not true plant proteins and would cause great complications. As a result, it is essential to exclude them." Are the proteins kept at the end of the step 3 the 'protein database'? 
Could you provide a bit more detail on how to tackle this? Thank you in advance, Xabi -- Xabier V?zquez-Campos, *PhD* *Research Associate* NSW Systems Biology Initiative School of Biotechnology and Biomolecular Sciences The University of New South Wales Sydney NSW 2052 AUSTRALIA -------------- next part -------------- An HTML attachment was scrubbed... URL: From tfallon at mit.edu Thu Jul 6 06:45:20 2017 From: tfallon at mit.edu (Tim Fallon) Date: Thu, 6 Jul 2017 08:45:20 -0400 Subject: [maker-devel] Maker protein match & tandem similar genes In-Reply-To: References: <6AE52D93-3F7D-4389-BE7B-BD3033F3F316@mit.edu> <6DBD3F6E-C2DA-47AB-AF6B-DEA396789C79@gmail.com> <7164F2B4-FD8A-466F-B7CA-A1BEEF98223F@mit.edu> Message-ID: Hi Carson, This region is definitely entirely correct at the genomic nucleotide level, no missassemblies. Would you have any strong reservations about ditching the ab-initio prediction and sticking entirely with the est2genome predictions and protein2genome predictions? Right now this is what I?m thinking, as troubleshooting the ab-initio training seems like it could be a long road. All the best, -Tim > On Jun 26, 2017, at 6:00 PM, Carson Holt wrote: > > Augustus uses an HMM with scoring bonuses for evidence match. If a difference in the assembly breaks the ORF anywhere in the transcript relative to the evidence or removes high scoring transcript start/stop sequences, then Augustus will add/skip exons or trim/extend transcripts to capture what scoring bonuses it can as best it can. So wherever you see Augustus behaving weirdly, you likely have something off in the assembly (small stretch of NN?s or single basepair duplications/deletions that affect the ORF and scoring model). So what Augustus produces is the best fit gene model to hop around assembly anomalies while still producing a canonical model. > > In areas like the ones I describe above, EVM refuses to produce any model. 
So you can experiment with the EVM options in MAKER3, but what you may find is that problem regions tend to get no models with EVM. I believe using the pred_gff trick I mentioned previously may be the easiest work around. Also make sure to prefilter mRNA-seq evidence to avoid transcript joining (trinity has a jaccard_clip option which can help). Because if you are getting transcript joining in proteins, you are almost certain to get it in transcript evidence as well. > > ?Carson > >> On Jun 22, 2017, at 10:59 PM, Tim Fallon > wrote: >> >> Hi Carson, >> >> Thanks for the response! After sending my initial email, I did notice this particular issue was warned about in the Cambell et al. 2014 Maker protocols paper. Perhaps future versions of the pipeline might have a workaround or warning for this presumably common issue. At least in my case, the genome I?m annotating has large introns, and also tandem gene clusters of homologous genes, so I?ve been unable to solve this issue entirely by changing existing parameters (e.g. split_hit), though perhaps exonerate / protein2genome direct gene annotation does handle it correctly. >> >> Regarding the protein2genome only being a intermediate stage, as I?ve been working towards a final annotation, I?ve actually been mostly relying on the protein2genome direct gene annotation, as although I have a trained Augustus that is presumably getting the hints from the evidence, my main target genes have been producing subtly wrong gene-models (Augustus produced splice sit off by a handful of nucleotides, leading to unintended & unsupported amino acids in the protein). I also trained SNAP, but those predictions were worse than the Augustus predictions. >> >> Do you have any tips for using the Evidence Modeler integration of the Maker 3.0.0 beta? That seems to be the best way to have the final gene models rely more on extrinsic evidence over my mildly incorrect ab-initio predictions. 
Or perhaps PASA is more appropriate for gene-models that would strictly adhere to extrinsic de novo assembled transcript / predicted ORF evidence? >> >> All the best, >> -Tim >> >>> On Jun 23, 2017, at 12:27 AM, Carson Holt > wrote: >>> >>> The protein_match features are the direct BLASTX results. Because of how BLAST works, if you have neighboring paralogs, it can place HSPs in both. So the final hit ends up being to large. The protein2genome feature is then the result of exonerate polishing these blast alignments (this will usually remove false merging and bad exon order). The protein2genome=1 option on the other hand just tell maker that you want to try and convert the exonerate hits directly into gene models (only do this for training and not final annotation). One way to drop the BLASTX results may be to filter the GFF3 results to keep only protein2genome features, pass those into protein_gff, and then turn off protein= for the next run. This forces the blastx results to be dropped. You may want to set blast_depth parameters to something like 10 in maker_bopts.ctl before doing this to trim per locus evidence depth to 10 if you are using too much input data. >>> >>> ?Carson >>> >>> >>> >>>> On Jun 13, 2017, at 11:35 AM, Tim Fallon > wrote: >>>> >>>> Hi there, >>>> >>>> I am aligning reference proteins to an insect genome through Maker, in preparation for using the gene models from the protein alignments as evidence to train SNAP (alongside de-novo assembled RNA-Seq). I also plan on passing the protein alignments to a future Maker run as hints for SNAP / Augustus. >>>> >>>> I?ve noticed that the maker blastx "protein_match? feature, which I presume is a result of Maker trying to make the blastx HSPs contiguous to format as a reference for exonerate (this Maker run did have protein2genome turned on), tends to fuse tandem genes from the same gene family. See attached image. 
>>>> >>>> The red regions highlight two de novo assembled transcripts which I aligned manually, from two genes that are homologous. The top track is the blastx ?match_part? features, the bottom track is the blastx ?protein_match? features. You can see that the protein_match fuses the two genes, using ~1000 bp in an intervening region, that doesn?t have blastx HSP support in the blastx ?match_part? track. The trick seems to be that a single reference protein, has blastx matches on both the left and right gene. >>>> >>>> Cleary this isn?t a good gene model to train SNAP with, but would this misannotation screw up the hints passed to pretrained SNAP / Augustus? >>>> >>>> Is there anyway to prevent this protein_match fusing of adjacent similar genes from happening? For species that are closer, I?ve set the ?eval_blastx? to be a lot higher (1e-50), and in that case the genes don?t get fused (but, with that level of stringent search, it is more like an orthology search, rather than just annotating general protein similarity). I do have (rare) introns ~1000 bp, so I wouldn?t want to change the Maker ?split_hit? parameter to be too low. >>>> >>>> All the best, >>>> -Tim >>>> >>>> Timothy R. Fallon >>>> PhD candidate >>>> Laboratory of Jing-Ke Weng >>>> Department of Biology >>>> MIT >>>> >>>> tfallon at mit.edu >>>> >>>> >>>> >>>> >>>> _______________________________________________ >>>> maker-devel mailing list >>>> maker-devel at box290.bluehost.com >>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org >>> >> >> Timothy R. Fallon >> PhD candidate >> Laboratory of Jing-Ke Weng >> Department of Biology >> MIT >> >> tfallon at mit.edu > Timothy R. Fallon PhD candidate Laboratory of Jing-Ke Weng Department of Biology MIT tfallon at mit.edu -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: smime.p7s
Type: application/pkcs7-signature
Size: 1849 bytes
Desc: not available
URL:

From labovolenta at gmail.com Mon Jul 10 12:57:21 2017
From: labovolenta at gmail.com (Luiz Augusto Bovolenta)
Date: Mon, 10 Jul 2017 15:57:21 -0300
Subject: [maker-devel] Error "Assertion ((sv)->sv_flags &" failed: file "mg.c"
Message-ID:

Hi colleagues.
I recently installed the Maker using manual steps for dependencies. However, when I try to execute the maker command I receive this error:

Assertion ((sv)->sv_flags & (0x00200000|0x00400000|0x00800000)) failed: file "mg.c", line 88 at /usr/lib/perl5/site_perl/5.10.0/Sys/SigAction.pm line 145.
Compilation failed in require at ./maker line 45.
BEGIN failed--compilation aborted at ./maker line 45.

Does someone have any idea about this error?

Best regards
Luiz
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From carsonhh at gmail.com Mon Jul 10 13:10:51 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 10 Jul 2017 13:10:51 -0600
Subject: [maker-devel] Error "Assertion ((sv)->sv_flags &" failed: file "mg.c"
In-Reply-To:
References:
Message-ID:

If you are installing without MPI support, then something is wrong with your perl installation or one of the modules installed with your perl. You may want to reinstall perl, or try and reinstall modules listed in the error one at a time using CPAN (use 'force install' to force reinstall).

Modules to try (some were given by name and others by line in your error):
forks
forks::shared
Sys::SigAction

Alternatively if this is an MPI install, make sure you have added the required environmental variables (i.e. LD_PRELOAD for OpenMPI) and command line flags (i.e. -mca btl ^openib) listed in the .../maker/INSTALL file, and that you are not running an incompatible MPI flavor such as MVAPICH2 (also explained in the .../maker/INSTALL file).

--Carson

> On Jul 10, 2017, at 12:57 PM, Luiz Augusto Bovolenta wrote:
>
> Hi colleagues.
> I recently installed the Maker using manual steps for dependencies. However, when I try to execute the maker command I receive this error: > > Assertion ((sv)->sv_flags & (0x00200000|0x00400000|0x00800000)) failed: file "mg.c", line 88 at /usr/lib/perl5/site_perl/5.10.0/Sys/SigAction.pm line 145. > Compilation failed in require at ./maker line 45. > BEGIN failed--compilation aborted at ./maker line 45. > > Someone have some idea about this error? > > Best regards > Luiz > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org From carsonhh at gmail.com Mon Jul 10 13:20:15 2017 From: carsonhh at gmail.com (Carson Holt) Date: Mon, 10 Jul 2017 13:20:15 -0600 Subject: [maker-devel] Maker protein match & tandem similar genes In-Reply-To: References: <6AE52D93-3F7D-4389-BE7B-BD3033F3F316@mit.edu> <6DBD3F6E-C2DA-47AB-AF6B-DEA396789C79@gmail.com> <7164F2B4-FD8A-466F-B7CA-A1BEEF98223F@mit.edu>