From marc.hoeppner at imbim.uu.se Wed Oct 1 00:39:21 2014
From: marc.hoeppner at imbim.uu.se (Marc Höppner)
Date: Wed, 1 Oct 2014 05:39:21 +0000
Subject: [maker-devel] URGENT: Re: maker failure with example data
In-Reply-To:
References:
Message-ID:

Another possibility could be that MPICH2 wasn't built properly, no? I
remember something about enabling shared libraries during the compilation
of MPICH, without which the error below would appear.

/Marc

Marc P. Hoeppner, PhD
Team Leader
BILS Genome Annotation Platform
Department for Medical Biochemistry and Microbiology
Uppsala University, Sweden
marc.hoeppner at imbim.uu.se

On 30 Sep 2014, at 21:33, Carson Holt wrote:

> The message is warning that there are multiple instances of MAKER
> running, but no MPI communication. When you build MAKER (perl Build.PL
> step when installing MAKER), you need to specify the location of 'mpicc'
> and 'mpi.h' to build with MPI support. Otherwise you won't be able to
> link against MPICH2 shared libraries. You probably need to rerun that
> step.
>
> --Carson
>
> From: Goutham atla
> Date: Tuesday, September 30, 2014 at 10:49 AM
> To: Carson Holt
> Cc: "maker-devel at yandell-lab.org"
> Subject: URGENT: Re: maker failure with example data
>
> Hi Carson,
>
> I figured out the problem is with the RepeatMasker installation and I
> fixed it.
>
> I am running maker with MPICH2 and I get the following warning when I
> start it:
>
> STATUS: Processing and indexing input FASTA files...
> WARNING: Multiple MAKER processes have been started in the same directory.
>
> I would like to know if this is common.
>
> Regards,
> Goutham
>
> On Tue, Sep 30, 2014 at 12:02 PM, Goutham atla wrote:
>
>> Dear Carson,
>>
>> Thank you for the reply. I reinstalled BioPerl and now I am getting
>> the following error on the test data.
>>
>> ERROR: RepeatMasker failed
>> --> rank=NA, hostname=motif
>> ERROR: Failed while doing repeat masking
>> ERROR: Chunk failed at level:0, tier_type:1
>> FAILED CONTIG:contig-dpp-500-500
>>
>> On Mon, Sep 29, 2014 at 8:17 PM, Carson Holt wrote:
>>
>>> The error is caused by the BioPerl indexer returning an empty length
>>> for the indexed fasta sequence (possibly because of a corrupt index
>>> file or other reasons). You may need to reinstall BioPerl (use the
>>> CPAN version, not the BioPerl-live version), or reinstall Berkeley DB
>>> (used by the BioPerl indexer), or reinstall the Perl module DB_File
>>> via CPAN (Perl's interface to Berkeley DB). After reinstalling
>>> BioPerl, delete the mpi_blastdb directory for the MAKER run before
>>> retrying.
>>>
>>> Also verify that the /tmp directory on your system, or the directory
>>> pointed to by TMP= in the maker_opts.ctl file, is not full and that
>>> TMP= is not set to an NFS mounted location.
>>>
>>> Thanks,
>>> Carson
>>>
>>> From: Goutham atla
>>> Date: Monday, September 29, 2014 at 6:33 AM
>>> Subject: maker failure with example data
>>>
>>> Dear All,
>>>
>>> I am running maker with the demo file, i.e. dpp_contig.fasta, keeping
>>> all other parameters in the .ctl files as default. But it does not
>>> progress and shows the following message that the length of the
>>> sequence is 0. Can anybody help me?
>>>
>>> --Next Contig--
>>>
>>> MAKER WARNING: All old files will be erased before continuing
>>> #---------------------------------------------------------------------
>>> Skipping the contig because it is too short!!
>>> SeqID: contig-dpp-500-500
>>> Length: 0
>>> #---------------------------------------------------------------------
>>>
>>> Regards,
>>> Goutham
>>
>> --
>> Goutham Atla
>
> --
> Goutham Atla

_______________________________________________
maker-devel mailing list
maker-devel at box290.bluehost.com
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From mike.thon at gmail.com Wed Oct 1 08:29:13 2014
From: mike.thon at gmail.com (Michael Thon)
Date: Wed, 1 Oct 2014 15:29:13 +0200
Subject: [maker-devel] change log
Message-ID:

Hi -

Is there a change log that will show me what has changed from version
2.31.5 to 2.31.6?

Thanks

From carson.holt at genetics.utah.edu Wed Oct 1 10:23:15 2014
From: carson.holt at genetics.utah.edu (Carson Holt)
Date: Wed, 1 Oct 2014 15:23:15 +0000
Subject: [maker-devel] URGENT: Re: maker failure with example data
In-Reply-To:
References:

Message-ID:

Dear All,

Thank you. I figured out the problem is with MPICH2. I spent a while
trying to get MPICH2 working but was unsuccessful. I installed MPICH v3
and it's working fine now.

Thank you all. The old GMOD tutorials are a bit misleading now that the
new versions have come out.

On Wed, Oct 1, 2014 at 11:09 AM, Marc Höppner wrote:

> Another possibility could be that MPICH2 wasn't built properly, no? I
> remember something about enabling shared libraries during the
> compilation of MPICH, without which the error below would appear.
>
> /Marc

--
Goutham Atla

From stefan.zoller at env.ethz.ch Wed Oct 1 07:51:55 2014
From: stefan.zoller at env.ethz.ch (Stefan Zoller)
Date: Wed, 1 Oct 2014 14:51:55 +0200
Subject: [maker-devel] diff. numbers of genes on contigs vs. scaffolded genome
In-Reply-To:
References: <541BCE0A.70806@env.ethz.ch> <7A60AB257EFF2B48B1F4C814817EA0537B651ADF@mxb1.hg.genetics.utah.edu> <5421695F.5040409@env.ethz.ch>
Message-ID: <542BF8EB.7090800@env.ethz.ch>

Hi Carson

Thanks again for your help and suggestions. They are very helpful indeed!

I have now:

1) created a species specific repeat library, or actually several
versions (e.g., filtered for hits on known plant transposable elements
etc., or filtering out hits on proper plant proteins), and ran Maker
with it on a subset of the genome. Whatever version of repeat library I
use, I get +/- 5% the same number of Maker approved proteins. I get
slightly more proteins with the "best" species specific repeat library,
so I think it does make a difference, however not a big one.
Interestingly, if I turn off the repeat masking totally, I get about 20%
more Maker approved protein models. So either I am doing something
totally wrong here or the repeat masking is working quite well with the
specific repeat libraries.

2) filtered the non-overlapping ab-initio proteins with PFAM domains
according to your how-to. This works very nicely, thanks. However, I get
quite a lot of models with PFAM hits, even when stringently filtering
for e-value.
For example, in the subset of the contiged genome I usually
get around 300 Maker models. And now I have an additional 180 from the
"non-overlapping-with-PFAM-domain" set when filtering for e-value <1e-20.
For e-value < 1e-10 it would be 280, almost twice the number of
proteins. Extrapolating this to the full genome, this would be more than
32'000 proteins. This seems a bit excessive and I am not sure if I am
even supposed to use such a stringent e-value filtering. One reason for
having so many additional proteins I can think of is that augustus and
snap are predicting similar non-overlapping models for the same location
and of course they then both have a PFAM domain. I can actually see this
for some locations when I load the data in WebApollo. I can think of a
crude way to select only the "best" model for a location (while
preferably also considering the already Maker approved protein) but I
wonder if maybe there is already a solution for this in Maker?

In short, I think the repeat masking seems not to be the problem (and I
think I have put quite some effort into the repeat library creation). On
the other hand, there are a lot of "good" models in the non-overlapping
proteins that could be filtered and promoted to proper models, if only I
could make the right selection.

Maybe, based on this additional information, you could point out
additional tests, filtering approaches or analyses I could do to home in
on the "good" gene models in the non-overlapping gene models (or Maker
approved gene models in general).

Thanks again for your help!
Stefan

On 25.09.14 20:17, Carson Holt wrote:
> Sorry for the slow reply. I was trying to locate a script that might be
> useful for you.
>
> I think a species specific repeat library will be of most benefit here
> (it's surprising just how influential this step is). Also note that you
> should train SNAP and Augustus on your species and not just use
> another related species as a stand-in.
>
> With respect to PFAM domains, on some organisms you may not get a lot of
> cross species protein alignments because of divergence or assembly
> issues. This of course makes it harder to support these models with
> direct protein alignments. However you can run InterProScan over the
> non-overlapping.proteins.fasta file produced by MAKER (contains
> non-redundant rejected models). Because an HMM is used for domain
> identification, it can pick up protein domains that would not produce a
> significant BLAST alignment because of divergence. You can then add
> models with positive hits for protein domains back into your gene set.
>
> This ad hoc procedure usually can only increase gene counts by about 10%
> though for organisms where it's required. I've attached a script that
> makes generating results for these genes easier.
>
> 1. First you run InterProScan with just PFAM.
> 2. Then you take the IDs of all models that have a domain in the report
>    and create a list (1 ID per line).
> 3. Next use the fasta_tool script that comes with MAKER together with
>    the --select flag to separate just the positive hits (IDs in your
>    list) from the non-overlapping.proteins.fasta and
>    non-overlapping.transcripts.fasta files.
> 4. Use the attached script to separate just the positive hits (your ID
>    list) from the GFF3. The script will upgrade match/match_part results
>    to gene/mRNA/exon/CDS results and print them out for you.
> 5. Use the fasta_merge and gff3_merge scripts that come with MAKER to
>    merge the selected/upgraded GFF3 entries and selected FASTA entries
>    back into the original MAKER results.
>
> --Carson
>
> On 9/23/14, 6:36 AM, "Stefan Zoller" wrote:
>
>> Please forgive my ignorance, I am not entirely sure if I understand your
>> question correctly, but I will try to answer.
>> As evidence we use:
>> 1) our own transcriptome (trinity assembled RNAseq, filtering out the
>> very low expression transcripts).
>> 2) all swissprot plant proteins, and several protein sets from closely
>> related plant species downloaded from NCBI.
>> I am not sure if the ab-initio predictions without evidence have pfam
>> domains. Honestly, I would not know how to tell and how to
>> include/exclude.
>> I was assuming that we should not have too many Maker approved
>> predictions without evidence anyway, because we use "keep_preds=0".
>> The numbers of gene predictions I mentioned in my email are the
>> predictions reported by the fasta_merge/gff3_merge scripts in the
>> "*maker.proteins.fasta". There are of course many more predictions in
>> e.g., "*maker.augustus_masked.proteins.fasta" (about 68'000 in this
>> file).
>>
>> I hope I am not totally off with my answer.
>> Cheers, Stefan
>>
>> On 23.09.14 02:10, Mark Yandell wrote:
>>> Also are your numbers including the ab-initio predictions without
>>> evidence that have pfam domains?
>>>
>>> cheers,
>>>
>>> --mark
>>>
>>> Mark Yandell
>>> Professor of Human Genetics
>>> H.A. & Edna Benning Presidential Endowed Chair
>>> Co-director USTAR Center for Genetic Discovery
>>> Eccles Institute of Human Genetics
>>> University of Utah
>>> 15 North 2030 East, Room 2100
>>> Salt Lake City, UT 84112-5330
>>> ph:801-587-7707
>>>
>>> ________________________________________
>>> From: maker-devel [maker-devel-bounces at yandell-lab.org] on behalf of
>>> Carson Holt [carson.holt at genetics.utah.edu]
>>> Sent: Monday, September 22, 2014 2:17 PM
>>> To: stefan.zoller at env.ethz.ch; maker-devel at yandell-lab.org
>>> Subject: Re: [maker-devel] diff. numbers of genes on contigs vs.
>>> scaffolded genome
>>>
>>> The contiged assembly is more likely to give spurious hits and
>>> alignments. Contigs can also be harder to repeat mask. Also, gene
>>> predictors can behave slightly differently on small sequences than on
>>> longer ones. If you have fewer gene models than you expect, your first
>>> step should be to process the scaffolds with CEGMA. It will give you
>>> an estimate of the genome's "completeness". If CEGMA gives a 60%
>>> completeness value, for example, then you can expect to only recover
>>> 60% of the expected number of genes. Next you should run RepeatModeler
>>> or similar software to help generate a species specific repeat
>>> library. Under-masked repeats can make predicting genes on longer
>>> scaffolds far more difficult for ab initio predictors.
>>>
>>> --Carson
>>>
>>> On 9/19/14, 12:32 AM, "Stefan Zoller" wrote:
>>>
>>>> Hi,
>>>>
>>>> I am working on the annotation of a plant genome (about 600MB) and we
>>>> have a reasonable draft assembly, a fairly good transcriptome and
>>>> quite a few proteins from related species. We have also extensively
>>>> trained augustus and are also feeding genmark and snap predictions.
>>>>
>>>> Recently I noticed a behavior of Maker that seems fairly odd and
>>>> which I cannot explain at all. When I take the scaffolded genome
>>>> (about 23000 scaffolds) I get roughly 9'000 maker approved gene
>>>> models. Which is admittedly a bit on the low side and we have to work
>>>> on this. However, when I break up the scaffolds into contigs at
>>>> stretches of N longer than 500bp (about 60'000 contigs) I get about
>>>> 17'000 maker gene models. Now obviously 17'000 is more in the range
>>>> of what I would expect, so I am inclined to go with these. I have
>>>> looked at both annotations and the evidence in WebApollo and the
>>>> evidence alignments are identical for both runs. The approved genes
>>>> seem to be the same, except for the additional ones in the "contiged"
>>>> genome version. The additional gene models are not necessarily at the
>>>> ends of the contigs, so I think it has nothing to do with having the
>>>> stretches of Ns nearby in the scaffolded genome. Do you have any idea
>>>> why maker comes up with the additional numbers of gene models and how
>>>> I could "convince" maker to give me the same gene models for the
>>>> scaffolded assembly?
>>>>
>>>> Cheers,
>>>> Stefan
>>>>
>>>> --
>>>> Stefan Zoller, PhD
>>>> Bioinformatics
>>>> Genetic Diversity Centre
>>>> ETH Zurich CHN E55.1
>>>> Universitätsstrasse 16
>>>> 8092 Zurich
>>>> Switzerland
>>>>
>>>> Phone: +41 44 632 66 85
>>>> E-Mail: stefan.zoller at env.ethz.ch
>>>> Web: www.gdc.ethz.ch

--
Stefan Zoller, PhD
Bioinformatics
Genetic Diversity Centre
ETH Zurich CHN E55.1
Universitätsstrasse 16
8092 Zurich
Switzerland

Phone: +41 44 632 66 85
E-Mail: stefan.zoller at env.ethz.ch
Web: www.gdc.ethz.ch

From carson.holt at genetics.utah.edu Wed Oct 1 14:18:43 2014
From: carson.holt at genetics.utah.edu (Carson Holt)
Date: Wed, 1 Oct 2014 19:18:43 +0000
Subject: [maker-devel] diff. numbers of genes on contigs vs. scaffolded genome
In-Reply-To: <542C40D1.3070300@env.ethz.ch>
References: <541BCE0A.70806@env.ethz.ch> <7A60AB257EFF2B48B1F4C814817EA0537B651ADF@mxb1.hg.genetics.utah.edu> <5421695F.5040409@env.ethz.ch> <542BF8EB.7090800@env.ethz.ch> <542C40D1.3070300@env.ethz.ch>
Message-ID:

--> Should I filter them by e-value or some other parameter before
promoting them to an "approved" status? If it's the e-value, what
threshold would be preferable?

Given the lack of evidence from aligned proteins or ESTs (and the fact
that ab initio predictors overpredict so much), I don't put much stock in
the e-values. Without some form of evidence supporting them, the models
are all pretty much as likely as one another. The PFAM domain at least
provides an independent form of evidence support.
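
For step 2 of the procedure quoted above (make a list of the model IDs
that have a domain in the InterProScan report), a minimal Python sketch
might look like the following. It assumes the report is InterProScan's
tab-separated output, with the protein ID in column 1, the analysis name
in column 4 and the score (an e-value for Pfam) in column 9. The function
name and the sample IDs are made up for illustration and are not part of
MAKER. The e-value cutoff is optional and off by default, since the hits
are used here mainly as presence/absence evidence.

```python
import csv


def pfam_supported_ids(tsv_lines, max_evalue=None):
    """Return IDs (in first-seen order) of models with at least one Pfam hit.

    tsv_lines: iterable of lines from an InterProScan TSV report (assumed layout).
    max_evalue: optional cutoff; hits with a larger e-value are ignored.
    """
    ids, seen = [], set()
    for row in csv.reader(tsv_lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        # Assumed columns: row[0] = protein ID, row[3] = analysis, row[8] = e-value.
        if len(row) < 9 or row[3] != "Pfam":
            continue
        if max_evalue is not None:
            try:
                if float(row[8]) > max_evalue:
                    continue
            except ValueError:
                # Score column is not numeric (e.g. "-"); skip when filtering.
                continue
        if row[0] not in seen:
            seen.add(row[0])
            ids.append(row[0])
    return ids


if __name__ == "__main__":
    # Hypothetical report rows, for illustration only.
    sample = [
        "\t".join(["model1", "md5a", "350", "Pfam", "PF00069",
                   "Protein kinase domain", "10", "280", "1.2E-25", "T", "01-10-2014"]),
        "\t".join(["model2", "md5b", "120", "Gene3D", "G3DSA:1.10.10.10",
                   "-", "5", "90", "3.0E-9", "T", "01-10-2014"]),
        "\t".join(["model3", "md5c", "210", "Pfam", "PF00076",
                   "RNA recognition motif", "20", "95", "2.0E-8", "T", "01-10-2014"]),
    ]
    # One ID per line, the list format described in step 2.
    print("\n".join(pfam_supported_ids(sample)))
```

The resulting list (one ID per line) is the kind of input step 3
describes for fasta_tool; how that script is invoked is not shown here.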

One thing to note is that some genomes have low gene counts because of
assembly errors. You can get a good CEGMA score because the conserved
genes CEGMA looks at are very short compared to most genes, but then
because of assembly issues long genes don't come out well. In cases like
these you are more likely to end up with fragmented gene models relative
to the true gene models. The honeybee genome is an example. They went
from ~10,000 genes to ~15,000 on the reannotation after both improving
their repeat database and fixing certain assembly issues.

Thanks,
Carson

On 10/1/14, 11:58 AM, "Stefan Zoller" wrote:

>Thanks for the swift answer. I'll just add a few clarifications below,
>because I might have omitted some information.
>
>On 01.10.14 18:20, Carson Holt wrote:
>>> 1) created a species specific repeat library, or actually several
>>> versions (e.g., filtered for hits on known plant transposable elements
>>> etc., or filtering out hits on proper plant proteins), and ran Maker
>>> with it on a subset of the genome. Whatever version of repeat library I
>>> use, I get +/- 5% the same number of Maker approved proteins. I get
>>> slightly more proteins with the "best" species specific repeat library,
>>> so I think it does make a difference, however not a big one.
>>> Interestingly, if I turn off the repeat masking totally, I get about
>>> 20% more Maker approved protein models. So either I am doing something
>>> totally wrong here or the repeat masking is working quite well with the
>>> specific repeat libraries.
>>
>> You expect more proteins if you turn all repeat masking off because
>> transposons encode real proteins and there will be a lot of them. Some
>> plant species for example have inflated gene counts because they failed
>> to properly remove transposons during annotation, and removing these
>> false models is actually a major goal of many reannotation projects.
>> Also, because transposons can occur in the middle of a gene or in an
>> intron, not masking them can actually cause the predictor to not call
>> the surrounding genes (what you are really interested in), but rather
>> just a series of transposons. Try using RepeatModeler to build the
>> repeat dataset. It is not so much that you only want repeats from your
>> species in the dataset as that you want to add any novel repeats that
>> will not be in any existing dataset. For example, I normally run with
>> all of RepBase together with the novel repeats identified by
>> RepeatModeler. You want to find everything you can.
>
>I have used RepeatModeler and LTRharvest and MITE and have then
>filtered the combined dataset to remove "real" plant proteins that got
>in there accidentally. I am quite happy with the result. And in Maker I
>am also including the repeat libraries of other plant species. So
>I am pretty much following your advice.
>
>>> 2) filtered the non-overlapping ab-initio proteins with PFAM domains
>>> according to your how-to. This works very nicely, thanks. However, I
>>> get quite a lot of models with PFAM hits, even when stringently
>>> filtering for e-value. For example, in the subset of the contiged
>>> genome I usually get around 300 Maker models. And now I have an
>>> additional 180 from the "non-overlapping-with-PFAM-domain" set when
>>> filtering for e-value <1e-20. For e-value < 1e-10 it would be 280,
>>> almost twice the number of proteins. Extrapolating this to the full
>>> genome, this would be more than 32'000 proteins. This seems a bit
>>> excessive and I am not sure if I am even supposed to use such a
>>> stringent e-value filtering. One reason for having so many additional
>>> proteins I can think of is that augustus and snap are predicting
>>> similar non-overlapping models for the same location and of course
>>> they then both have a PFAM domain. I can actually see this for some
>>> locations when I load the data in WebApollo. I can think of a crude
>>> way to select only the "best" model for a location (while preferably
>>> also considering the already Maker approved protein) but I wonder if
>>> maybe there is already a solution for this in Maker?
>>
>> The non-overlapping ab-initio proteins are already non-redundant. They
>> will not overlap each other or any of the genes already called by
>> MAKER. Also make sure you have identified novel repeats for your
>> species or these models will be full of transposons, which WILL have
>> PFAM domains. Just reading the names of identified domains lets you
>> know if it's a repeat related protein. Also you must have your gene
>> predictors trained on your species. You cannot use a related species as
>> your model if trying to add genes via PFAM domain content. This is
>> because you will get fragmented gene models from the predictors if you
>> are using a related species, and since there is no overlapping evidence
>> alignment to help correct for this (these are the unsupported models
>> after all), you will be adding very poor models back in.
>
>OK, I was not aware of these models not overlapping each other. I must
>have looked at the wrong models in WebApollo then. The old Apollo was so
>much easier to set up...
>I had a look at the names in the InterProScan output and less than 5% of
>all the models with domains have a name which is clearly repeat-related
>(e.g., PPR repeat, or G-beta repeat).
>I have also spent a lot of time on training Augustus and SNAP on our
>species. Especially the Augustus predictions look rather good I think.
>So here too I am following your advice rather closely. And I must say I
>am VERY grateful for the extensive help and advice you offer, because,
>being almost a one-man-show, it would not be possible for me to do all
>this work without it.
>
>In the end the "mystery" of having different numbers of models in the
>scaffolded vs. contiged genome is partially solved or at least explained.
>One thing that you could maybe give a quick answer to: I will go ahead
>and select some of the non-overlapping ab-initio proteins with PFAM
>domains. Should I filter them by e-value or some other parameter before
>promoting them to an "approved" status? If it's the e-value, what
>threshold would be preferable?
>
>Thanks again!
>Stefan
>
>> Thanks,
>> Carson

From adf at ncgr.org Thu Oct 2 14:28:16 2014
From: adf at ncgr.org (Andrew Farmer)
Date: Thu, 02 Oct 2014 13:28:16 -0600
Subject: [maker-devel] question regarding MAKER determination of CDS boundaries
Message-ID: <542DA750.4@ncgr.org>

Hi all-

several months ago, our group used MAKER-P (version 2.30) to annotate
some draft genome assemblies, and we have since been working a bit more
closely on evaluating the predicted gene models in an effort to get them
ready for public release. One of the things that we recently noticed during this process is that a considerable proportion (~10%) of the predicted peptides do not begin with start codons. Initially, my guess was that this was simply due to assembly gaps causing truncations (and this may be a partial explanation), but I was surprised to see many of them with 5' UTRs reported: about half of the proteins beginning without a start codon report a 5' UTR of length 0, while the rest have 5' UTR lengths ranging from a few bp to several kb.

Having dug a little deeper into the supporting evidence for one example, one plausible explanation seems to be that the choice of CDS start has been influenced by an outlier in the protein alignments (i.e., one protein whose alignment start extends a little further upstream than all of the others, which ). Before I spend more time trying to reverse engineer the diagnosis of other examples, it seemed worth sending the list a message to see if this seems plausible, or whether there is a simpler explanation that I've overlooked. I can send more specific details on my example case if it would be helpful.

Thanks in advance for your insights/suggestions.

Andrew Farmer

--
...all concepts in which an entire process is semiotically concentrated elude definition; only that which has no history is definable.

Friedrich Nietzsche

From jmdoyle at purdue.edu Thu Oct 2 14:32:53 2014
From: jmdoyle at purdue.edu (Doyle, Jacqueline R M)
Date: Thu, 2 Oct 2014 19:32:53 +0000
Subject: [maker-devel] maker-devel Digest, Vol 77, Issue 4
In-Reply-To:
References:
Message-ID: <6443E29C5ACAAD449704DD0385BAF754127FAF59@WPVEXCMBX03.purdue.lcl>

Hi Carson!

If you have them readily available, what are the citations for the honeybee genome manuscripts you referenced (below)? I imagine one is the 2006 Nature paper and the other is something more recent?

Thanks!
Jackie -----Original Message----- From: maker-devel [mailto:maker-devel-bounces at yandell-lab.org] On Behalf Of maker-devel-request at yandell-lab.org Sent: Thursday, October 2, 2014 2:00 PM To: maker-devel at yandell-lab.org Subject: maker-devel Digest, Vol 77, Issue 4 Send maker-devel mailing list submissions to maker-devel at yandell-lab.org To subscribe or unsubscribe via the World Wide Web, visit http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org or, via email, send a message with subject or body 'help' to maker-devel-request at yandell-lab.org You can reach the person managing the list at maker-devel-owner at yandell-lab.org When replying, please edit your Subject line so it is more specific than "Re: Contents of maker-devel digest..." Today's Topics: 1. Re: diff. numbers of geneson contigs vs. scaffolded genome (Carson Holt) ---------------------------------------------------------------------- Message: 1 Date: Wed, 1 Oct 2014 19:18:43 +0000 From: Carson Holt To: "stefan.zoller at env.ethz.ch" , "maker-devel at yandell-lab.org" , Mark Yandell Subject: Re: [maker-devel] diff. numbers of geneson contigs vs. scaffolded genome Message-ID: Content-Type: text/plain; charset="utf-8" --> Should I filter them by e-value or some other parameter before promoting them to an "approved" status? If it's the e-value, what threshold would be preferable? Given the lack of evidence from aligned proteins or ESTs (and the fact that ab initio predictors over predict so much), I don't put much stock in the e-values. Without some form of evidence supporting them, they are all pretty much just as likely as any other. The PFAM domain at least provides an independent form of evidence support. One thing to note is that some genomes have low gene counts because of assembly errors. You can get a good CEGMA score because the conserved genes CEGMA looks at are very very short compared to most genes, but then because of assembly issues long genes don't appear well. 
In cases like these you are more likely to end up with fragmented gene models relative to true gene model. The honeybee genome is an example. They went from ~10,000 genes to ~15,000 on the reannotation after improving both their repeat database and fixing certain assembly issues. Thanks, Carson On 10/1/14, 11:58 AM, "Stefan Zoller" wrote: >Thanks for the swift answer. I just add a few clarifications below, >because I might have omitted some information. >On 01.10.14 18:20, Carson Holt wrote: >>> 1) created a species specific repeat library, or actually several >>>versions (e.g., filtered for hits on known plant transposable >>>elements etc., or filtering out hits on proper plant proteins), and >>>ran Maker with it on a subset of the genome. Whatever version of >>>repeat library I use, I get +/- 5% the same number of Maker approved >>>proteins. I get slightly more proteins with the "best" species >>>specific repeat library, so I think it does make a difference, however not a big one. >>> Interestingly, if I turn off the repeat masking totally, I get about >>>20% more Maker approved protein models. So either I am doing >>>something totally wrong here or the repeat masking is working quite >>>well with the specific repeat libraries. >> You expect more proteins if you turn all repeat masking off because >>transposons encode real proteins and there will be a lot of them. Some >>plant species for example have inflated gene counts because they >>failed to properly remove transposons during annotation, and removing >>these false >> models is actually a major goal of many reannotation projects. Also >> because transposons can occur in the middle of a gene or in an >>intron, not masking them can actually cause the predictor to not call >>the surrounding genes (what you are really interested in), but rather >>you just a series of transposons. Try using RepeatModeler to build >>the repeat dataset. 
It is not so much that you only want repeats >>from your species in the dataset so much as it is adding any novel >>repeats that will not be in any dataset. >> For example, I normally run will all of RepBase together with the >>novel repeats identified by RepeatModeler. You want to find >>everything you can. >I have used RepeatModeler and LTRharvester and MITE and have then >filtered the combined dataset to remove "real" plant proteins that got >in there accidentally. I am quite happy with the result. And in Maker I >am also using including the repeat libraries of other plant species. So >I am pretty much following your advice. >>> 2) filtered the non-overlapping ab-initio proteins with PFAM domains >>>according to your how-to. This works very nicely, thanks. However, I >>>get quite a lot of models with PFAM hits, even when stringently >>>filtering for e-value. For example, in the subset of the contiged >>>genome I usually get around 300 Maker models. And now I have an >>>additional 180 from the "non-overlapping-with-PFAM-domain" when >>>filtering for e-value <1e-20. >>> For e-value < 1e-10 it would be 280, almost twice the number of >>>proteins. Extrapolating this to the full genome, this would be more >>>than >>> 32'000 proteins. This seems a bit excessive and I am not sure if I >>>am even supposed to use such a stringent e-value filtering. One >>>reason of having so many additional proteins I can think of, is that >>>augustus and snap are predicting similar non-overlapping models for >>>the same location and of course they then both have a PFAM domain. I >>>can actually see this for some locations when I load the data in >>>WebApollo. I can think of a crude way to select only the "best" >>>model for a location (while preferably also considering the already >>>Maker approved protein) but I wonder if maybe there is already a >>>solution for this in Maker? >> The non-overlapping ab-initio proteins are already non-redundant. 
>>They will not overlap each other or any of the genes already called by MAKER. >> Also make sure you have identified novel repeats for your species or >>these models will be full of transposons which WILL have PFAM >>domains. Just reading the names of identified domains lets you know >>if it's a repeat related protein. Also you must have your gene >>predictors trained on your species. You cannot use a related species >>as your model if trying to add genes via PFAM domain content. This >>is because you will get fragmented gene models from the predictors if >>you are using a related species, and since there is no overlapping >>evidence alignment to help correct for this (these are the >>unsupported models after all), then you will be adding very poor >>models back in. >OK, I was not aware of these models not overlapping each other. I must >have looked at the wrong models in WebApollo then. The old Apollo was >so much easier to set up... >I had a look at the names in the interproscan output and less than 5% >of all the models with domains have a name which is clearly >repeat-related (e.g., PPR repeat, or G-beta repeat). >I have also spent a lot of time on training Augustus and SNAP on our >species. Especially the Augustus predictions look rather good I think. >So also here I am following rather closely your advice. And I must say >I am VERY grateful for the extensive help and advice you offer, >because, being almost a one-man-show, it would not be possible for me >to do all this work without it. > >In the end the "mystery" of having different numbers of models in the >scaffolded vs. contiged genome is partially solved or at least explained. >One thing that you could maybe give a quick answer: I will go ahead and >select some of the non-overlapping ab-initio proteins with PFAM domains. >Should I filter them by e-value or some other parameter before >promoting them to an "approved" status? If it's the e-value, what >threshold would be preferable? > >Thanks again! 
>Stefan > >> >> Thanks, >> Carson >> >> >> >> >> >> >> >> >> >> >>> In short, I think the repeat masking seems not to be the problem >>>(And I think I have put quite some effort in the repeat library >>>creation). On the other hand, there are a lot of "good" models in >>>the non-overlapping proteins that could be filtered and promoted to >>>proper models, if I only could make the right selection. >>> >>> Maybe, based on these additional informations you could point out >>>additional tests, filtering approaches or analyses I could do to >>>home-in to the "good" gene models in the non-overlapping gene models >>>(or Maker approved gene models in general). >>> >>> Thanks again for your help! >>> Stefan >>> >>> >>> >>> On 25.09.14 20:17, Carson Holt wrote: >>>> Sorry for the slow reply. I was trying to locate a script that >>>>might be useful for you. >>>> >>>> I think a species specific repeat libary will be of most benefit >>>>here (it's surprising just how influential this step is). Also >>>>note that you should train SNAP and Augustus on your species and >>>>are not just using another related species as a stand in. >>>> >>>> With respect to PFAM domains, on some organisms you may not get a >>>>lot of cross species protein alignments because of divergence or >>>>assembly issues. >>>> This of course makes it harder to support these models with direct >>>>protein alignments. However you can run InterProscan over the >>>>non-overlapping.proteins.fasta file produced by MAKER (contains >>>>non-redundant rejected models). Because an HMM is used for domain >>>>identification, it can pick up protein domains that would not >>>>produce a significant BLAST alignment because of divergence. You >>>>can then add models with positive hits for protein domains back >>>>into your gene set. >>>> >>>> This ad hoc procedure usually can only increase gene counts by >>>>about 10% though for organisms where it's required. 
>>>> I've attached a script that makes generating results for these genes easier.
>>>>
>>>> 1. First, run InterProScan with just PFAM.
>>>> 2. Then take the IDs of all models that have a domain in the report and create a list (1 ID per line).
>>>> 3. Next, use the fasta_tool script that comes with MAKER together with the --select flag to separate just the positive hits (IDs in your list) from the non-overlapping.proteins.fasta and non-overlapping.transcripts.fasta files.
>>>> 4. Use the attached script to separate just the positive hits (your ID list) from the GFF3. The script will upgrade match/match_part results to gene/mRNA/exon/CDS results and print them out for you.
>>>> 5. Use the fasta_merge and gff3_merge scripts that come with MAKER to merge the selected/upgraded GFF3 entries and selected FASTA entries back into the original MAKER results.
>>>>
>>>> --Carson
>>>>
>>>> On 9/23/14, 6:36 AM, "Stefan Zoller" wrote:
>>>>
>>>>> Please forgive my ignorance, I am not entirely sure if I understand your question correctly, but I will try to answer.
>>>>> As evidence we use:
>>>>> 1) our own transcriptome (Trinity-assembled RNA-seq, filtering out the very low expression transcripts).
>>>>> 2) all Swiss-Prot plant proteins, and several protein sets from closely related plant species downloaded from NCBI.
>>>>> I am not sure if the ab initio predictions without evidence have PFAM domains. Honestly, I would not know how to tell or how to include/exclude them.
>>>>> I was assuming that we should not have too many Maker approved predictions without evidence anyway, because we use "keep_preds=0".
>>>>> The numbers of gene predictions I mentioned in my email are the predictions reported by the fasta_merge/gff3_merge scripts in the "*maker.proteins.fasta".
There are of course many more predictions >>>>>in e.g., "*maker.augustus_masked.proteins.fasta" (about 68'000 in >>>>>this file). >>>>> >>>>> I hope I am not totally off with my answer. >>>>> Cheers, Stefan >>>>> >>>>> >>>>> >>>>> On 23.09.14 02:10, Mark Yandell wrote: >>>>>> Also are you numbers including the ab-inito predictions without >>>>>> evidence that have pfamm domains? >>>>>> >>>>>> cheers, >>>>>> >>>>>> >>>>>> --mark >>>>>> >>>>>> >>>>>> >>>>>> Mark Yandell >>>>>> Professor of Human Genetics >>>>>> H.A. & Edna Benning Presidential Endowed Chair Co-director USTAR >>>>>> Center for Genetic Discovery Eccles Institute of Human Genetics >>>>>> University of Utah >>>>>> 15 North 2030 East, Room 2100 >>>>>> Salt Lake City, UT 84112-5330 >>>>>> ph:801-587-7707 >>>>>> >>>>>> ________________________________________ >>>>>> From: maker-devel [maker-devel-bounces at yandell-lab.org] on behalf >>>>>> of Carson Holt [carson.holt at genetics.utah.edu] >>>>>> Sent: Monday, September 22, 2014 2:17 PM >>>>>> To: stefan.zoller at env.ethz.ch; maker-devel at yandell-lab.org >>>>>> Subject: Re: [maker-devel] diff. numbers of geneson contigs vs. >>>>>> scaffolded genome >>>>>> >>>>>> The contiged assembly is more likely to give spurious hits and >>>>>>alignments. >>>>>> They also can be harder to repeat mask. Also gene predictors >>>>>>can behave slightly different on small sequences than on longer >>>>>>ones. If you have fewer gene models than you expect, your first >>>>>>step should be to process the scaffolds with CEGMA. It will >>>>>>give you an estimate of the genomes "completeness". If CEGMA >>>>>>gives a 60% completeness value for example then you can expect >>>>>>to only recover 60% of the expected number of genes. >>>>>> Next >>>>>> you should run RepeatModeler of similar software to help generate >>>>>>a species specific repeat library. Under masked repeats can make >>>>>>predicting genes on longer scaffolds far more difficult for ab >>>>>>initio predictors. 
>>>>>> >>>>>> --Carson >>>>>> >>>>>> >>>>>> On 9/19/14, 12:32 AM, "Stefan Zoller" >>>>>> wrote: >>>>>> >>>>>>> Hi, >>>>>>> >>>>>>> I am working on the annotation of a plant genome (about 600MB) >>>>>>>and we have a reasonable draft assembly, a fairly good >>>>>>>transcriptome and quite a few proteins from related species. We >>>>>>>have also extensively trained augustus and are also feeding >>>>>>>genmark and snap predictions. >>>>>>> >>>>>>> Recently I noticed a behavior of Maker that seems fairly odd and >>>>>>>which I cannot explain at all. When I take the scaffolded >>>>>>>genome (about >>>>>>>23000 >>>>>>> scaffolds) I get roughly 9'000 maker approved gene models. Which >>>>>>>is admittedly a bit on the low side and we have to work on this. >>>>>>> However, >>>>>>> when I break up the scaffolds into contigs at stretches of N >>>>>>>longer 500bp (about 60'000 contigs) I get about 17'000 maker gene models. >>>>>>> Now >>>>>>> obviously 17'000 is more in the range what I would expect, so I >>>>>>>am inclined to go with these. I have looked at both annotations >>>>>>>and the evidence in WebApollo and the evidence alignments are >>>>>>>identical for both runs. The approved genes seem to be the >>>>>>>same, except for the additional ones in the "contiged" genome >>>>>>>version. The additional gene models are not necessarily at the >>>>>>>ends of the contigs, so I think it has nothing to do with >>>>>>>having the stretches of Ns nearby in the scaffolded genome. >>>>>>> Do >>>>>>> you have any idea why maker comes up with the additional numbers >>>>>>>of gene models and how I could "convince" maker to give me the >>>>>>>same gene models for the scaffolded assembly? 
>>>>>>> >>>>>>> Cheers, >>>>>>> Stefan >>>>>>> >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> Stefan Zoller, PhD >>>>>>> Bioinformatics >>>>>>> Genetic Diversity Centre >>>>>>> ETH Zurich CHN E55.1 >>>>>>> Universit?tsstrasse 16 >>>>>>> 8092 Zurich >>>>>>> Switzerland >>>>>>> >>>>>>> Phone: +41 44 632 66 85 >>>>>>> E-Mail: stefan.zoller at env.ethz.ch >>>>>>> Web: www.gdc.ethz.ch >>>>>>> >>>>>>> >>>>>> _______________________________________________ >>>>>> maker-devel mailing list >>>>>> maker-devel at box290.bluehost.com >>>>>> >>>>>> >>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-la >>>>>>b.o >>>>>>rg >>> -- >>> Stefan Zoller, PhD >>> Bioinformatics >>> Genetic Diversity Centre >>> ETH Zurich CHN E55.1 >>> Universit?tsstrasse 16 >>> 8092 Zurich >>> Switzerland >>> >>> Phone: +41 44 632 66 85 >>> E-Mail: stefan.zoller at env.ethz.ch >>> Web: www.gdc.ethz.ch >>> > >-- >Stefan Zoller, PhD >Bioinformatics >Genetic Diversity Centre >ETH Zurich CHN E55.1 >Universit?tsstrasse 16 >8092 Zurich >Switzerland > >Phone: +41 44 632 66 85 >E-Mail: stefan.zoller at env.ethz.ch >Web: www.gdc.ethz.ch > ------------------------------ Subject: Digest Footer _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.com http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org ------------------------------ End of maker-devel Digest, Vol 77, Issue 4 ****************************************** From carsonhh at gmail.com Thu Oct 2 14:52:44 2014 From: carsonhh at gmail.com (Carson Holt) Date: Thu, 02 Oct 2014 13:52:44 -0600 Subject: [maker-devel] question regarding MAKER determination of CDS boundaries In-Reply-To: <542DA750.4@ncgr.org> References: <542DA750.4@ncgr.org> Message-ID: There can be three sources of non-M starting transcripts. 1 Partial models that do not have a start. 2. The ab initio gene predictors themselves can pick an alternate non-canonical start (this is rare). 3. 
The default BioPerl codon table has alternate start codons, and these can return 'true' when you test whether a codon is a start codon before adding UTR or if you used the always_complete=1 option (if you get non-canonical starts with UTR, then this is the most likely source).

Current versions of MAKER (2.31+) export a 'strict' canonical codon table to BioPerl (overriding the default table with alternate starts). This forces start locations identified on extended transcripts to be only 'M'. If you have an unusual number of alternate starts from a previous version of MAKER run with the always_complete=1 option, you can rerun your annotations on a current version of MAKER, or just pass in your previous transcripts via GFF3 to have it recalculate the ORF.

--Carson

On 10/2/14, 1:28 PM, "Andrew Farmer" wrote:

>Hi all-
>several months ago, our group used MAKER-P (version 2.30) to annotate some draft genome assemblies, and have since been working a bit more closely evaluating the predicted gene models in an effort to get them ready for public release. One of the things that we recently noticed during this process is that a considerable proportion (~10%) of the predicted peptides do not begin with start codons. Initially, my guess was that this was simply due to assembly gaps causing truncations (and this may be a partial explanation), but I was surprised to see many of them with 5' UTRs reported: about half of the proteins beginning without a start codon report a 5' UTR of length 0, while the rest have 5' UTR lengths ranging from a few bp to several kb.
>
>Having dug a little deeper into the supporting evidence for one example, one plausible explanation seems to be that the choice of CDS start has been influenced by an outlier in the protein alignments (i.e., one protein whose alignment start extends a little further upstream than all of the
>others, which ).
>Before I spend more time trying to reverse engineer the diagnosis of other examples, it seemed worth sending the list a message to see if this seems plausible, or whether there is a simpler explanation that I've overlooked. I can send more specific details on my example case if it would be helpful.
>
>Thanks in advance for your insights/suggestions.
>
>Andrew Farmer
>
>--
>...all concepts in which an entire process is semiotically concentrated elude definition; only that which has no history is definable.
>
>Friedrich Nietzsche

From adf at ncgr.org Thu Oct 2 15:06:27 2014
From: adf at ncgr.org (Andrew Farmer)
Date: Thu, 02 Oct 2014 14:06:27 -0600
Subject: [maker-devel] question regarding MAKER determination of CDS boundaries
In-Reply-To:
References: <542DA750.4@ncgr.org>
Message-ID: <542DB043.1080007@ncgr.org>

Thanks Carson-
indeed, I was using always_complete=1, thinking that would be most appropriate for the current state of the genome assemblies. I see now that this question has come up a few times before on the list; sorry not to have thought to search through the list archives before posting the query yet again. But thanks for the additional suggestion on the recalculation approach, that sounds straightforward.

regards
Andrew

On 10/2/14 1:52 PM, Carson Holt wrote:
> There can be three sources of non-M starting transcripts.
>
> 1. Partial models that do not have a start.
> 2. The ab initio gene predictors themselves can pick an alternate non-canonical start (this is rare).
> 3. The default BioPerl codon table has alternate start codons, and these can return 'true' when you test whether a codon is a start codon before adding UTR or if you used the always_complete=1 option (if you get non-canonical starts with UTR, then this is the most likely source).
>
> Current versions of MAKER (2.31+) export a 'strict' canonical codon table to BioPerl (overriding the default table with alternate starts). This forces start locations identified on extended transcripts to be only 'M'. If you have an unusual number of alternate starts from a previous version of MAKER run with the always_complete=1 option, you can rerun your annotations on a current version of MAKER, or just pass in your previous transcripts via GFF3 to have it recalculate the ORF.
>
> --Carson
>
> On 10/2/14, 1:28 PM, "Andrew Farmer" wrote:
>
>> Hi all-
>> several months ago, our group used MAKER-P (version 2.30) to annotate some draft genome assemblies, and have since been working a bit more closely evaluating the predicted gene models in an effort to get them ready for public release. One of the things that we recently noticed during this process is that a considerable proportion (~10%) of the predicted peptides do not begin with start codons. Initially, my guess was that this was simply due to assembly gaps causing truncations (and this may be a partial explanation), but I was surprised to see many of them with 5' UTRs reported: about half of the proteins beginning without a start codon report a 5' UTR of length 0, while the rest have 5' UTR lengths ranging from a few bp to several kb.
>>
>> Having dug a little deeper into the supporting evidence for one example, one plausible explanation seems to be that the choice of CDS start has been influenced by an outlier in the protein alignments (i.e., one protein whose alignment start extends a little further upstream than all of the others, which ). Before I spend more time trying to reverse engineer the diagnosis of other examples, it seemed worth sending the list a message to see if this seems plausible, or whether there is a simpler explanation that I've overlooked.
I can send more specific >> details on my example case if it would be helpful. >> >> thanks in advance for your insights/suggestions >> >> Andrew Farmer >> >> -- >> ...all concepts in which an entire process is semiotically concentrated >> elude definition; only that which has no history is definable. >> >> Friedrich Nietzsche >> >> >> _______________________________________________ >> maker-devel mailing list >> maker-devel at box290.bluehost.com >> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > -- ...all concepts in which an entire process is semiotically concentrated elude definition; only that which has no history is definable. Friedrich Nietzsche From Timothy.Stitt at tgac.ac.uk Sat Oct 4 10:14:45 2014 From: Timothy.Stitt at tgac.ac.uk (Timothy Stitt (TGAC)) Date: Sat, 4 Oct 2014 15:14:45 +0000 Subject: [maker-devel] Maker Bio::Root Error Message-ID: Dear Maker Developers, One of my Maker users is observing the following error when running maker on our systems: ------------- EXCEPTION: Bio::Root::BadParameter ------------- MSG: ' 9.1' is not a valid score VALUE: 9.1 STACK: Error::throw STACK: Bio::Root::Root::throw /tgac/software/testing/perl_activeperl/5.18.2.1802/x86_64/site/lib/Bio/Root/Root.pm:449 STACK: Bio::SeqFeature::Generic::score /tgac/software/testing/perl_activeperl/5.18.2.1802/x86_64/site/lib/Bio/SeqFeature/Generic.pm:468 STACK: GFFDB::_ary_to_features /tgac/software/testing/maker/2.31.6/x86_64/bin/../lib/GFFDB.pm:891 STACK: GFFDB::phathits_on_chunk /tgac/software/testing/maker/2.31.6/x86_64/bin/../lib/GFFDB.pm:534 STACK: Process::MpiChunk::_go /tgac/software/testing/maker/2.31.6/x86_64/bin/../lib/Process/MpiChunk.pm:756 STACK: Process::MpiChunk::run /tgac/software/testing/maker/2.31.6/x86_64/bin/../lib/Process/MpiChunk.pm:341 STACK: Process::MpiChunk::run_all /tgac/software/testing/maker/2.31.6/x86_64/bin/../lib/Process/MpiChunk.pm:357 STACK: Process::MpiTiers::run_all 
/tgac/software/testing/maker/2.31.6/x86_64/bin/../lib/Process/MpiTiers.pm:287
STACK: Process::MpiTiers::run_all /tgac/software/testing/maker/2.31.6/x86_64/bin/../lib/Process/MpiTiers.pm:287
STACK: /tgac/software/testing/bin/core/../..//maker/2.31.6/x86_64/bin/maker:686
--------------------------------------------------------------
--> rank=NA, hostname=UV00000010-P002
ERROR: Failed while doing repeat masking

When the user runs with the '-RM_off' option, everything is fine, but the run fails with the above error when that option is not applied. I was just wondering if anyone had any insight into what might be causing this?

Regards,

Tim.

---
Timothy Stitt PhD / Head of Scientific Computing
The Genome Analysis Centre (TGAC)
http://www.tgac.ac.uk/
p: +44 1603 450378
e: timothy.stitt at tgac.ac.uk

From carsonhh at gmail.com Sun Oct 5 17:58:09 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Sun, 05 Oct 2014 16:58:09 -0600
Subject: [maker-devel] Maker Bio::Root Error
Message-ID:

The error occurs where MAKER reads a user-provided GFF3 file; BioPerl is reporting that one of the values is invalid. Judging by the single quotes around the value (' 9.1'), the score appears to contain contaminating whitespace. There may be other problems with the GFF3 file as well. I could take a look if you want.
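
As a rough way to locate the offending lines before sending the file along, a short Python sketch like the one below could scan a GFF3 file for score values (column 6) that are neither '.' nor a clean number. The function name `find_bad_scores` and the number pattern are illustrative assumptions, not part of MAKER or BioPerl, and BioPerl's own validity check may be stricter or looser than this one.

```python
import re

# Assumed notion of a "clean" score: an optionally signed decimal or
# scientific-notation number with no surrounding whitespace.
SCORE_RE = re.compile(r"^[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?$")


def find_bad_scores(lines):
    """Return (line_number, raw_score) pairs for GFF3 feature lines whose
    score column is neither '.' nor a clean number."""
    bad = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # an embedded sequence section follows; no more features
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        columns = line.split("\t")
        if len(columns) < 9:
            continue  # not a 9-column feature line; out of scope here
        score = columns[5]
        if score != "." and not SCORE_RE.match(score):
            bad.append((number, score))
    return bad
```

Passing an open file handle, for example `find_bad_scores(open("repeats.gff3"))` (a hypothetical file name), gives the line numbers to inspect. A value such as ' 9.1' would be reported because of the leading space.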

Thanks,
Carson

From: "Timothy Stitt (TGAC)"
Date: Saturday, October 4, 2014 at 9:14 AM
To: "maker-devel at yandell-lab.org"
Subject: [maker-devel] Maker Bio::Root Error

Dear Maker Developers,

One of my Maker users is observing the following error when running maker on our systems:

------------- EXCEPTION: Bio::Root::BadParameter -------------
MSG: ' 9.1' is not a valid score
VALUE: 9.1
STACK: Error::throw
STACK: Bio::Root::Root::throw /tgac/software/testing/perl_activeperl/5.18.2.1802/x86_64/site/lib/Bio/Root/Root.pm:449
STACK: Bio::SeqFeature::Generic::score /tgac/software/testing/perl_activeperl/5.18.2.1802/x86_64/site/lib/Bio/SeqFeature/Generic.pm:468
STACK: GFFDB::_ary_to_features /tgac/software/testing/maker/2.31.6/x86_64/bin/../lib/GFFDB.pm:891
STACK: GFFDB::phathits_on_chunk /tgac/software/testing/maker/2.31.6/x86_64/bin/../lib/GFFDB.pm:534
STACK: Process::MpiChunk::_go /tgac/software/testing/maker/2.31.6/x86_64/bin/../lib/Process/MpiChunk.pm:756
STACK: Process::MpiChunk::run /tgac/software/testing/maker/2.31.6/x86_64/bin/../lib/Process/MpiChunk.pm:341
STACK: Process::MpiChunk::run_all /tgac/software/testing/maker/2.31.6/x86_64/bin/../lib/Process/MpiChunk.pm:357
STACK: Process::MpiTiers::run_all /tgac/software/testing/maker/2.31.6/x86_64/bin/../lib/Process/MpiTiers.pm:287
STACK: Process::MpiTiers::run_all /tgac/software/testing/maker/2.31.6/x86_64/bin/../lib/Process/MpiTiers.pm:287
STACK: /tgac/software/testing/bin/core/../..//maker/2.31.6/x86_64/bin/maker:686
--------------------------------------------------------------
--> rank=NA, hostname=UV00000010-P002
ERROR: Failed while doing repeat masking

When the user runs with the '-RM_off' option, everything is fine but fails with the above error when not applying that option. I was just wondering if anyone had any insight into what might be causing this?

Regards,

Tim.
--- Timothy Stitt PhD / Head of Scientific Computing The Genome Analysis Centre (TGAC) http://www.tgac.ac.uk/ p: +44 1603 450378 e: timothy.stitt at tgac.ac.uk _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.com http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From sjackman at gmail.com Mon Oct 6 18:28:36 2014 From: sjackman at gmail.com (Shaun Jackman) Date: Mon, 6 Oct 2014 16:28:36 -0700 Subject: [maker-devel] tbl2asn errors In-Reply-To: <0D54878997A4B9478F03938D61DB51D4266C1E@001FSN2MPN1-015.001f.mgd2.msft.net> References: <0D54878997A4B9478F03938D61DB51D4266B6B@001FSN2MPN1-015.001f.mgd2.msft.net> <0D54878997A4B9478F03938D61DB51D4266C1E@001FSN2MPN1-015.001f.mgd2.msft.net> Message-ID: Hi, Scott, Carson. What's currently the best/easiest way to convert a MAKER GFF to GenBank TBL format, and what's the state of your GAG tool, Scott? Cheers, Shaun *http://sjackman.ca * On 17 April 2014 15:37, Geib, Scott wrote: > Just so not to be discouraged, current version has limited functionality > and is pretty much un-documented (although will write a .tbl file). Will > email the list when first real release is complete and documented. > > Scott > > > > > > > > *From:* Carson Holt [mailto:carsonhh at gmail.com] > *Sent:* Thursday, April 17, 2014 11:28 AM > *To:* Geib, Scott; Mack, Brian; maker-devel at yandell-lab.org; Brian Hall ( > bhall7 at hawaii.edu) > > *Subject:* Re: [maker-devel] tbl2asn errors > > > > Very cool. I'll try it out as well. 
> --Carson
>
> *From:* "Geib, Scott"
> *Date:* Thursday, April 17, 2014 at 2:59 PM
> *To:* "Mack, Brian", "maker-devel at yandell-lab.org", "Brian Hall (bhall7 at hawaii.edu)"
> *Subject:* Re: [maker-devel] tbl2asn errors
>
> Hi Brian,
>
> We have a tool in development to deal with this. You should not upload your MAKER output directly to NCBI; you need to filter out genes, check that things are sane, etc.
>
> http://brianreallymany.github.io/GAG/
>
> It is still in active development; the first full release is planned for the end of this month (if you can wait 1.5 weeks). It has no dependencies and maintains parent/child relationships (for example, if you remove a gene, it will also remove the associated CDS/mRNA). In a release planned for the end of the month, you will be able to perform functions like removing short features, long features, flagging things for review, etc. It also generates an updated genome.fasta file, gff3 file, and sequence files for CDS/mRNA/peptide based on the edits made. Hopefully this is helpful to you.
>
> Scott
>
> ---------- Forwarded message ----------
> From: Mack, Brian
> Date: Thu, Apr 17, 2014 at 10:34 AM
> Subject: [maker-devel] tbl2asn errors
> To: " "
>
> Hi, I thought I would try asking my question here, as NCBI was not able to give me much assistance. In preparation for submitting to NCBI, I converted my MAKER gff3 to NCBI tbl format using the gff32tbl script that Carson posted a link to in this thread (http://gmod.827538.n3.nabble.com/NCBI-feature-table-tt4040473.html#a4040475). It seemed to convert fine; however, when I use NCBI's tbl2asn program I get numerous errors in my errorsummary.val file:
>
>     4 ERROR:   SEQ_FEAT.BadTrailingCharacter
>   217 ERROR:   SEQ_FEAT.NoStop
>   438 ERROR:   SEQ_FEAT.ShortIntron
>   171 ERROR:   SEQ_FEAT.StartCodon
>   171 ERROR:   SEQ_INST.BadProteinStart
>   291 WARNING: SEQ_FEAT.NotSpliceConsensusAcceptor
>   648 WARNING: SEQ_FEAT.NotSpliceConsensusDonor
>   118 WARNING: SEQ_FEAT.ShortExon
>
> In addition, all of the gene, CDS, and mRNA coordinates in the resulting sqn files are decreased by one. For example, my tbl file will have gene coordinates of 440869 - 441931, but the sqn file will have 440868 - 441930. Any ideas what might be causing this?
>
> Thanks,
> Brian
>
> This electronic message contains information generated by the USDA solely for the intended recipients. Any unauthorized interception of this message or the use or disclosure of the information it contains may violate the law and subject the violator to civil or criminal penalties. If you believe you have received this message in error, please notify the sender and delete the email immediately.
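A likely explanation for the one-base shift, offered as an assumption rather than something confirmed in this thread: the feature table uses 1-based coordinates, while the ASN.1 written to a .sqn file stores sequence locations 0-based, so every position is printed one lower without the feature actually moving. A minimal Perl sketch of that conversion (the function name `tbl_to_asn` is made up for illustration and is not part of tbl2asn or MAKER):

```perl
use strict;
use warnings;

# Hypothetical helper: convert a 1-based, inclusive interval (as written in
# a .tbl feature table) to the 0-based, inclusive form seen in ASN.1 output.
sub tbl_to_asn {
    my ($start, $end) = @_;
    return ($start - 1, $end - 1);
}

# The gene coordinates quoted in the message above.
my ($from, $to) = tbl_to_asn(440869, 441931);
print "$from - $to\n";   # prints "440868 - 441930"
```

If that is what is happening here, the coordinates in the two files describe the same bases and nothing needs to be fixed for this particular symptom.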
From bmoore at genetics.utah.edu Mon Oct 6 22:53:18 2014 From: bmoore at genetics.utah.edu (Barry Moore) Date: Tue, 7 Oct 2014 03:53:18 +0000 Subject: [maker-devel] tbl2asn errors In-Reply-To: <0D54878997A4B9478F03938D61DB51D4266B6B@001FSN2MPN1-015.001f.mgd2.msft.net> References: <0D54878997A4B9478F03938D61DB51D4266B6B@001FSN2MPN1-015.001f.mgd2.msft.net> Message-ID: Hi Scott, Just FYI, github is giving me a 404 error on the link below. Were others able to follow the link successfully? B Barry Moore ------------------------------------------------- Director, Research & Science USTAR Center for Genetic Discovery Dept. of Human Genetics University of Utah Salt Lake City, UT T: (801) 858-9476 C: (801) 243-8819 On Apr 17, 2014, at 2:59 PM, Geib, Scott wrote: Hi Brian, We have a tool to deal with this in development, you should not directly upload your maker output to NCBI, you need to filter out genes, check that things are sane, etc. http://brianreallymany.github.io/GAG/
From bhall7 at hawaii.edu Mon Oct 6 23:02:48 2014 From: bhall7 at hawaii.edu (Brian Hall) Date: Mon, 06 Oct 2014 18:02:48 -1000 Subject: [maker-devel] tbl2asn errors In-Reply-To: References: <0D54878997A4B9478F03938D61DB51D4266B6B@001FSN2MPN1-015.001f.mgd2.msft.net> Message-ID: <543365E8.3020809@hawaii.edu> Hi Barry, Try this one: http://genomeannotation.github.io/GAG/ Sorry about that! --Brian Hall
From bmoore at genetics.utah.edu Mon Oct 6 23:06:37 2014 From: bmoore at genetics.utah.edu (Barry Moore) Date: Tue, 7 Oct 2014 04:06:37 +0000 Subject: [maker-devel] tbl2asn errors In-Reply-To: <543365E8.3020809@hawaii.edu> References: <0D54878997A4B9478F03938D61DB51D4266B6B@001FSN2MPN1-015.001f.mgd2.msft.net> <543365E8.3020809@hawaii.edu> Message-ID: <977667CA-A7F0-492C-A6E5-E0B6CD712D0C@genetics.utah.edu> Cool, thanks Brian, B Barry Moore ------------------------------------------------- Director, Research & Science USTAR Center for Genetic Discovery Dept. of Human Genetics University of Utah Salt Lake City, UT T: (801) 858-9476 C: (801) 243-8819
From carsonhh at gmail.com Tue Oct 7 11:17:12 2014 From: carsonhh at gmail.com (Carson Holt) Date: Tue, 07 Oct 2014 10:17:12 -0600 Subject: [maker-devel] Maker Bio::Root Error In-Reply-To: References: Message-ID: His file is not formatted correctly. Values should be tab-delimited, but in several cases he has leading space characters contaminating the values. He needs to find and remove the contaminating white space. Here is the GFF3 specification just for reference --> http://www.sequenceontology.org/gff3.shtml. Here is an example perl script that could do this (cut and paste it into a file if you want) -->

#!/usr/bin/perl
use strict;
use warnings;

my $file = shift;
open(IN, "< $file") or die "Cannot open $file: $!\n";
while(my $line = <IN>){
    chomp($line);
    my @F = split(/\t/, $line);
    s/^\s+|\s+$//g foreach @F;   # strip leading/trailing whitespace from each column
    print join("\t", @F)."\n";
}
close(IN);

Then run it as follows --> perl fixgff3_script.pl old_file.gff > new_file.gff Thanks, Carson From: "Timothy Stitt (TGAC)" Date: Tuesday, October 7, 2014 at 1:34 AM To: Carson Holt Subject: Re: [maker-devel] Maker Bio::Root Error Hi Carson, I spoke with the user and it does seem they have some confusion over what is a well-formed GFF3 file (they mention there is no example template for them to copy on the MAKER website).
I am attaching the user's GFF3 file. Could you have a quick scan to determine if they are using an incorrect format? Any advice greatly received. Thanks, Tim. --- Timothy Stitt PhD / Head of Scientific Computing The Genome Analysis Centre (TGAC) http://www.tgac.ac.uk/ p: +44 1603 450378 e: timothy.stitt at tgac.ac.uk
From Timothy.Stitt at tgac.ac.uk Tue Oct 7 13:31:42 2014 From: Timothy.Stitt at tgac.ac.uk (Timothy Stitt (TGAC)) Date: Tue, 7 Oct 2014 18:31:42 +0000 Subject: [maker-devel] Maker Bio::Root Error In-Reply-To: References: Message-ID: Thanks Carson. Much appreciated. I've passed on your recommendations to the user and I'll let you know the outcome. Tim. --- Timothy Stitt PhD / Head of Scientific Computing The Genome Analysis Centre (TGAC) http://www.tgac.ac.uk/ p: +44 1603 450378 e: timothy.stitt at tgac.ac.uk From: Carson Holt Date: Tuesday, 7 October 2014 17:17 To: Timothy Stitt Cc: "maker-devel at yandell-lab.org" Subject: Re: [maker-devel] Maker Bio::Root Error His file is not formatted correctly. Values should be tab delimited, but in several cases he has leading space characters contaminating the values. He needs to find and remove the contaminating white space.
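To see which lines are affected before rewriting anything, a small check along the following lines might help. This is only a sketch and not part of MAKER or BioPerl: the helper name `padded_columns` is made up for illustration, and it assumes a tab-delimited GFF3 file in which comment lines start with '#' and any sequence section starts with '##FASTA'.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of MAKER or BioPerl): returns the 1-based
# column numbers of a tab-delimited line whose values start or end with
# whitespace, e.g. a score column holding ' 9.1'.
sub padded_columns {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;              # drop the line ending
    my @F = split(/\t/, $line, -1);    # -1 keeps trailing empty columns
    return grep { $F[$_ - 1] =~ /^\s|\s$/ } 1 .. scalar(@F);
}

# Report offending feature lines for each file named on the command line.
foreach my $file (@ARGV) {
    open(my $in, '<', $file) or die "Cannot open $file: $!\n";
    while (my $line = <$in>) {
        last if $line =~ /^##FASTA/;               # sequence section follows
        next if $line =~ /^#/ || $line !~ /\S/;    # skip comments and blank lines
        my @bad = padded_columns($line);
        print "$file line $.: padded column(s) @bad\n" if @bad;
    }
    close($in);
}
```

Saved under a name of your choosing (for example check_gff3_ws.pl, also just an illustration), it would be run as perl check_gff3_ws.pl old_file.gff and prints nothing for a file whose columns are clean.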
From michael.s.campbell1 at gmail.com Thu Oct 2 14:44:17 2014 From: michael.s.campbell1 at gmail.com (Michael Campbell) Date: Thu, 2 Oct 2014 13:44:17 -0600 Subject: [maker-devel] maker-devel Digest, Vol 77, Issue 4 In-Reply-To: <6443E29C5ACAAD449704DD0385BAF754127FAF59@WPVEXCMBX03.purdue.lcl> References: <6443E29C5ACAAD449704DD0385BAF754127FAF59@WPVEXCMBX03.purdue.lcl> Message-ID: Hi Jackie, I happened to have this on handy. It was published in BMC genomics early this year. Finding the missing honey bee genes: lessons learned from a genome upgrade. Elsik CG, Worley KC, Bennett AK, Beye M, Camara F, Childers CP, de Graaf DC, Debyser G, Deng J, Devreese B, Elhaik E, Evans JD, Foster LJ, Graur D, Guigo R; HGSC production teams, Hoff KJ, Holder ME, Hudson ME, Hunt GJ, Jiang H, Joshi V, Khetani RS, Kosarev P, Kovar CL, Ma J, Maleszka R, Moritz RF, Munoz-Torres MC, Murphy TD, Muzny DM, Newsham IF, Reese JT, Robertson HM, Robinson GE, Rueppell O, Solovyev V, Stanke M, Stolle E, Tsuruda JM, Vaerenbergh MV, Waterhouse RM, Weaver DB, Whitfield CW, Wu Y, Zdobnov EM, Zhang L, Zhu D, Gibbs RA; Honey Bee Genome Sequencing Consortium. BMC Genomics. 2014 Jan 30;15:86. doi: 10.1186/1471-2164-15-86. Take care, Mike On Thu, Oct 2, 2014 at 1:32 PM, Doyle, Jacqueline R M wrote: > Hi Carson! If you have them readily available, what are the citations for > the honeybee genome manuscripts you referenced (below)? I imagine one is > the 2006 Nature paper and one something more recent? > > Thanks!
Jackie > > -----Original Message----- > From: maker-devel [mailto:maker-devel-bounces at yandell-lab.org] On Behalf > Of maker-devel-request at yandell-lab.org > Sent: Thursday, October 2, 2014 2:00 PM > To: maker-devel at yandell-lab.org > Subject: maker-devel Digest, Vol 77, Issue 4 > > Send maker-devel mailing list submissions to > maker-devel at yandell-lab.org > > To subscribe or unsubscribe via the World Wide Web, visit > > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > or, via email, send a message with subject or body 'help' to > maker-devel-request at yandell-lab.org > > You can reach the person managing the list at > maker-devel-owner at yandell-lab.org > > When replying, please edit your Subject line so it is more specific than > "Re: Contents of maker-devel digest..." > > > Today's Topics: > > 1. Re: diff. numbers of geneson contigs vs. scaffolded genome > (Carson Holt) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Wed, 1 Oct 2014 19:18:43 +0000 > From: Carson Holt > To: "stefan.zoller at env.ethz.ch" , > "maker-devel at yandell-lab.org" , Mark > Yandell > Subject: Re: [maker-devel] diff. numbers of geneson contigs vs. > scaffolded genome > Message-ID: > Content-Type: text/plain; charset="utf-8" > > --> Should I filter them by e-value or some other parameter before > promoting them to an "approved" status? If it's the e-value, what > threshold would be preferable? > > > > Given the lack of evidence from aligned proteins or ESTs (and the fact > that ab initio predictors over predict so much), I don't put much stock in > the e-values. Without some form of evidence supporting them, they are all > pretty much just as likely as any other. The PFAM domain at least provides > an independent form of evidence support. > > One thing to note is that some genomes have low gene counts because of > assembly errors. 
You can get a good CEGMA score because the conserved > genes CEGMA looks at are very very short compared to most genes, but then > because of assembly issues long genes don't appear well. In cases like > these you are more likely to end up with fragmented gene models relative to > true gene model. > > The honeybee genome is an example. They went from ~10,000 genes to > ~15,000 on the reannotation after improving both their repeat database and > fixing certain assembly issues. > > Thanks, > Carson > > > > On 10/1/14, 11:58 AM, "Stefan Zoller" wrote: > > >Thanks for the swift answer. I just add a few clarifications below, > >because I might have omitted some information. > >On 01.10.14 18:20, Carson Holt wrote: > >>> 1) created a species specific repeat library, or actually several > >>>versions (e.g., filtered for hits on known plant transposable > >>>elements etc., or filtering out hits on proper plant proteins), and > >>>ran Maker with it on a subset of the genome. Whatever version of > >>>repeat library I use, I get +/- 5% the same number of Maker approved > >>>proteins. I get slightly more proteins with the "best" species > >>>specific repeat library, so I think it does make a difference, however > not a big one. > >>> Interestingly, if I turn off the repeat masking totally, I get about > >>>20% more Maker approved protein models. So either I am doing > >>>something totally wrong here or the repeat masking is working quite > >>>well with the specific repeat libraries. > >> You expect more proteins if you turn all repeat masking off because > >>transposons encode real proteins and there will be a lot of them. Some > >>plant species for example have inflated gene counts because they > >>failed to properly remove transposons during annotation, and removing > >>these false > >> models is actually a major goal of many reannotation projects. 
Also > >> because transposons can occur in the middle of a gene or in an > >>intron, not masking them can actually cause the predictor to not call > >>the surrounding genes (what you are really interested in), but rather > >>you just a series of transposons. Try using RepeatModeler to build > >>the repeat dataset. It is not so much that you only want repeats > >>from your species in the dataset so much as it is adding any novel > >>repeats that will not be in any dataset. > >> For example, I normally run will all of RepBase together with the > >>novel repeats identified by RepeatModeler. You want to find > >>everything you can. > >I have used RepeatModeler and LTRharvester and MITE and have then > >filtered the combined dataset to remove "real" plant proteins that got > >in there accidentally. I am quite happy with the result. And in Maker I > >am also using including the repeat libraries of other plant species. So > >I am pretty much following your advice. > >>> 2) filtered the non-overlapping ab-initio proteins with PFAM domains > >>>according to your how-to. This works very nicely, thanks. However, I > >>>get quite a lot of models with PFAM hits, even when stringently > >>>filtering for e-value. For example, in the subset of the contiged > >>>genome I usually get around 300 Maker models. And now I have an > >>>additional 180 from the "non-overlapping-with-PFAM-domain" when > >>>filtering for e-value <1e-20. > >>> For e-value < 1e-10 it would be 280, almost twice the number of > >>>proteins. Extrapolating this to the full genome, this would be more > >>>than > >>> 32'000 proteins. This seems a bit excessive and I am not sure if I > >>>am even supposed to use such a stringent e-value filtering. One > >>>reason of having so many additional proteins I can think of, is that > >>>augustus and snap are predicting similar non-overlapping models for > >>>the same location and of course they then both have a PFAM domain. 
> >>> I can actually see this for some locations when I load the data in
> >>> WebApollo. I can think of a crude way to select only the "best"
> >>> model for a location (while preferably also considering the already
> >>> Maker approved protein) but I wonder if maybe there is already a
> >>> solution for this in Maker?
> >> The non-overlapping ab-initio proteins are already non-redundant.
> >> They will not overlap each other or any of the genes already called
> >> by MAKER.
> >> Also make sure you have identified novel repeats for your species or
> >> these models will be full of transposons which WILL have PFAM
> >> domains. Just reading the names of identified domains lets you know
> >> if it's a repeat related protein. Also you must have your gene
> >> predictors trained on your species. You cannot use a related species
> >> as your model if trying to add genes via PFAM domain content. This
> >> is because you will get fragmented gene models from the predictors if
> >> you are using a related species, and since there is no overlapping
> >> evidence alignment to help correct for this (these are the
> >> unsupported models after all), then you will be adding very poor
> >> models back in.
> > OK, I was not aware of these models not overlapping each other. I must
> > have looked at the wrong models in WebApollo then. The old Apollo was
> > so much easier to set up...
> > I had a look at the names in the interproscan output and less than 5%
> > of all the models with domains have a name which is clearly
> > repeat-related (e.g., PPR repeat, or G-beta repeat).
> > I have also spent a lot of time on training Augustus and SNAP on our
> > species. Especially the Augustus predictions look rather good I think.
> > So also here I am following rather closely your advice. And I must say
> > I am VERY grateful for the extensive help and advice you offer,
> > because, being almost a one-man-show, it would not be possible for me
> > to do all this work without it.
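
The PFAM-based selection discussed in this exchange (keep models with a
domain hit below an e-value cutoff, set aside models whose domain names
look transposon-related, and write one ID per line) could be sketched in
Python roughly as follows. This is only a sketch under assumptions and is
not part of MAKER: it assumes the standard tab-separated InterProScan 5
report (protein ID in column 1, analysis name in column 4, signature
description in column 6, e-value in column 9); the function name
select_pfam_ids, the keyword list, and the sample rows are hypothetical,
and the cutoffs are the ones mentioned in the thread, not a recommendation.

```python
import io

# 0-based column positions, assuming the InterProScan 5 TSV layout.
COL_ID, COL_ANALYSIS, COL_DESC, COL_EVALUE = 0, 3, 5, 8

# Hypothetical keywords for spotting transposon-related domain names;
# review and adjust this list for your own species.
REPEAT_KEYWORDS = ("transposase", "reverse transcriptase",
                   "integrase", "retrotransposon")


def select_pfam_ids(tsv_handle, max_evalue=1e-10, keywords=REPEAT_KEYWORDS):
    """Return sorted model IDs with a Pfam hit below max_evalue.

    A model is dropped entirely if any of its qualifying hits has a
    description containing one of the repeat keywords.
    """
    keep, flagged = set(), set()
    for line in tsv_handle:
        row = line.rstrip("\n").split("\t")
        if len(row) <= COL_EVALUE or row[COL_ANALYSIS] != "Pfam":
            continue
        try:
            evalue = float(row[COL_EVALUE])
        except ValueError:
            continue  # some rows carry "-" instead of a numeric score
        if evalue >= max_evalue:
            continue
        if any(k in row[COL_DESC].lower() for k in keywords):
            flagged.add(row[COL_ID])
        else:
            keep.add(row[COL_ID])
    return sorted(keep - flagged)


# Made-up example rows, for illustration only.
sample = (
    "model_a\tmd5\t300\tPfam\tPF00069\tProtein kinase domain"
    "\t10\t250\t1.2E-30\tT\t01-10-2014\n"
    "model_b\tmd5\t410\tPfam\tPF00078\tReverse transcriptase"
    " (RNA-dependent DNA polymerase)\t5\t200\t3.0E-25\tT\t01-10-2014\n"
    "model_c\tmd5\t120\tPfam\tPF00400\tWD domain, G-beta repeat"
    "\t20\t60\t2.0E-05\tT\t01-10-2014\n"
)

# One ID per line, as an ID list for a later selection step.
for model_id in select_pfam_ids(io.StringIO(sample), max_evalue=1e-20):
    print(model_id)
```

With the sample rows above, only model_a is printed at the 1e-20 cutoff:
model_b is set aside because of its domain name and model_c misses the
cutoff. Dropping a model as soon as one hit looks repeat-related is a
deliberately conservative choice; a manual look at the flagged names is
still advisable.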
> >
> > In the end the "mystery" of having different numbers of models in the
> > scaffolded vs. contiged genome is partially solved or at least explained.
> > One thing on which you could maybe give a quick answer: I will go ahead
> > and select some of the non-overlapping ab-initio proteins with PFAM
> > domains. Should I filter them by e-value or some other parameter before
> > promoting them to an "approved" status? If it's the e-value, what
> > threshold would be preferable?
> >
> > Thanks again!
> > Stefan
> >
> >>
> >> Thanks,
> >> Carson
> >>
> >>> In short, I think the repeat masking seems not to be the problem
> >>> (and I think I have put quite some effort into the repeat library
> >>> creation). On the other hand, there are a lot of "good" models in
> >>> the non-overlapping proteins that could be filtered and promoted to
> >>> proper models, if only I could make the right selection.
> >>>
> >>> Maybe, based on this additional information, you could point out
> >>> additional tests, filtering approaches or analyses I could do to
> >>> home in on the "good" gene models in the non-overlapping gene models
> >>> (or Maker approved gene models in general).
> >>>
> >>> Thanks again for your help!
> >>> Stefan
> >>>
> >>> On 25.09.14 20:17, Carson Holt wrote:
> >>>> Sorry for the slow reply. I was trying to locate a script that
> >>>> might be useful for you.
> >>>>
> >>>> I think a species specific repeat library will be of most benefit
> >>>> here (it's surprising just how influential this step is). Also
> >>>> note that you should train SNAP and Augustus on your species and
> >>>> not just use another related species as a stand-in.
> >>>>
> >>>> With respect to PFAM domains, on some organisms you may not get a
> >>>> lot of cross species protein alignments because of divergence or
> >>>> assembly issues.
> >>>> This of course makes it harder to support these models with direct
> >>>> protein alignments.
> >>>> However you can run InterProScan over the
> >>>> non-overlapping.proteins.fasta file produced by MAKER (contains
> >>>> non-redundant rejected models). Because an HMM is used for domain
> >>>> identification, it can pick up protein domains that would not
> >>>> produce a significant BLAST alignment because of divergence. You
> >>>> can then add models with positive hits for protein domains back
> >>>> into your gene set.
> >>>>
> >>>> This ad hoc procedure usually can only increase gene counts by
> >>>> about 10% though for organisms where it's required. I've attached a
> >>>> script that makes generating results for these genes easier.
> >>>>
> >>>> 1. First you run InterProScan with just PFAM.
> >>>> 2. Then you take the IDs of all models that have a domain in the
> >>>>    report and create a list (1 ID per line).
> >>>> 3. Next use the fasta_tool script that comes with MAKER together
> >>>>    with the --select flag to separate just the positive hits (IDs in
> >>>>    your list) from the non-overlapping.proteins.fasta and
> >>>>    non-overlapping.transcripts.fasta files.
> >>>> 4. Use the attached script to separate just the positive hits (your
> >>>>    ID list) from the GFF3. The script will upgrade match/match_part
> >>>>    results to gene/mRNA/exon/CDS results and print them out for you.
> >>>> 5. Use the fasta_merge and gff3_merge scripts that come with MAKER
> >>>>    to merge the selected/upgraded GFF3 entries and selected FASTA
> >>>>    entries back into the original MAKER results.
> >>>>
> >>>> --Carson
> >>>>
> >>>> On 9/23/14, 6:36 AM, "Stefan Zoller" wrote:
> >>>>
> >>>>> Please forgive my ignorance, I am not entirely sure if I
> >>>>> understand your question correctly, but I will try to answer.
> >>>>> As evidence we use:
> >>>>> 1) our own transcriptome (Trinity assembled RNAseq, filtering out
> >>>>> the very low expression transcripts).
> >>>>> 2) all Swiss-Prot plant proteins, and several protein sets from
> >>>>> closely related plant species downloaded from NCBI.
> >>>>> I am not sure if the ab-initio predictions without evidence have
> >>>>> PFAM domains. Honestly, I would not know how to tell and how to
> >>>>> include/exclude.
> >>>>> I was assuming that we should not have too many Maker approved
> >>>>> predictions without evidence anyway, because we use "keep_preds=0".
> >>>>> The numbers of gene predictions I mentioned in my email are the
> >>>>> predictions reported by the fasta_merge/gff3_merge scripts in the
> >>>>> "*maker.proteins.fasta". There are of course many more predictions
> >>>>> in e.g., "*maker.augustus_masked.proteins.fasta" (about 68'000 in
> >>>>> this file).
> >>>>>
> >>>>> I hope I am not totally off with my answer.
> >>>>> Cheers, Stefan
> >>>>>
> >>>>> On 23.09.14 02:10, Mark Yandell wrote:
> >>>>>> Also, are your numbers including the ab initio predictions without
> >>>>>> evidence that have PFAM domains?
> >>>>>>
> >>>>>> cheers,
> >>>>>>
> >>>>>> --mark
> >>>>>>
> >>>>>> Mark Yandell
> >>>>>> Professor of Human Genetics
> >>>>>> H.A. & Edna Benning Presidential Endowed Chair
> >>>>>> Co-director USTAR Center for Genetic Discovery
> >>>>>> Eccles Institute of Human Genetics
> >>>>>> University of Utah
> >>>>>> 15 North 2030 East, Room 2100
> >>>>>> Salt Lake City, UT 84112-5330
> >>>>>> ph:801-587-7707
> >>>>>>
> >>>>>> ________________________________________
> >>>>>> From: maker-devel [maker-devel-bounces at yandell-lab.org] on behalf
> >>>>>> of Carson Holt [carson.holt at genetics.utah.edu]
> >>>>>> Sent: Monday, September 22, 2014 2:17 PM
> >>>>>> To: stefan.zoller at env.ethz.ch; maker-devel at yandell-lab.org
> >>>>>> Subject: Re: [maker-devel] diff. numbers of genes on contigs vs.
> >>>>>> scaffolded genome
> >>>>>>
> >>>>>> The contiged assembly is more likely to give spurious hits and
> >>>>>> alignments.
> >>>>>> Contigs can also be harder to repeat mask. Also gene predictors
> >>>>>> can behave slightly differently on small sequences than on longer
> >>>>>> ones. If you have fewer gene models than you expect, your first
> >>>>>> step should be to process the scaffolds with CEGMA. It will
> >>>>>> give you an estimate of the genome's "completeness". If CEGMA
> >>>>>> gives a 60% completeness value for example then you can expect
> >>>>>> to only recover 60% of the expected number of genes.
> >>>>>> Next you should run RepeatModeler or similar software to help
> >>>>>> generate a species specific repeat library. Under-masked repeats
> >>>>>> can make predicting genes on longer scaffolds far more difficult
> >>>>>> for ab initio predictors.
> >>>>>>
> >>>>>> --Carson
> >>>>>>
> >>>>>> On 9/19/14, 12:32 AM, "Stefan Zoller" wrote:
> >>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> I am working on the annotation of a plant genome (about 600MB)
> >>>>>>> and we have a reasonable draft assembly, a fairly good
> >>>>>>> transcriptome and quite a few proteins from related species. We
> >>>>>>> have also extensively trained Augustus and are also feeding in
> >>>>>>> GeneMark and SNAP predictions.
> >>>>>>>
> >>>>>>> Recently I noticed a behavior of Maker that seems fairly odd and
> >>>>>>> which I cannot explain at all. When I take the scaffolded
> >>>>>>> genome (about 23000 scaffolds) I get roughly 9'000 Maker approved
> >>>>>>> gene models, which is admittedly a bit on the low side and we
> >>>>>>> have to work on this. However, when I break up the scaffolds into
> >>>>>>> contigs at stretches of N longer than 500bp (about 60'000
> >>>>>>> contigs) I get about 17'000 Maker gene models. Now obviously
> >>>>>>> 17'000 is more in the range of what I would expect, so I
> >>>>>>> am inclined to go with these. I have looked at both annotations
> >>>>>>> and the evidence in WebApollo and the evidence alignments are
> >>>>>>> identical for both runs.
> >>>>>>> The approved genes seem to be the
> >>>>>>> same, except for the additional ones in the "contiged" genome
> >>>>>>> version. The additional gene models are not necessarily at the
> >>>>>>> ends of the contigs, so I think it has nothing to do with
> >>>>>>> having the stretches of Ns nearby in the scaffolded genome.
> >>>>>>> Do you have any idea why Maker comes up with the additional
> >>>>>>> gene models and how I could "convince" Maker to give me the
> >>>>>>> same gene models for the scaffolded assembly?
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>> Stefan
> >>>>>>>
> >>>>>>> --
> >>>>>>> Stefan Zoller, PhD
> >>>>>>> Bioinformatics
> >>>>>>> Genetic Diversity Centre
> >>>>>>> ETH Zurich CHN E55.1
> >>>>>>> Universitätsstrasse 16
> >>>>>>> 8092 Zurich
> >>>>>>> Switzerland
> >>>>>>>
> >>>>>>> Phone: +41 44 632 66 85
> >>>>>>> E-Mail: stefan.zoller at env.ethz.ch
> >>>>>>> Web: www.gdc.ethz.ch
> >>>>>>>
> >>>>>> _______________________________________________
> >>>>>> maker-devel mailing list
> >>>>>> maker-devel at box290.bluehost.com
> >>>>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>
> ------------------------------
>
> Subject: Digest Footer
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>
> ------------------------------
>
> End of maker-devel Digest, Vol 77, Issue 4
> ******************************************
>

--
Michael Campbell MS, RD.
Doctoral Candidate
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330
ph:585-3543

From Scott.Geib at ARS.USDA.GOV Mon Oct 6 18:36:17 2014
From: Scott.Geib at ARS.USDA.GOV (Geib, Scott)
Date: Mon, 6 Oct 2014 23:36:17 +0000
Subject: [maker-devel] tbl2asn errors
In-Reply-To:
References: <0D54878997A4B9478F03938D61DB51D4266B6B@001FSN2MPN1-015.001f.mgd2.msft.net> <0D54878997A4B9478F03938D61DB51D4266C1E@001FSN2MPN1-015.001f.mgd2.msft.net>
Message-ID: <0D54878997A4B9478F03938D61DB51D43046E2@001FSN2MPN1-016.001f.mgd2.msft.net>

Hi,

I know Carson had posted a script to generate a tbl file before. If you
want to do more filtering, GAG should work. If you come across any
issues, please post a bug on the GitHub page.

http://genomeannotation.github.io

Also, NCBI is a bit of a moving target regarding which format they
currently accept. You should be able to supply a scaffold assembly, but
they will have limits on how short your CDS can be, will question
single-exon features, etc. Hopefully GAG can help you get to where they
are happy. If they want a contig + agp file, you will also need to split
your GFF file (we can do this, but I am not sure it is posted on the
GitHub page).
Scott

From: Shaun Jackman [mailto:sjackman at gmail.com]
Sent: Monday, October 06, 2014 1:29 PM
To: Geib, Scott
Cc: Carson Holt; Mack, Brian; maker-devel at yandell-lab.org; Brian Hall (bhall7 at hawaii.edu)
Subject: Re: [maker-devel] tbl2asn errors

Hi, Scott, Carson.

What's currently the best/easiest way to convert a MAKER GFF to GenBank
TBL format, and what's the state of your GAG tool, Scott?

Cheers,
Shaun
http://sjackman.ca

On 17 April 2014 15:37, Geib, Scott wrote:

Just so you are not discouraged: the current version has limited
functionality and is pretty much undocumented (although it will write a
.tbl file). I will email the list when the first real release is
complete and documented.

Scott

From: Carson Holt [mailto:carsonhh at gmail.com]
Sent: Thursday, April 17, 2014 11:28 AM
To: Geib, Scott; Mack, Brian; maker-devel at yandell-lab.org; Brian Hall (bhall7 at hawaii.edu)
Subject: Re: [maker-devel] tbl2asn errors

Very cool. I'll try it out as well.

--Carson

From: "Geib, Scott"
Date: Thursday, April 17, 2014 at 2:59 PM
To: "Mack, Brian", "maker-devel at yandell-lab.org", "Brian Hall (bhall7 at hawaii.edu)"
Subject: Re: [maker-devel] tbl2asn errors

Hi Brian,

We have a tool in development to deal with this. You should not upload
your MAKER output to NCBI directly; you need to filter out genes, check
that things are sane, etc.

http://brianreallymany.github.io/GAG/

It is still in active development; the first full release is planned for
the end of this month (if you can wait 1.5 weeks). It has no
dependencies and maintains parent/child relationships (for example, if
you remove a gene, it will also remove the associated CDS/mRNA). In the
release planned for the end of the month, you will be able to perform
functions like removing short features, removing long features, flagging
things for review, etc. It also generates an updated genome.fasta file,
gff3 file, and sequence files for CDS/mRNA/peptide based on the edits
made.

Hopefully this is helpful to you.
Scott

---------- Forwarded message ----------
From: Mack, Brian
Date: Thu, Apr 17, 2014 at 10:34 AM
Subject: [maker-devel] tbl2asn errors
To:

Hi,

I thought I would try asking my question here as NCBI was not able to
give me much assistance. In preparation for submitting to NCBI, I
converted my MAKER gff3 to NCBI tbl format using the gff32tbl script
that Carson posted a link to in this thread
(http://gmod.827538.n3.nabble.com/NCBI-feature-table-tt4040473.html#a4040475).
It seemed to convert fine; however, when I use NCBI's tbl2asn program I
get numerous errors in my errorsummary.val file:

    4 ERROR:   SEQ_FEAT.BadTrailingCharacter
  217 ERROR:   SEQ_FEAT.NoStop
  438 ERROR:   SEQ_FEAT.ShortIntron
  171 ERROR:   SEQ_FEAT.StartCodon
  171 ERROR:   SEQ_INST.BadProteinStart
  291 WARNING: SEQ_FEAT.NotSpliceConsensusAcceptor
  648 WARNING: SEQ_FEAT.NotSpliceConsensusDonor
  118 WARNING: SEQ_FEAT.ShortExon

In addition, all of the gene, CDS, and mRNA coordinates in the resulting
sqn files are decreased by one. For example, my tbl file will have gene
coordinates of 440869 - 441931, but the sqn file will have
440868 - 441930. Any ideas what might be causing this?

Thanks,
Brian

This electronic message contains information generated by the USDA
solely for the intended recipients. Any unauthorized interception of
this message or the use or disclosure of the information it contains may
violate the law and subject the violator to civil or criminal penalties.
If you believe you have received this message in error, please notify
the sender and delete the email immediately.
_______________________________________________
maker-devel mailing list
maker-devel at box290.bluehost.com
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From cognitiveshrapnel at gmail.com Wed Oct 8 12:52:48 2014
From: cognitiveshrapnel at gmail.com (Justin Peyton)
Date: Wed, 8 Oct 2014 13:52:48 -0400
Subject: [maker-devel] Segmentation fault on exit
Message-ID:

Hello,

I am getting a segmentation fault on exit (below). I have read about
something similar to this involving hydra. I do not think that is the
situation here. I am running Maker 2.31.6 with OpenMPI and the
-mca btl ^openib option to work around an InfiniBand installation (even
though I have no idea what it does).

My understanding is that this error has no effect on output because it
happens after Maker is finished. Can you confirm this? Any ideas what
may be causing it?

Thank you in advance,
Justin Peyton
The Ohio State University

Maker is now finished!!!
Start_time: 1412784571 End_time: 1412784747 Elapsed: 176 [n0617:27212] *** Process received signal *** [n0617:27212] Signal: Segmentation fault (11) [n0617:27212] Signal code: Address not mapped (1) [n0617:27212] Failing at address: (nil) [n0617:27212] [ 0] /lib64/libpthread.so.0() [0x39cc40f710] [n0617:27212] [ 1] /usr/lib64/perl5/CORE/libperl.so(Perl_pp_helem+0x3bd) [0x39d0ca9ded] [n0617:27212] [ 2] /usr/lib64/perl5/CORE/libperl.so(Perl_runops_standard+0x16) [0x39d0ca4b06] [n0617:27212] [ 3] /usr/lib64/perl5/CORE/libperl.so(Perl_call_sv+0x4cf) [0x39d0c4c5df] [n0617:27212] [ 4] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_clear+0xb6) [0x39d0cb8dd6] [n0617:27212] [ 5] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_free2+0x52) [0x39d0cb95d2] [n0617:27212] [ 6] /usr/lib64/perl5/CORE/libperl.so() [0x39d0cae5c1] [n0617:27212] [ 7] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_clean_objs+0x21) [0x39d0cae621] [n0617:27212] [ 8] /usr/lib64/perl5/CORE/libperl.so(perl_destruct+0x11d1) [0x39d0c4e901] [n0617:27212] [ 9] /usr/bin/perl(main+0xe1) [0x400e01] [n0617:27212] [10] /lib64/libc.so.6(__libc_start_main+0xfd) [0x39cc01ed1d] [n0617:27212] [11] /usr/bin/perl() [0x400c59] [n0617:27212] *** End of error message *** [n0617:27207] *** Process received signal *** [n0617:27207] Signal: Segmentation fault (11) [n0617:27207] Signal code: Address not mapped (1) [n0617:27207] Failing at address: (nil) [n0617:27207] [ 0] /lib64/libpthread.so.0() [0x39cc40f710] [n0617:27207] [ 1] /usr/lib64/perl5/CORE/libperl.so(Perl_pp_helem+0x3bd) [0x39d0ca9ded] [n0617:27207] [ 2] /usr/lib64/perl5/CORE/libperl.so(Perl_runops_standard+0x16) [0x39d0ca4b06] [n0617:27207] [ 3] /usr/lib64/perl5/CORE/libperl.so(Perl_call_sv+0x4cf) [0x39d0c4c5df] [n0617:27207] [ 4] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_clear+0xb6) [0x39d0cb8dd6] [n0617:27207] [ 5] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_free2+0x52) [0x39d0cb95d2] [n0617:27207] [ 6] /usr/lib64/perl5/CORE/libperl.so() [0x39d0cae5c1] [n0617:27207] [ 7] 
/usr/lib64/perl5/CORE/libperl.so(Perl_sv_clean_objs+0x21) [0x39d0cae621] [n0617:27207] [ 8] /usr/lib64/perl5/CORE/libperl.so(perl_destruct+0x11d1) [0x39d0c4e901] [n0617:27207] [ 9] /usr/bin/perl(main+0xe1) [0x400e01] [n0617:27207] [10] /lib64/libc.so.6(__libc_start_main+0xfd) [0x39cc01ed1d] [n0617:27207] [11] /usr/bin/perl() [0x400c59] [n0617:27207] *** End of error message *** [n0617:27205] *** Process received signal *** [n0617:27205] Signal: Segmentation fault (11) [n0617:27205] Signal code: Address not mapped (1) [n0617:27205] Failing at address: (nil) [n0617:27205] [ 0] /lib64/libpthread.so.0() [0x39cc40f710] [n0617:27205] [ 1] /usr/lib64/perl5/CORE/libperl.so(Perl_pp_helem+0x3bd) [0x39d0ca9ded] [n0617:27205] [ 2] /usr/lib64/perl5/CORE/libperl.so(Perl_runops_standard+0x16) [0x39d0ca4b06] [n0617:27205] [ 3] /usr/lib64/perl5/CORE/libperl.so(Perl_call_sv+0x4cf) [0x39d0c4c5df] [n0617:27205] [ 4] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_clear+0xb6) [0x39d0cb8dd6] [n0617:27205] [ 5] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_free2+0x52) [0x39d0cb95d2] [n0617:27205] [ 6] /usr/lib64/perl5/CORE/libperl.so() [0x39d0cae5c1] [n0617:27205] [ 7] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_clean_objs+0x21) [0x39d0cae621] [n0617:27205] [ 8] /usr/lib64/perl5/CORE/libperl.so(perl_destruct+0x11d1) [0x39d0c4e901] [n0617:27205] [ 9] /usr/bin/perl(main+0xe1) [0x400e01] [n0617:27205] [10] /lib64/libc.so.6(__libc_start_main+0xfd) [0x39cc01ed1d] [n0617:27205] [11] /usr/bin/perl() [0x400c59] [n0617:27205] *** End of error message *** [n0617:27209] *** Process received signal *** [n0617:27209] Signal: Segmentation fault (11) [n0617:27209] Signal code: Address not mapped (1) [n0617:27209] Failing at address: (nil) [n0617:27209] [ 0] /lib64/libpthread.so.0() [0x39cc40f710] [n0617:27209] [ 1] /usr/lib64/perl5/CORE/libperl.so(Perl_pp_helem+0x3bd) [0x39d0ca9ded] [n0617:27209] [ 2] /usr/lib64/perl5/CORE/libperl.so(Perl_runops_standard+0x16) [0x39d0ca4b06] [n0617:27209] [ 3] 
/usr/lib64/perl5/CORE/libperl.so(Perl_call_sv+0x4cf) [0x39d0c4c5df] [n0617:27209] [ 4] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_clear+0xb6) [0x39d0cb8dd6] [n0617:27209] [ 5] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_free2+0x52) [0x39d0cb95d2] [n0617:27209] [ 6] /usr/lib64/perl5/CORE/libperl.so() [0x39d0cae5c1] [n0617:27209] [ 7] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_clean_objs+0x21) [0x39d0cae621] [n0617:27209] [ 8] /usr/lib64/perl5/CORE/libperl.so(perl_destruct+0x11d1) [0x39d0c4e901] [n0617:27209] [ 9] /usr/bin/perl(main+0xe1) [0x400e01] [n0617:27209] [10] /lib64/libc.so.6(__libc_start_main+0xfd) [0x39cc01ed1d] [n0617:27209] [11] /usr/bin/perl() [0x400c59] [n0617:27209] *** End of error message *** [n0617:27215] *** Process received signal *** [n0617:27215] Signal: Segmentation fault (11) [n0617:27215] Signal code: Address not mapped (1) [n0617:27215] Failing at address: (nil) [n0617:27215] [ 0] /lib64/libpthread.so.0() [0x39cc40f710] [n0617:27215] [ 1] /usr/lib64/perl5/CORE/libperl.so(Perl_pp_helem+0x3bd) [0x39d0ca9ded] [n0617:27215] [ 2] /usr/lib64/perl5/CORE/libperl.so(Perl_runops_standard+0x16) [0x39d0ca4b06] [n0617:27215] [ 3] /usr/lib64/perl5/CORE/libperl.so(Perl_call_sv+0x4cf) [0x39d0c4c5df] [n0617:27215] [ 4] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_clear+0xb6) [0x39d0cb8dd6] [n0617:27215] [ 5] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_free2+0x52) [0x39d0cb95d2] [n0617:27215] [ 6] /usr/lib64/perl5/CORE/libperl.so() [0x39d0cae5c1] [n0617:27215] [ 7] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_clean_objs+0x21) [0x39d0cae621] [n0617:27215] [ 8] /usr/lib64/perl5/CORE/libperl.so(perl_destruct+0x11d1) [0x39d0c4e901] [n0617:27215] [ 9] /usr/bin/perl(main+0xe1) [0x400e01] [n0617:27215] [10] /lib64/libc.so.6(__libc_start_main+0xfd) [0x39cc01ed1d] [n0617:27215] [11] /usr/bin/perl() [0x400c59] [n0617:27215] *** End of error message *** [n0617:27216] *** Process received signal *** [n0617:27216] Signal: Segmentation fault (11) [n0617:27216] Signal code: Address not 
mapped (1) [n0617:27216] Failing at address: (nil) [n0617:27216] [ 0] /lib64/libpthread.so.0() [0x39cc40f710] [n0617:27216] [ 1] /usr/lib64/perl5/CORE/libperl.so(Perl_pp_helem+0x3bd) [0x39d0ca9ded] [n0617:27216] [ 2] /usr/lib64/perl5/CORE/libperl.so(Perl_runops_standard+0x16) [0x39d0ca4b06] [n0617:27216] [ 3] /usr/lib64/perl5/CORE/libperl.so(Perl_call_sv+0x4cf) [0x39d0c4c5df] [n0617:27216] [ 4] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_clear+0xb6) [0x39d0cb8dd6] [n0617:27216] [ 5] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_free2+0x52) [0x39d0cb95d2] [n0617:27216] [ 6] /usr/lib64/perl5/CORE/libperl.so() [0x39d0cae5c1] [n0617:27216] [ 7] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_clean_objs+0x21) [0x39d0cae621] [n0617:27216] [ 8] /usr/lib64/perl5/CORE/libperl.so(perl_destruct+0x11d1) [0x39d0c4e901] [n0617:27216] [ 9] /usr/bin/perl(main+0xe1) [0x400e01] [n0617:27216] [10] /lib64/libc.so.6(__libc_start_main+0xfd) [0x39cc01ed1d] [n0617:27216] [11] /usr/bin/perl() [0x400c59] [n0617:27216] *** End of error message *** [n0617:27210] *** Process received signal *** [n0617:27210] Signal: Segmentation fault (11) [n0617:27210] Signal code: Address not mapped (1) [n0617:27210] Failing at address: (nil) [n0617:27210] [ 0] /lib64/libpthread.so.0() [0x39cc40f710] [n0617:27210] [ 1] /usr/lib64/perl5/CORE/libperl.so(Perl_pp_helem+0x3bd) [0x39d0ca9ded] [n0617:27210] [ 2] /usr/lib64/perl5/CORE/libperl.so(Perl_runops_standard+0x16) [0x39d0ca4b06] [n0617:27210] [ 3] /usr/lib64/perl5/CORE/libperl.so(Perl_call_sv+0x4cf) [0x39d0c4c5df] [n0617:27210] [ 4] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_clear+0xb6) [0x39d0cb8dd6] [n0617:27210] [ 5] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_free2+0x52) [0x39d0cb95d2] [n0617:27210] [ 6] /usr/lib64/perl5/CORE/libperl.so() [0x39d0cae5c1] [n0617:27210] [ 7] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_clean_objs+0x21) [0x39d0cae621] [n0617:27210] [ 8] /usr/lib64/perl5/CORE/libperl.so(perl_destruct+0x11d1) [0x39d0c4e901] [n0617:27210] [ 9] /usr/bin/perl(main+0xe1) 
[0x400e01] [n0617:27210] [10] /lib64/libc.so.6(__libc_start_main+0xfd) [0x39cc01ed1d] [n0617:27210] [11] /usr/bin/perl() [0x400c59] [n0617:27210] *** End of error message *** [n0617:27217] *** Process received signal *** [n0617:27217] Signal: Segmentation fault (11) [n0617:27217] Signal code: Address not mapped (1) [n0617:27217] Failing at address: (nil) [n0617:27217] [ 0] /lib64/libpthread.so.0() [0x39cc40f710] [n0617:27217] [ 1] /usr/lib64/perl5/CORE/libperl.so(Perl_pp_helem+0x3bd) [0x39d0ca9ded] [n0617:27217] [ 2] /usr/lib64/perl5/CORE/libperl.so(Perl_runops_standard+0x16) [0x39d0ca4b06] [n0617:27217] [ 3] /usr/lib64/perl5/CORE/libperl.so(Perl_call_sv+0x4cf) [0x39d0c4c5df] [n0617:27217] [ 4] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_clear+0xb6) [0x39d0cb8dd6] [n0617:27217] [ 5] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_free2+0x52) [0x39d0cb95d2] [n0617:27217] [ 6] /usr/lib64/perl5/CORE/libperl.so() [0x39d0cae5c1] [n0617:27217] [ 7] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_clean_objs+0x21) [0x39d0cae621] [n0617:27217] [ 8] /usr/lib64/perl5/CORE/libperl.so(perl_destruct+0x11d1) [0x39d0c4e901] [n0617:27217] [ 9] /usr/bin/perl(main+0xe1) [0x400e01] [n0617:27217] [10] /lib64/libc.so.6(__libc_start_main+0xfd) [0x39cc01ed1d] [n0617:27217] [11] /usr/bin/perl() [0x400c59] [n0617:27217] *** End of error message *** [n0617:27213] *** Process received signal *** [n0617:27213] Signal: Segmentation fault (11) [n0617:27213] Signal code: Address not mapped (1) [n0617:27213] Failing at address: (nil) [n0617:27213] [ 0] /lib64/libpthread.so.0() [0x39cc40f710] [n0617:27213] [ 1] /usr/lib64/perl5/CORE/libperl.so(Perl_pp_helem+0x3bd) [0x39d0ca9ded] [n0617:27213] [ 2] /usr/lib64/perl5/CORE/libperl.so(Perl_runops_standard+0x16) [0x39d0ca4b06] [n0617:27213] [ 3] /usr/lib64/perl5/CORE/libperl.so(Perl_call_sv+0x4cf) [0x39d0c4c5df] [n0617:27213] [ 4] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_clear+0xb6) [0x39d0cb8dd6] [n0617:27213] [ 5] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_free2+0x52) 
[0x39d0cb95d2] [n0617:27213] [ 6] /usr/lib64/perl5/CORE/libperl.so() [0x39d0cae5c1] [n0617:27213] [ 7] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_clean_objs+0x21) [0x39d0cae621] [n0617:27213] [ 8] /usr/lib64/perl5/CORE/libperl.so(perl_destruct+0x11d1) [0x39d0c4e901] [n0617:27213] [ 9] /usr/bin/perl(main+0xe1) [0x400e01] [n0617:27213] [10] /lib64/libc.so.6(__libc_start_main+0xfd) [0x39cc01ed1d] [n0617:27213] [11] /usr/bin/perl() [0x400c59] [n0617:27213] *** End of error message *** [n0617:27208] *** Process received signal *** [n0617:27208] Signal: Segmentation fault (11) [n0617:27208] Signal code: Address not mapped (1) [n0617:27208] Failing at address: (nil) [n0617:27208] [ 0] /lib64/libpthread.so.0() [0x39cc40f710] [n0617:27208] [ 1] /usr/lib64/perl5/CORE/libperl.so(Perl_pp_helem+0x3bd) [0x39d0ca9ded] [n0617:27208] [ 2] /usr/lib64/perl5/CORE/libperl.so(Perl_runops_standard+0x16) [0x39d0ca4b06] [n0617:27208] [ 3] /usr/lib64/perl5/CORE/libperl.so(Perl_call_sv+0x4cf) [0x39d0c4c5df] [n0617:27208] [ 4] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_clear+0xb6) [0x39d0cb8dd6] [n0617:27208] [ 5] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_free2+0x52) [0x39d0cb95d2] [n0617:27208] [ 6] /usr/lib64/perl5/CORE/libperl.so() [0x39d0cae5c1] [n0617:27208] [ 7] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_clean_objs+0x21) [0x39d0cae621] [n0617:27208] [ 8] /usr/lib64/perl5/CORE/libperl.so(perl_destruct+0x11d1) [0x39d0c4e901] [n0617:27208] [ 9] /usr/bin/perl(main+0xe1) [0x400e01] [n0617:27208] [10] /lib64/libc.so.6(__libc_start_main+0xfd) [0x39cc01ed1d] [n0617:27208] [11] /usr/bin/perl() [0x400c59] [n0617:27208] *** End of error message *** [n0617:27211] *** Process received signal *** [n0617:27211] Signal: Segmentation fault (11) [n0617:27211] Signal code: Address not mapped (1) [n0617:27211] Failing at address: (nil) [n0617:27211] [ 0] /lib64/libpthread.so.0() [0x39cc40f710] [n0617:27211] [ 1] /usr/lib64/perl5/CORE/libperl.so(Perl_pp_helem+0x3bd) [0x39d0ca9ded] [n0617:27211] [ 2] 
/usr/lib64/perl5/CORE/libperl.so(Perl_runops_standard+0x16) [0x39d0ca4b06] [n0617:27211] [ 3] /usr/lib64/perl5/CORE/libperl.so(Perl_call_sv+0x4cf) [0x39d0c4c5df] [n0617:27211] [ 4] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_clear+0xb6) [0x39d0cb8dd6] [n0617:27211] [ 5] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_free2+0x52) [0x39d0cb95d2] [n0617:27211] [ 6] /usr/lib64/perl5/CORE/libperl.so() [0x39d0cae5c1] [n0617:27211] [ 7] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_clean_objs+0x21) [0x39d0cae621] [n0617:27211] [ 8] /usr/lib64/perl5/CORE/libperl.so(perl_destruct+0x11d1) [0x39d0c4e901] [n0617:27211] [ 9] /usr/bin/perl(main+0xe1) [0x400e01] [n0617:27211] [10] /lib64/libc.so.6(__libc_start_main+0xfd) [0x39cc01ed1d] [n0617:27211] [11] /usr/bin/perl() [0x400c59] [n0617:27211] *** End of error message *** [n0628:09116] *** Process received signal *** [n0628:09116] Signal: Segmentation fault (11) [n0628:09116] Signal code: Address not mapped (1) [n0628:09116] Failing at address: (nil) [n0628:09113] *** Process received signal *** [n0628:09113] Signal: Segmentation fault (11) [n0628:09113] Signal code: Address not mapped (1) [n0628:09113] Failing at address: (nil) [n0628:09113] [ 0] /lib64/libpthread.so.0() [0x39cc40f710] [n0628:09113] [ 1] /usr/lib64/perl5/CORE/libperl.so(Perl_pp_helem+0x3bd) [0x39d0ca9ded] [n0628:09113] [ 2] /usr/lib64/perl5/CORE/libperl.so(Perl_runops_standard+0x16) [0x39d0ca4b06] [n0628:09113] [ 3] /usr/lib64/perl5/CORE/libperl.so(Perl_call_sv+0x4cf) [0x39d0c4c5df] [n0628:09113] [ 4] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_clear+0xb6) [0x39d0cb8dd6] [n0628:09113] [ 5] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_free2+0x52) [0x39d0cb95d2] [n0628:09113] [ 6] /usr/lib64/perl5/CORE/libperl.so() [0x39d0cae5c1] [n0628:09113] [ 7] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_clean_objs+0x21) [0x39d0cae621] [n0628:09113] [ 8] /usr/lib64/perl5/CORE/libperl.so(perl_destruct+0x11d1) [0x39d0c4e901] [n0628:09113] [ 9] /usr/bin/perl(main+0xe1) [0x400e01] [n0628:09113] [10] 
/lib64/libc.so.6(__libc_start_main+0xfd) [0x39cc01ed1d] [n0628:09113] [11] /usr/bin/perl() [0x400c59] [n0628:09113] *** End of error message *** [n0628:09116] [ 0] /lib64/libpthread.so.0() [0x39cc40f710] [n0628:09116] [ 1] /usr/lib64/perl5/CORE/libperl.so(Perl_pp_helem+0x3bd) [0x39d0ca9ded] [n0628:09116] [ 2] /usr/lib64/perl5/CORE/libperl.so(Perl_runops_standard+0x16) [0x39d0ca4b06] [n0628:09116] [ 3] /usr/lib64/perl5/CORE/libperl.so(Perl_call_sv+0x4cf) [0x39d0c4c5df] [n0628:09116] [ 4] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_clear+0xb6) [0x39d0cb8dd6] [n0628:09116] [ 5] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_free2+0x52) [0x39d0cb95d2] [n0628:09116] [ 6] /usr/lib64/perl5/CORE/libperl.so() [0x39d0cae5c1] [n0628:09116] [ 7] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_clean_objs+0x21) [0x39d0cae621] [n0628:09116] [ 8] /usr/lib64/perl5/CORE/libperl.so(perl_destruct+0x11d1) [0x39d0c4e901] [n0628:09116] [ 9] /usr/bin/perl(main+0xe1) [0x400e01] [n0628:09116] [10] /lib64/libc.so.6(__libc_start_main+0xfd) [0x39cc01ed1d] [n0628:09116] [11] /usr/bin/perl() [0x400c59] [n0628:09116] *** End of error message *** + date Wed Oct 8 12:12:28 EDT 2014 -------------- next part -------------- An HTML attachment was scrubbed... URL: From carson.holt at genetics.utah.edu Wed Oct 8 13:59:44 2014 From: carson.holt at genetics.utah.edu (Carson Holt) Date: Wed, 8 Oct 2014 18:59:44 +0000 Subject: [maker-devel] Segmentation fault on exit In-Reply-To: References: Message-ID: The segmentation fault occurs after MAKER is finished, so everything is shutting down and OpenMPI is exiting. Why MPI can't shut down correctly can be for a number of reasons. 1. You can try updating OpenMPi or making sure you use the most up to date version available on your system (not uncommon for clusters to have multiple versions installed). Note that switching versions will require that you reinstall MAKER since it is compiled against the shared libraries of the version you are using. 2. 
Make sure you are not using a different version of mpiexec than the version of OpenMPI you installed MAKER with (check with 'which mpiexec'). This can happen on systems with multiple versions of MPI installed.

3. Segmentation faults can be caused by perl/C bindings when perl is exiting under the control of mpiexec (mpiexec actually affects how C code operates and can lead to some odd behavior when MPI_Finalize shuts down communication). If that is the case, you can reinstall any perl modules that are written in C. One example is the 'forks' module. One previous user who had a segfault on exit was able to solve it by downgrading the Proc-ProcessTable module from 0.45 to 0.44.

4. Make sure you run this command before starting mpiexec, or add it to your bash profile, to fix issues with shared libraries on OpenMPI.
--> export LD_PRELOAD=//lib/libmpi.so

Fortunately MAKER is already finished, so it won't affect your results, but it is a symptom of some incompatibility between a C-based library and your MPI installation.

Also, the -mca btl ^openib option disables the OpenFabrics libraries used by OpenMPI for InfiniBand communication, since those libraries have a known bug where failures occur whenever programs perform system calls (i.e. every time you open an external program from within a program). Examples of this would be MAKER calling BLAST, Augustus, or SNAP, all of which immediately cause OpenMPI to blow up if you don't add the flag.

--Carson

From: Justin Peyton
Date: Wednesday, October 8, 2014 at 11:52 AM
To:
Subject: Segmentation fault on exit

Hello,

I am getting a segmentation fault on exit (below). I have read about something similar to this involving hydra. I do not think that is the situation here. I am running Maker 2.31.6 with OpenMPI and the -mca btl ^openib option to work around an InfiniBand installation (even though I have no idea what it does).
My understanding is that this error has no effect on output because it happens after maker is finished. Can you confirm this? Any ideas what may be causing it? Thank you in advance Justin Peyton The Ohio State University Maker is now finished!!! Start_time: 1412784571 End_time: 1412784747 Elapsed: 176 [n0617:27212] *** Process received signal *** [n0617:27212] Signal: Segmentation fault (11) [n0617:27212] Signal code: Address not mapped (1) [n0617:27212] Failing at address: (nil) [n0617:27212] [ 0] /lib64/libpthread.so.0() [0x39cc40f710] [n0617:27212] [ 1] /usr/lib64/perl5/CORE/libperl.so(Perl_pp_helem+0x3bd) [0x39d0ca9ded] [n0617:27212] [ 2] /usr/lib64/perl5/CORE/libperl.so(Perl_runops_standard+0x16) [0x39d0ca4b06] [n0617:27212] [ 3] /usr/lib64/perl5/CORE/libperl.so(Perl_call_sv+0x4cf) [0x39d0c4c5df] [n0617:27212] [ 4] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_clear+0xb6) [0x39d0cb8dd6] [n0617:27212] [ 5] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_free2+0x52) [0x39d0cb95d2] [n0617:27212] [ 6] /usr/lib64/perl5/CORE/libperl.so() [0x39d0cae5c1] [n0617:27212] [ 7] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_clean_objs+0x21) [0x39d0cae621] [n0617:27212] [ 8] /usr/lib64/perl5/CORE/libperl.so(perl_destruct+0x11d1) [0x39d0c4e901] [n0617:27212] [ 9] /usr/bin/perl(main+0xe1) [0x400e01] [n0617:27212] [10] /lib64/libc.so.6(__libc_start_main+0xfd) [0x39cc01ed1d] [n0617:27212] [11] /usr/bin/perl() [0x400c59] [n0617:27212] *** End of error message *** [n0617:27207] *** Process received signal *** [n0617:27207] Signal: Segmentation fault (11) [n0617:27207] Signal code: Address not mapped (1) [n0617:27207] Failing at address: (nil) [n0617:27207] [ 0] /lib64/libpthread.so.0() [0x39cc40f710] [n0617:27207] [ 1] /usr/lib64/perl5/CORE/libperl.so(Perl_pp_helem+0x3bd) [0x39d0ca9ded] [n0617:27207] [ 2] /usr/lib64/perl5/CORE/libperl.so(Perl_runops_standard+0x16) [0x39d0ca4b06] [n0617:27207] [ 3] /usr/lib64/perl5/CORE/libperl.so(Perl_call_sv+0x4cf) [0x39d0c4c5df] [n0617:27207] [ 4] 
/usr/lib64/perl5/CORE/libperl.so(Perl_sv_clear+0xb6) [0x39d0cb8dd6] [n0617:27207] [ 5] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_free2+0x52) [0x39d0cb95d2] [n0617:27207] [ 6] /usr/lib64/perl5/CORE/libperl.so() [0x39d0cae5c1] [n0617:27207] [ 7] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_clean_objs+0x21) [0x39d0cae621] [n0617:27207] [ 8] /usr/lib64/perl5/CORE/libperl.so(perl_destruct+0x11d1) [0x39d0c4e901] [n0617:27207] [ 9] /usr/bin/perl(main+0xe1) [0x400e01] [n0617:27207] [10] /lib64/libc.so.6(__libc_start_main+0xfd) [0x39cc01ed1d] [n0617:27207] [11] /usr/bin/perl() [0x400c59] [n0617:27207] *** End of error message *** [n0617:27205] *** Process received signal *** [n0617:27205] Signal: Segmentation fault (11) [n0617:27205] Signal code: Address not mapped (1) [n0617:27205] Failing at address: (nil) [n0617:27205] [ 0] /lib64/libpthread.so.0() [0x39cc40f710] [n0617:27205] [ 1] /usr/lib64/perl5/CORE/libperl.so(Perl_pp_helem+0x3bd) [0x39d0ca9ded] [n0617:27205] [ 2] /usr/lib64/perl5/CORE/libperl.so(Perl_runops_standard+0x16) [0x39d0ca4b06] [n0617:27205] [ 3] /usr/lib64/perl5/CORE/libperl.so(Perl_call_sv+0x4cf) [0x39d0c4c5df] [n0617:27205] [ 4] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_clear+0xb6) [0x39d0cb8dd6] [n0617:27205] [ 5] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_free2+0x52) [0x39d0cb95d2] [n0617:27205] [ 6] /usr/lib64/perl5/CORE/libperl.so() [0x39d0cae5c1] [n0617:27205] [ 7] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_clean_objs+0x21) [0x39d0cae621] [n0617:27205] [ 8] /usr/lib64/perl5/CORE/libperl.so(perl_destruct+0x11d1) [0x39d0c4e901] [n0617:27205] [ 9] /usr/bin/perl(main+0xe1) [0x400e01] [n0617:27205] [10] /lib64/libc.so.6(__libc_start_main+0xfd) [0x39cc01ed1d] [n0617:27205] [11] /usr/bin/perl() [0x400c59] [n0617:27205] *** End of error message *** [n0617:27209] *** Process received signal *** [n0617:27209] Signal: Segmentation fault (11) [n0617:27209] Signal code: Address not mapped (1) [n0617:27209] Failing at address: (nil) [n0617:27209] [ 0] 
/lib64/libpthread.so.0() [0x39cc40f710] [n0617:27209] [ 1] /usr/lib64/perl5/CORE/libperl.so(Perl_pp_helem+0x3bd) [0x39d0ca9ded] [n0617:27209] [ 2] /usr/lib64/perl5/CORE/libperl.so(Perl_runops_standard+0x16) [0x39d0ca4b06] [n0617:27209] [ 3] /usr/lib64/perl5/CORE/libperl.so(Perl_call_sv+0x4cf) [0x39d0c4c5df] [n0617:27209] [ 4] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_clear+0xb6) [0x39d0cb8dd6] [n0617:27209] [ 5] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_free2+0x52) [0x39d0cb95d2] [n0617:27209] [ 6] /usr/lib64/perl5/CORE/libperl.so() [0x39d0cae5c1] [n0617:27209] [ 7] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_clean_objs+0x21) [0x39d0cae621] [n0617:27209] [ 8] /usr/lib64/perl5/CORE/libperl.so(perl_destruct+0x11d1) [0x39d0c4e901] [n0617:27209] [ 9] /usr/bin/perl(main+0xe1) [0x400e01] [n0617:27209] [10] /lib64/libc.so.6(__libc_start_main+0xfd) [0x39cc01ed1d] [n0617:27209] [11] /usr/bin/perl() [0x400c59] [n0617:27209] *** End of error message *** [n0617:27215] *** Process received signal *** [n0617:27215] Signal: Segmentation fault (11) [n0617:27215] Signal code: Address not mapped (1) [n0617:27215] Failing at address: (nil) [n0617:27215] [ 0] /lib64/libpthread.so.0() [0x39cc40f710] [n0617:27215] [ 1] /usr/lib64/perl5/CORE/libperl.so(Perl_pp_helem+0x3bd) [0x39d0ca9ded] [n0617:27215] [ 2] /usr/lib64/perl5/CORE/libperl.so(Perl_runops_standard+0x16) [0x39d0ca4b06] [n0617:27215] [ 3] /usr/lib64/perl5/CORE/libperl.so(Perl_call_sv+0x4cf) [0x39d0c4c5df] [n0617:27215] [ 4] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_clear+0xb6) [0x39d0cb8dd6] [n0617:27215] [ 5] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_free2+0x52) [0x39d0cb95d2] [n0617:27215] [ 6] /usr/lib64/perl5/CORE/libperl.so() [0x39d0cae5c1] [n0617:27215] [ 7] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_clean_objs+0x21) [0x39d0cae621] [n0617:27215] [ 8] /usr/lib64/perl5/CORE/libperl.so(perl_destruct+0x11d1) [0x39d0c4e901] [n0617:27215] [ 9] /usr/bin/perl(main+0xe1) [0x400e01] [n0617:27215] [10] 
/lib64/libc.so.6(__libc_start_main+0xfd) [0x39cc01ed1d] [n0617:27215] [11] /usr/bin/perl() [0x400c59] [n0617:27215] *** End of error message *** [n0617:27216] *** Process received signal *** [n0617:27216] Signal: Segmentation fault (11) [n0617:27216] Signal code: Address not mapped (1) [n0617:27216] Failing at address: (nil) [n0617:27216] [ 0] /lib64/libpthread.so.0() [0x39cc40f710] [n0617:27216] [ 1] /usr/lib64/perl5/CORE/libperl.so(Perl_pp_helem+0x3bd) [0x39d0ca9ded] [n0617:27216] [ 2] /usr/lib64/perl5/CORE/libperl.so(Perl_runops_standard+0x16) [0x39d0ca4b06] [n0617:27216] [ 3] /usr/lib64/perl5/CORE/libperl.so(Perl_call_sv+0x4cf) [0x39d0c4c5df] [n0617:27216] [ 4] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_clear+0xb6) [0x39d0cb8dd6] [n0617:27216] [ 5] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_free2+0x52) [0x39d0cb95d2] [n0617:27216] [ 6] /usr/lib64/perl5/CORE/libperl.so() [0x39d0cae5c1] [n0617:27216] [ 7] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_clean_objs+0x21) [0x39d0cae621] [n0617:27216] [ 8] /usr/lib64/perl5/CORE/libperl.so(perl_destruct+0x11d1) [0x39d0c4e901] [n0617:27216] [ 9] /usr/bin/perl(main+0xe1) [0x400e01] [n0617:27216] [10] /lib64/libc.so.6(__libc_start_main+0xfd) [0x39cc01ed1d] [n0617:27216] [11] /usr/bin/perl() [0x400c59] [n0617:27216] *** End of error message *** [n0617:27210] *** Process received signal *** [n0617:27210] Signal: Segmentation fault (11) [n0617:27210] Signal code: Address not mapped (1) [n0617:27210] Failing at address: (nil) [n0617:27210] [ 0] /lib64/libpthread.so.0() [0x39cc40f710] [n0617:27210] [ 1] /usr/lib64/perl5/CORE/libperl.so(Perl_pp_helem+0x3bd) [0x39d0ca9ded] [n0617:27210] [ 2] /usr/lib64/perl5/CORE/libperl.so(Perl_runops_standard+0x16) [0x39d0ca4b06] [n0617:27210] [ 3] /usr/lib64/perl5/CORE/libperl.so(Perl_call_sv+0x4cf) [0x39d0c4c5df] [n0617:27210] [ 4] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_clear+0xb6) [0x39d0cb8dd6] [n0617:27210] [ 5] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_free2+0x52) [0x39d0cb95d2] [n0617:27210] [ 
6] /usr/lib64/perl5/CORE/libperl.so() [0x39d0cae5c1] [n0617:27210] [ 7] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_clean_objs+0x21) [0x39d0cae621] [n0617:27210] [ 8] /usr/lib64/perl5/CORE/libperl.so(perl_destruct+0x11d1) [0x39d0c4e901] [n0617:27210] [ 9] /usr/bin/perl(main+0xe1) [0x400e01] [n0617:27210] [10] /lib64/libc.so.6(__libc_start_main+0xfd) [0x39cc01ed1d] [n0617:27210] [11] /usr/bin/perl() [0x400c59] [n0617:27210] *** End of error message *** [n0617:27217] *** Process received signal *** [n0617:27217] Signal: Segmentation fault (11) [n0617:27217] Signal code: Address not mapped (1) [n0617:27217] Failing at address: (nil) [n0617:27217] [ 0] /lib64/libpthread.so.0() [0x39cc40f710] [n0617:27217] [ 1] /usr/lib64/perl5/CORE/libperl.so(Perl_pp_helem+0x3bd) [0x39d0ca9ded] [n0617:27217] [ 2] /usr/lib64/perl5/CORE/libperl.so(Perl_runops_standard+0x16) [0x39d0ca4b06] [n0617:27217] [ 3] /usr/lib64/perl5/CORE/libperl.so(Perl_call_sv+0x4cf) [0x39d0c4c5df] [n0617:27217] [ 4] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_clear+0xb6) [0x39d0cb8dd6] [n0617:27217] [ 5] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_free2+0x52) [0x39d0cb95d2] [n0617:27217] [ 6] /usr/lib64/perl5/CORE/libperl.so() [0x39d0cae5c1] [n0617:27217] [ 7] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_clean_objs+0x21) [0x39d0cae621] [n0617:27217] [ 8] /usr/lib64/perl5/CORE/libperl.so(perl_destruct+0x11d1) [0x39d0c4e901] [n0617:27217] [ 9] /usr/bin/perl(main+0xe1) [0x400e01] [n0617:27217] [10] /lib64/libc.so.6(__libc_start_main+0xfd) [0x39cc01ed1d] [n0617:27217] [11] /usr/bin/perl() [0x400c59] [n0617:27217] *** End of error message *** [n0617:27213] *** Process received signal *** [n0617:27213] Signal: Segmentation fault (11) [n0617:27213] Signal code: Address not mapped (1) [n0617:27213] Failing at address: (nil) [n0617:27213] [ 0] /lib64/libpthread.so.0() [0x39cc40f710] [n0617:27213] [ 1] /usr/lib64/perl5/CORE/libperl.so(Perl_pp_helem+0x3bd) [0x39d0ca9ded] [n0617:27213] [ 2] 
/usr/lib64/perl5/CORE/libperl.so(Perl_runops_standard+0x16) [0x39d0ca4b06] [n0617:27213] [ 3] /usr/lib64/perl5/CORE/libperl.so(Perl_call_sv+0x4cf) [0x39d0c4c5df] [n0617:27213] [ 4] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_clear+0xb6) [0x39d0cb8dd6] [n0617:27213] [ 5] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_free2+0x52) [0x39d0cb95d2] [n0617:27213] [ 6] /usr/lib64/perl5/CORE/libperl.so() [0x39d0cae5c1] [n0617:27213] [ 7] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_clean_objs+0x21) [0x39d0cae621] [n0617:27213] [ 8] /usr/lib64/perl5/CORE/libperl.so(perl_destruct+0x11d1) [0x39d0c4e901] [n0617:27213] [ 9] /usr/bin/perl(main+0xe1) [0x400e01] [n0617:27213] [10] /lib64/libc.so.6(__libc_start_main+0xfd) [0x39cc01ed1d] [n0617:27213] [11] /usr/bin/perl() [0x400c59] [n0617:27213] *** End of error message *** [n0617:27208] *** Process received signal *** [n0617:27208] Signal: Segmentation fault (11) [n0617:27208] Signal code: Address not mapped (1) [n0617:27208] Failing at address: (nil) [n0617:27208] [ 0] /lib64/libpthread.so.0() [0x39cc40f710] [n0617:27208] [ 1] /usr/lib64/perl5/CORE/libperl.so(Perl_pp_helem+0x3bd) [0x39d0ca9ded] [n0617:27208] [ 2] /usr/lib64/perl5/CORE/libperl.so(Perl_runops_standard+0x16) [0x39d0ca4b06] [n0617:27208] [ 3] /usr/lib64/perl5/CORE/libperl.so(Perl_call_sv+0x4cf) [0x39d0c4c5df] [n0617:27208] [ 4] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_clear+0xb6) [0x39d0cb8dd6] [n0617:27208] [ 5] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_free2+0x52) [0x39d0cb95d2] [n0617:27208] [ 6] /usr/lib64/perl5/CORE/libperl.so() [0x39d0cae5c1] [n0617:27208] [ 7] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_clean_objs+0x21) [0x39d0cae621] [n0617:27208] [ 8] /usr/lib64/perl5/CORE/libperl.so(perl_destruct+0x11d1) [0x39d0c4e901] [n0617:27208] [ 9] /usr/bin/perl(main+0xe1) [0x400e01] [n0617:27208] [10] /lib64/libc.so.6(__libc_start_main+0xfd) [0x39cc01ed1d] [n0617:27208] [11] /usr/bin/perl() [0x400c59] [n0617:27208] *** End of error message *** [n0617:27211] *** Process received signal 
*** [n0617:27211] Signal: Segmentation fault (11) [n0617:27211] Signal code: Address not mapped (1) [n0617:27211] Failing at address: (nil) [n0617:27211] [ 0] /lib64/libpthread.so.0() [0x39cc40f710] [n0617:27211] [ 1] /usr/lib64/perl5/CORE/libperl.so(Perl_pp_helem+0x3bd) [0x39d0ca9ded] [n0617:27211] [ 2] /usr/lib64/perl5/CORE/libperl.so(Perl_runops_standard+0x16) [0x39d0ca4b06] [n0617:27211] [ 3] /usr/lib64/perl5/CORE/libperl.so(Perl_call_sv+0x4cf) [0x39d0c4c5df] [n0617:27211] [ 4] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_clear+0xb6) [0x39d0cb8dd6] [n0617:27211] [ 5] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_free2+0x52) [0x39d0cb95d2] [n0617:27211] [ 6] /usr/lib64/perl5/CORE/libperl.so() [0x39d0cae5c1] [n0617:27211] [ 7] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_clean_objs+0x21) [0x39d0cae621] [n0617:27211] [ 8] /usr/lib64/perl5/CORE/libperl.so(perl_destruct+0x11d1) [0x39d0c4e901] [n0617:27211] [ 9] /usr/bin/perl(main+0xe1) [0x400e01] [n0617:27211] [10] /lib64/libc.so.6(__libc_start_main+0xfd) [0x39cc01ed1d] [n0617:27211] [11] /usr/bin/perl() [0x400c59] [n0617:27211] *** End of error message *** [n0628:09116] *** Process received signal *** [n0628:09116] Signal: Segmentation fault (11) [n0628:09116] Signal code: Address not mapped (1) [n0628:09116] Failing at address: (nil) [n0628:09113] *** Process received signal *** [n0628:09113] Signal: Segmentation fault (11) [n0628:09113] Signal code: Address not mapped (1) [n0628:09113] Failing at address: (nil) [n0628:09113] [ 0] /lib64/libpthread.so.0() [0x39cc40f710] [n0628:09113] [ 1] /usr/lib64/perl5/CORE/libperl.so(Perl_pp_helem+0x3bd) [0x39d0ca9ded] [n0628:09113] [ 2] /usr/lib64/perl5/CORE/libperl.so(Perl_runops_standard+0x16) [0x39d0ca4b06] [n0628:09113] [ 3] /usr/lib64/perl5/CORE/libperl.so(Perl_call_sv+0x4cf) [0x39d0c4c5df] [n0628:09113] [ 4] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_clear+0xb6) [0x39d0cb8dd6] [n0628:09113] [ 5] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_free2+0x52) [0x39d0cb95d2] [n0628:09113] [ 6] 
/usr/lib64/perl5/CORE/libperl.so() [0x39d0cae5c1] [n0628:09113] [ 7] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_clean_objs+0x21) [0x39d0cae621] [n0628:09113] [ 8] /usr/lib64/perl5/CORE/libperl.so(perl_destruct+0x11d1) [0x39d0c4e901] [n0628:09113] [ 9] /usr/bin/perl(main+0xe1) [0x400e01] [n0628:09113] [10] /lib64/libc.so.6(__libc_start_main+0xfd) [0x39cc01ed1d] [n0628:09113] [11] /usr/bin/perl() [0x400c59] [n0628:09113] *** End of error message *** [n0628:09116] [ 0] /lib64/libpthread.so.0() [0x39cc40f710] [n0628:09116] [ 1] /usr/lib64/perl5/CORE/libperl.so(Perl_pp_helem+0x3bd) [0x39d0ca9ded] [n0628:09116] [ 2] /usr/lib64/perl5/CORE/libperl.so(Perl_runops_standard+0x16) [0x39d0ca4b06] [n0628:09116] [ 3] /usr/lib64/perl5/CORE/libperl.so(Perl_call_sv+0x4cf) [0x39d0c4c5df] [n0628:09116] [ 4] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_clear+0xb6) [0x39d0cb8dd6] [n0628:09116] [ 5] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_free2+0x52) [0x39d0cb95d2] [n0628:09116] [ 6] /usr/lib64/perl5/CORE/libperl.so() [0x39d0cae5c1] [n0628:09116] [ 7] /usr/lib64/perl5/CORE/libperl.so(Perl_sv_clean_objs+0x21) [0x39d0cae621] [n0628:09116] [ 8] /usr/lib64/perl5/CORE/libperl.so(perl_destruct+0x11d1) [0x39d0c4e901] [n0628:09116] [ 9] /usr/bin/perl(main+0xe1) [0x400e01] [n0628:09116] [10] /lib64/libc.so.6(__libc_start_main+0xfd) [0x39cc01ed1d] [n0628:09116] [11] /usr/bin/perl() [0x400c59] [n0628:09116] *** End of error message *** + date Wed Oct 8 12:12:28 EDT 2014 -------------- next part -------------- An HTML attachment was scrubbed... URL: From Timothy.Stitt at tgac.ac.uk Wed Oct 8 15:13:21 2014 From: Timothy.Stitt at tgac.ac.uk (Timothy Stitt (TGAC)) Date: Wed, 8 Oct 2014 20:13:21 +0000 Subject: [maker-devel] Maker Bio::Root Error In-Reply-To: References: Message-ID: Thanks Carson?that did the trick. The user was able to format his input file correctly and all worked well. Great support, Tim. 
---
Timothy Stitt PhD / Head of Scientific Computing
The Genome Analysis Centre (TGAC)
http://www.tgac.ac.uk/
p: +44 1603 450378
e: timothy.stitt at tgac.ac.uk

From: Carson Holt
Date: Tuesday, 7 October 2014 17:17
To: Timothy Stitt
Cc: "maker-devel at yandell-lab.org"
Subject: Re: [maker-devel] Maker Bio::Root Error

His file is not formatted correctly. Values should be tab delimited, but in several cases he has leading space characters contaminating the values. He needs to find and remove the contaminating whitespace. Here is the GFF3 specification just for reference --> http://www.sequenceontology.org/gff3.shtml.

Here is an example perl script that could do this (cut and paste it into a file if you want)-->

#!/usr/bin/perl
use strict;
use warnings;

my $file = shift;
open(IN, "< $file") or die "Cannot open $file: $!";
while(my $line = <IN>){
    # split the row into its tab-delimited columns
    my @F = split(/\t/, $line);
    chomp($F[-1]);
    # strip all leading and trailing whitespace from every column
    @F = map {s/^\s+|\s+$//g; $_} @F;
    print join("\t", @F)."\n";
}
close(IN);

Then run it as follows -->
perl fixgff3_script.pl old_file.gff > new_file.gff

Thanks,
Carson

From: "Timothy Stitt (TGAC)"
Date: Tuesday, October 7, 2014 at 1:34 AM
To: Carson Holt
Subject: Re: [maker-devel] Maker Bio::Root Error

Hi Carson,

I spoke with the user and it does seem they have some confusion over what is a well-formed GFF3 file (they mention there is no example template for them to copy on the MAKER website). I am attaching the user's GFF3 file. Could you have a quick scan to determine if they are using an incorrect format? Any advice greatly received.

Thanks,
Tim.

---
Timothy Stitt PhD / Head of Scientific Computing
The Genome Analysis Centre (TGAC)
http://www.tgac.ac.uk/
p: +44 1603 450378
e: timothy.stitt at tgac.ac.uk

From: Carson Holt
Date: Sunday, 5 October 2014 23:58
To: Timothy Stitt, "maker-devel at yandell-lab.org"
Subject: Re: [maker-devel] Maker Bio::Root Error

The error occurs when MAKER tries to read a user-provided GFF3 file, and BioPerl is reporting that one of the values is invalid.
Looking at the single quotes around the value, it appears that there is some contaminating whitespace. There may be other problems with the GFF3 file as well. I could take a look if you want.

Thanks,
Carson

From: "Timothy Stitt (TGAC)"
Date: Saturday, October 4, 2014 at 9:14 AM
To: "maker-devel at yandell-lab.org"
Subject: [maker-devel] Maker Bio::Root Error

Dear Maker Developers,

One of my Maker users is observing the following error when running maker on our systems:

------------- EXCEPTION: Bio::Root::BadParameter -------------
MSG: ' 9.1' is not a valid score
VALUE: 9.1
STACK: Error::throw
STACK: Bio::Root::Root::throw /tgac/software/testing/perl_activeperl/5.18.2.1802/x86_64/site/lib/Bio/Root/Root.pm:449
STACK: Bio::SeqFeature::Generic::score /tgac/software/testing/perl_activeperl/5.18.2.1802/x86_64/site/lib/Bio/SeqFeature/Generic.pm:468
STACK: GFFDB::_ary_to_features /tgac/software/testing/maker/2.31.6/x86_64/bin/../lib/GFFDB.pm:891
STACK: GFFDB::phathits_on_chunk /tgac/software/testing/maker/2.31.6/x86_64/bin/../lib/GFFDB.pm:534
STACK: Process::MpiChunk::_go /tgac/software/testing/maker/2.31.6/x86_64/bin/../lib/Process/MpiChunk.pm:756
STACK: Process::MpiChunk::run /tgac/software/testing/maker/2.31.6/x86_64/bin/../lib/Process/MpiChunk.pm:341
STACK: Process::MpiChunk::run_all /tgac/software/testing/maker/2.31.6/x86_64/bin/../lib/Process/MpiChunk.pm:357
STACK: Process::MpiTiers::run_all /tgac/software/testing/maker/2.31.6/x86_64/bin/../lib/Process/MpiTiers.pm:287
STACK: Process::MpiTiers::run_all /tgac/software/testing/maker/2.31.6/x86_64/bin/../lib/Process/MpiTiers.pm:287
STACK: /tgac/software/testing/bin/core/../..//maker/2.31.6/x86_64/bin/maker:686
--------------------------------------------------------------
--> rank=NA, hostname=UV00000010-P002
ERROR: Failed while doing repeat masking

When the user runs with the '-RM_off' option, everything is fine, but the run fails with the above error when that option is not applied.
I was just wondering if anyone had any insight into what might be causing this?

Regards,
Tim.

---
Timothy Stitt PhD / Head of Scientific Computing
The Genome Analysis Centre (TGAC)
http://www.tgac.ac.uk/
p: +44 1603 450378
e: timothy.stitt at tgac.ac.uk

From carsonhh at gmail.com Wed Oct 8 15:15:12 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 08 Oct 2014 14:15:12 -0600
Subject: [maker-devel] Maker Bio::Root Error
In-Reply-To:
References:
Message-ID:

Glad it's working.

Thanks,
Carson

From: "Timothy Stitt (TGAC)"
Date: Wednesday, October 8, 2014 at 2:13 PM
To: Carson Holt
Cc: "maker-devel at yandell-lab.org"
Subject: Re: [maker-devel] Maker Bio::Root Error

Thanks Carson, that did the trick. The user was able to format his input file correctly and all worked well.

Great support,
Tim.

---
Timothy Stitt PhD / Head of Scientific Computing
The Genome Analysis Centre (TGAC)
http://www.tgac.ac.uk/
p: +44 1603 450378
e: timothy.stitt at tgac.ac.uk

From: Carson Holt
Date: Tuesday, 7 October 2014 17:17
To: Timothy Stitt
Cc: "maker-devel at yandell-lab.org"
Subject: Re: [maker-devel] Maker Bio::Root Error

His file is not formatted correctly. Values should be tab delimited, but in several cases he has leading space characters contaminating the values. He needs to find and remove the contaminating whitespace. Here is the GFF3 specification just for reference --> http://www.sequenceontology.org/gff3.shtml.
Here is an example perl script that could do this (cut and paste it into a file if you want)-->

#!/usr/bin/perl
use strict;
use warnings;

my $file = shift;
open(IN, "< $file") or die "Cannot open $file: $!";
while(my $line = <IN>){
    # split the row into its tab-delimited columns
    my @F = split(/\t/, $line);
    chomp($F[-1]);
    # strip all leading and trailing whitespace from every column
    @F = map {s/^\s+|\s+$//g; $_} @F;
    print join("\t", @F)."\n";
}
close(IN);

Then run it as follows -->
perl fixgff3_script.pl old_file.gff > new_file.gff

Thanks,
Carson

From: "Timothy Stitt (TGAC)"
Date: Tuesday, October 7, 2014 at 1:34 AM
To: Carson Holt
Subject: Re: [maker-devel] Maker Bio::Root Error

Hi Carson,

I spoke with the user and it does seem they have some confusion over what is a well-formed GFF3 file (they mention there is no example template for them to copy on the MAKER website). I am attaching the user's GFF3 file. Could you have a quick scan to determine if they are using an incorrect format? Any advice greatly received.

Thanks,
Tim.

---
Timothy Stitt PhD / Head of Scientific Computing
The Genome Analysis Centre (TGAC)
http://www.tgac.ac.uk/
p: +44 1603 450378
e: timothy.stitt at tgac.ac.uk

From: Carson Holt
Date: Sunday, 5 October 2014 23:58
To: Timothy Stitt, "maker-devel at yandell-lab.org"
Subject: Re: [maker-devel] Maker Bio::Root Error

The error occurs when MAKER tries to read a user-provided GFF3 file, and BioPerl is reporting that one of the values is invalid. Looking at the single quotes around the value, it appears that there is some contaminating whitespace. There may be other problems with the GFF3 file as well. I could take a look if you want.
Thanks,
Carson

From: "Timothy Stitt (TGAC)"
Date: Saturday, October 4, 2014 at 9:14 AM
To: "maker-devel at yandell-lab.org"
Subject: [maker-devel] Maker Bio::Root Error

Dear Maker Developers,

One of my Maker users is observing the following error when running maker on our systems:

------------- EXCEPTION: Bio::Root::BadParameter -------------
MSG: ' 9.1' is not a valid score
VALUE: 9.1
STACK: Error::throw
STACK: Bio::Root::Root::throw /tgac/software/testing/perl_activeperl/5.18.2.1802/x86_64/site/lib/Bio/Root/Root.pm:449
STACK: Bio::SeqFeature::Generic::score /tgac/software/testing/perl_activeperl/5.18.2.1802/x86_64/site/lib/Bio/SeqFeature/Generic.pm:468
STACK: GFFDB::_ary_to_features /tgac/software/testing/maker/2.31.6/x86_64/bin/../lib/GFFDB.pm:891
STACK: GFFDB::phathits_on_chunk /tgac/software/testing/maker/2.31.6/x86_64/bin/../lib/GFFDB.pm:534
STACK: Process::MpiChunk::_go /tgac/software/testing/maker/2.31.6/x86_64/bin/../lib/Process/MpiChunk.pm:756
STACK: Process::MpiChunk::run /tgac/software/testing/maker/2.31.6/x86_64/bin/../lib/Process/MpiChunk.pm:341
STACK: Process::MpiChunk::run_all /tgac/software/testing/maker/2.31.6/x86_64/bin/../lib/Process/MpiChunk.pm:357
STACK: Process::MpiTiers::run_all /tgac/software/testing/maker/2.31.6/x86_64/bin/../lib/Process/MpiTiers.pm:287
STACK: Process::MpiTiers::run_all /tgac/software/testing/maker/2.31.6/x86_64/bin/../lib/Process/MpiTiers.pm:287
STACK: /tgac/software/testing/bin/core/../..//maker/2.31.6/x86_64/bin/maker:686
--------------------------------------------------------------
--> rank=NA, hostname=UV00000010-P002
ERROR: Failed while doing repeat masking

When the user runs with the '-RM_off' option, everything is fine but fails with the above error when not applying that option. I was just wondering if anyone had any insight into what might be causing this?

Regards,
Tim.
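
For readers who want to check a GFF3 file for this kind of problem before handing it to MAKER, the following is a minimal Python sketch. It is not part of MAKER or BioPerl: the function names find_padded_fields and strip_fields and the sample row in the usage note are hypothetical, chosen only for illustration. It assumes the file is tab-delimited and reports any column whose value carries leading or trailing whitespace, such as the ' 9.1' score in the error message above.

```python
import sys


def find_padded_fields(lines):
    """Return (line_number, column_number, raw_value) for every
    tab-delimited field that has leading or trailing whitespace.
    Comment lines (starting with '#') and blank lines are skipped."""
    problems = []
    for lineno, line in enumerate(lines, start=1):
        row = line.rstrip("\r\n")
        if not row or row.startswith("#"):
            continue
        for col, value in enumerate(row.split("\t"), start=1):
            if value != value.strip():
                problems.append((lineno, col, value))
    return problems


def strip_fields(line):
    """Return the row with whitespace trimmed from each column.
    Comment and blank lines are returned unchanged (minus the newline)."""
    row = line.rstrip("\r\n")
    if not row or row.startswith("#"):
        return row
    return "\t".join(field.strip() for field in row.split("\t"))


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python check_gff3.py old_file.gff
    with open(sys.argv[1]) as handle:
        for lineno, col, value in find_padded_fields(handle):
            print(f"line {lineno}, column {col}: {value!r}")
```

Run against a file, it prints one line per offending field and leaves the file itself untouched; strip_fields can then be applied row by row to write a cleaned copy, in the same spirit as the perl script quoted earlier in the thread. Column 6 is the score column in GFF3, so a hit there matches the BioPerl complaint.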
---
Timothy Stitt PhD / Head of Scientific Computing
The Genome Analysis Centre (TGAC)
http://www.tgac.ac.uk/
p: +44 1603 450378
e: timothy.stitt at tgac.ac.uk

From Brian.Mack at ARS.USDA.GOV Fri Oct 17 15:33:55 2014
From: Brian.Mack at ARS.USDA.GOV (Mack, Brian)
Date: Fri, 17 Oct 2014 20:33:55 +0000
Subject: [maker-devel] missing gene coordinate
Message-ID:

Hi,

I have some strange results after I have added non-overlapping predictions that had InterPro domains using the "pred_gff" option and "keep_preds=1". About 19 genes out of 13,000 total now have a missing start coordinate. I copied an example below (maker-168-augustus-gene-0.1). This particular gene was not even a non-overlapping prediction and it had proper coordinates before I added the non-overlapping predictions.

Brian

168 maker five_prime_UTR 1610 1711 . + . ID=maker-168-pred_gff_JCVI-gene-0.0-mRNA-1:five_prime_utr;Parent=maker-168-pred_gff_JCVI-gene-0.0-mRNA-1
168 maker CDS 1712 1784 . + 0 ID=maker-168-pred_gff_JCVI-gene-0.0-mRNA-1:cds;Parent=maker-168-pred_gff_JCVI-gene-0.0-mRNA-1
168 maker CDS 1846 1885 . + 2 ID=maker-168-pred_gff_JCVI-gene-0.0-mRNA-1:cds;Parent=maker-168-pred_gff_JCVI-gene-0.0-mRNA-1
168 maker CDS 2042 2145 . + 1 ID=maker-168-pred_gff_JCVI-gene-0.0-mRNA-1:cds;Parent=maker-168-pred_gff_JCVI-gene-0.0-mRNA-1
168 maker CDS 2219 2619 . + 2 ID=maker-168-pred_gff_JCVI-gene-0.0-mRNA-1:cds;Parent=maker-168-pred_gff_JCVI-gene-0.0-mRNA-1
168 maker gene 6737 . + . ID=maker-168-augustus-gene-0.1;Name=maker-168-augustus-gene-0.1
168 maker mRNA 5531 6737 . + . ID=maker-168-augustus-gene-0.1-mRNA-1;Parent=maker-168-augustus-gene-0.1;Name=maker-168-augustus-gene-0.1-mRNA-1;_AED=0.37;_eAED=0.37;_QI=83|0.6|0.66|1|1|1|6|203|184
168 maker exon 5531 5635 . + . ID=maker-168-augustus-gene-0.1-mRNA-1:exon:3421;Parent=maker-168-augustus-gene-0.1-mRNA-1
168 maker exon 5692 5802 . + . ID=maker-168-augustus-gene-0.1-mRNA-1:exon:3422;Parent=maker-168-augustus-gene-0.1-mRNA-1
168 maker exon 5880 5964 . + . ID=maker-168-augustus-gene-0.1-mRNA-1:exon:3423;Parent=maker-168-augustus-gene-0.1-mRNA-1
168 maker exon 6024 6087 . + . ID=maker-168-augustus-gene-0.1-mRNA-1:exon:3424;Parent=maker-168-augustus-gene-0.1-mRNA-1
168 maker exon 6193 6314 . + . ID=maker-168-augustus-gene-0.1-mRNA-1:exon:3425;Parent=maker-168-augustus-gene-0.1-mRNA-1
168 maker exon 6384 6737 . + . ID=maker-168-augustus-gene-0.1-mRNA-1:exon:3426;Parent=maker-168-augustus-gene-0.1-mRNA-1
168 maker five_prime_UTR 5531 5613 . + . ID=maker-168-augustus-gene-0.1-mRNA-1:five_prime_utr;Parent=maker-168-augustus-gene-0.1-mRNA-1
168 maker CDS 5614 5635 . + 0 ID=maker-168-augustus-gene-0.1-mRNA-1:cds;Parent=maker-168-augustus-gene-0.1-mRNA-1
168 maker CDS 5692 5802 . + 2 ID=maker-168-augustus-gene-0.1-mRNA-1:cds;Parent=maker-168-augustus-gene-0.1-mRNA-1
168 maker CDS 5880 5964 . + 2 ID=maker-168-augustus-gene-0.1-mRNA-1:cds;Parent=maker-168-augustus-gene-0.1-mRNA-1
168 maker CDS 6024 6087 . + 1 ID=maker-168-augustus-gene-0.1-mRNA-1:cds;Parent=maker-168-augustus-gene-0.1-mRNA-1
168 maker CDS 6193 6314 . + 0 ID=maker-168-augustus-gene-0.1-mRNA-1:cds;Parent=maker-168-augustus-gene-0.1-mRNA-1
168 maker CDS 6384 6534 . + 1 ID=maker-168-augustus-gene-0.1-mRNA-1:cds;Parent=maker-168-augustus-gene-0.1-mRNA-1
168 maker three_prime_UTR 6535 6737 . + . ID=maker-168-augustus-gene-0.1-mRNA-1:three_prime_utr;Parent=maker-168-augustus-gene-0.1-mRNA-1

This electronic message contains information generated by the USDA solely for the intended recipients.
Any unauthorized interception of this message or the use or disclosure of the information it contains may violate the law and subject the violator to civil or criminal penalties. If you believe you have received this message in error, please notify the sender and delete the email immediately.

From carsonhh at gmail.com Mon Oct 20 13:20:41 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 20 Oct 2014 12:20:41 -0600
Subject: [maker-devel] missing gene coordinate
In-Reply-To:
References:
Message-ID: <6BE786D1-B5B6-449E-952D-20309FB44D29@gmail.com>

Could you send me your maker control files, one of the contigs this occurs on, and the input gff3 file you are using?

Thanks,
Carson

From allisonfuiten at gmail.com Mon Oct 20 16:39:20 2014
From: allisonfuiten at gmail.com (Allison Fuiten)
Date: Mon, 20 Oct 2014 14:39:20 -0700
Subject: [maker-devel] Protein Evidence for teleost fish
Message-ID:

Hello,

I am currently using Maker to annotate a *de novo* genome assembly for a teleost fish. I would like some clarification that I am using an appropriate set of protein evidence for the annotation pipeline. For mRNA/EST evidence, I am using two independent transcriptomes (assembled with Trinity) from my specific species.

For the protein evidence, I am planning on using two proteomes from closely related model teleost species from the Ensembl database. From Ensembl, you can download all protein translations for a given species either resulting from known or novel gene models which are based on transcriptome & proteome data (the "pep.all.fa" file) or resulting from 'ab initio' gene prediction algorithms solely based on the genomic sequence with no other experimental evidence ("pep.abinitio.fa" file). I'm planning on downloading the pep.all fasta files.
Alternatively, after reading various posts on the Maker google group, I realize that I can also download proteomes from teleost fish from UniProt (www.uniprot.org/proteomes). UniProt proteomes can contain both reviewed and unreviewed protein sequences and for the fish species I'm interested in downloading, they mostly contain unreviewed proteins.

Do you recommend that I use the UniProt proteomes instead of the Ensembl proteomes?

Also, there are actually four different model teleost species with available proteomes that are equally related to my teleost species. They're all in different taxonomic orders, but that's as closely related as I can get! Should I stick to just using proteomes from two species or should I up it to three or four?

In addition, I have read in previous posts that you recommend using a comprehensive set of proteins from UniProt/Swissprot. Avoiding the unreviewed UniProt/TrEMBL datasets, should I download the complete, reviewed set of all UniProt/Swissprot proteins (uniprot_sprot.fasta.gz)? Under taxonomic divisions, there seems to be an option to download just vertebrate UniProt/Swissprot proteins (the uniprot_sprot_vertebrates.dat.gz file). This only seems to be available in a .dat file format, but converting a .dat file into a .fasta seems to be possible.

My apologies if you have already answered these questions in the past. Any help on these points will be greatly appreciated.

Thank you,
Allison

From carsonhh at gmail.com Mon Oct 20 17:18:29 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 20 Oct 2014 16:18:29 -0600
Subject: [maker-devel] Protein Evidence for teleost fish
In-Reply-To:
References:
Message-ID: <7C52537D-C35D-41BA-A442-916F41436F8E@gmail.com>

You can use as many proteomes as you would like. The tradeoff is that runtime increases. 95% of runtime is just evidence alignment, so twice the evidence means about twice the runtime.
Use at least 2 related proteomes, and perhaps all of UniProt/swiss-prot, which is very well curated and will contain a number of distant outgroups for genes that are conserved across species.

--Carson

From dence at genetics.utah.edu Mon Oct 20 17:40:29 2014
From: dence at genetics.utah.edu (Daniel Ence)
Date: Mon, 20 Oct 2014 22:40:29 +0000
Subject: [maker-devel] Protein Evidence for teleost fish
In-Reply-To:
References:
Message-ID:

Hi Allison,

I think that the Ensembl proteome dataset and the uniprot/swissprot dataset are likely to be useful for your annotation project. The omnibus nature of SwissProt (or one of the reduced datasets like UniProt90) helps to make sure that any proteins that might be missing from the closely-related species' proteomes (or the transcriptome datasets) can still be identified. Since the uniprot dataset for teleost fish contains unreviewed protein sequence, you probably want to avoid that.

It's also worth noting that the transcriptome dataset is necessary for identifying features like 3' and 5'
UTRs and refining the structure of the gene models, so there's a limit to the improvements that you'll get to the MAKER results when you increase the size of the proteome dataset.

I hope that helps. Feel free to let us know if you have any more questions.

Thanks,
Daniel

From ranjani at uga.edu Thu Oct 23 14:05:12 2014
From: ranjani at uga.edu (Sivaranjani Namasivayam)
Date: Thu, 23 Oct 2014 19:05:12 +0000
Subject: [maker-devel] MAKER error Failed while polishing proteins
Message-ID: <1414091111842.42181@uga.edu>

Hi,

I get this error when I run MAKER

ERROR: Failed while polishing proteins
ERROR: Chunk failed at level:10, tier_type:3
FAILED CONTIG:scaffold00006

ERROR: Chunk failed at level:4, tier_type:0
FAILED CONTIG:scaffold00006

Other scaffolds run fine, but this scaffold keeps failing. Would it mean something is wrong with the proteins in this scaffold?

Thanks,
Ranjani

From carsonhh at gmail.com Thu Oct 23 14:20:08 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 23 Oct 2014 13:20:08 -0600
Subject: [maker-devel] MAKER error Failed while polishing proteins
In-Reply-To: <1414091111842.42181@uga.edu>
References: <1414091111842.42181@uga.edu>
Message-ID:

The causal error will be further up the error log.
You may just want to send your captured STDERR file.

--Carson

From iarmean at ebi.ac.uk Fri Oct 24 08:17:51 2014
From: iarmean at ebi.ac.uk (Irina Armean)
Date: Fri, 24 Oct 2014 14:17:51 +0100
Subject: [maker-devel] MAKER contig: Non-unique top level ID
Message-ID:

Dear all,

I can't get a subset of contigs to finish; they keep failing.

The log below is from a scaffold that is 519 992 long, with max_dna_len=100000. No gff file is created. Only evidence_0.gff.an contains data (blast, protein2genome); the other 4 are empty (evidence_1, evidence_2, evidence_3, evidence_4).

...processing 30 of 32
...processing 31 of 32
total clusters:5 now processing 0
flattening protein clusters
prepare section files
Gathering GFF3 input into hits - chunk:0
ERROR: Non-unique top level ID for
While this is technically legal in GFF3, it usually
indicates a poorly fomatted GFF3 file (perhaps you
tried to merge two GFF3 files without accounting for
unique IDs). MAKER will not handle these correctly.
--> rank=NA, hostname=ebi3-202.ebi.ac.uk
ERROR: Failed while prepare section files
ERROR: Chunk failed at level:12, tier_type:3
FAILED CONTIG:scaffold2

ERROR: Chunk failed at level:4, tier_type:0
FAILED CONTIG:scaffold2

examining contents of the fasta file and run log

--Next Contig--

Any suggestions much appreciated.

Best wishes,
Irina

From dence at genetics.utah.edu Fri Oct 24 08:51:25 2014
From: dence at genetics.utah.edu (Daniel Ence)
Date: Fri, 24 Oct 2014 13:51:25 +0000
Subject: [maker-devel] MAKER contig: Non-unique top level ID
In-Reply-To:
References:
Message-ID:

Hi Irina,

Have you looked through your input for the duplicate IDs that the error talks about? It looks like some of your input files contain features with the same ID.

~Daniel

From iarmean at ebi.ac.uk Fri Oct 24 08:56:20 2014
From: iarmean at ebi.ac.uk (Irina Armean)
Date: Fri, 24 Oct 2014 14:56:20 +0100
Subject: [maker-devel] MAKER contig: Non-unique top level ID
In-Reply-To:
References:
Message-ID:

Hi Daniel,

Thanks for the mail. Yes, that was one of the sources I considered, but as the same input gff3 files have been successfully used in other maker runs, I've excluded them as a potential cause. I believe it has to do with the collation of the expected 5 chunk GFFs of scaffold2, but I have not been able to confirm or rule this out.

Best wishes,
Irina

From carsonhh at gmail.com Mon Oct 27 11:18:51 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 27 Oct 2014 10:18:51 -0600
Subject: [maker-devel] MAKER error Failed while polishing proteins
In-Reply-To: <1414355724461.3883@uga.edu>
References: <1414091111842.42181@uga.edu> <, <>> <1414355724461.3883@uga.edu>
Message-ID: <91055BEB-51AA-4C13-9C9A-42B8D1FAB9C7@gmail.com>

There is one warning message and one error message.

First this one -->
WARNING: The fasta file contains sequences with names longer
than 78 characters. Long names get trimmed by BLAST, making
it harder to identify the source of an alignmnet. You might
want to reformat the fasta file with shorter IDs.
File_name:/escratch3/ranjani/ranjani_Oct_16/sn1_comparisons/uniprot_top_hits_sn1prot.fa

Also this one --> sh: File name too long

The cause of your failures is the long sequence identifiers in your fasta. There are actually several potential downstream issues as a result, but the most immediate is that some temporary file names are going to be based off the sequence ID, and because your seq IDs are really long, they result in file names that are longer than the longest filename the system allows.

Here is one example of a long ID I see in your STDERR -->
SRCN_4312|tg_ortholog:TGME49_201150|EC:EC:3.6.3.12;EC:3.6.3.4;EC:3.6.3.8|scaff:scaffold00018|start:414787|end:435051|blastp_output:gi:401406237:ref:XP_003882568.1:

You need to reformat the header of each FASTA entry to make it shorter, not just for MAKER but for programs MAKER uses.
Sequence identifiers should be separated from other comment information in the FASTA header by spaces; otherwise, in accordance with FASTA format, the comments become part of the identifier.

Thanks,
Carson

> On Oct 26, 2014, at 2:35 PM, Sivaranjani Namasivayam wrote:
>
> Hi Carson,
>
> Attaching the error file. (The file is too big to be posted to the mailing list). There are a number of errors reported about the fasta file, but I see these errors for the other scaffolds too and I get predictions and they run finished successfully, so I thought these errors might not be the reason..
>
> Thanks,
> Ranjani
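
Carson's advice above (keep the identifier short and separate it from the comment text with a space) could be applied with a small script along these lines. This is only a sketch, not part of MAKER: the function names and the choice of "|" as the cut point are assumptions made for illustration, based on the example ID quoted in this thread. It keeps everything before the first "|" as the identifier, moves the rest into the comment, and stops if two records would end up with the same shortened ID.

```python
from typing import Iterable, Iterator


def shorten_header(header: str, sep: str = "|") -> str:
    """Keep the text before the first `sep` as the ID; move the rest into the comment.

    `sep` is an assumption taken from the example ID in this thread.
    """
    fields = header[1:].strip().split(None, 1)
    if not fields:
        raise ValueError("FASTA header has no identifier")
    full_id = fields[0]
    comment = fields[1] if len(fields) > 1 else ""
    short_id, _, extra = full_id.partition(sep)
    if not short_id:
        raise ValueError(f"identifier starts with {sep!r}: {full_id}")
    # Whatever followed the first separator is kept, but after a space,
    # so that it is read as comment text rather than as part of the ID.
    comment = " ".join(part for part in (extra, comment) if part)
    return f">{short_id} {comment}" if comment else f">{short_id}"


def shorten_fasta(lines: Iterable[str], sep: str = "|") -> Iterator[str]:
    """Yield FASTA lines with shortened headers; fail if two shortened IDs collide."""
    seen = set()
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            line = shorten_header(line, sep)
            seq_id = line[1:].split(None, 1)[0]
            if seq_id in seen:
                raise ValueError(f"shortened ID is not unique: {seq_id}")
            seen.add(seq_id)
        yield line
```

One way to use it is to open the protein FASTA, pass the file object to `shorten_fasta`, and write each yielded line to a new file. The uniqueness check matters because cutting at the first "|" only works when the leading token is already unique per record.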
From ranjani at uga.edu Mon Oct 27 11:49:58 2014
From: ranjani at uga.edu (Sivaranjani Namasivayam)
Date: Mon, 27 Oct 2014 16:49:58 +0000
Subject: [maker-devel] MAKER error Failed while polishing proteins
In-Reply-To: <91055BEB-51AA-4C13-9C9A-42B8D1FAB9C7@gmail.com>
References: <1414091111842.42181@uga.edu> <, <>> <1414355724461.3883@uga.edu>, <91055BEB-51AA-4C13-9C9A-42B8D1FAB9C7@gmail.com>
Message-ID: <1414428598274.96959@uga.edu>

Hi Carson,

The fasta headers were the problem; when I shortened them the run completed successfully. Other scaffolds had similarly long fasta headers, though, and they worked fine.

Thanks!
Ranjani
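
To see which identifiers in a FASTA file are over the 78-character threshold named in the MAKER warning quoted earlier in this thread, and how long the longest ones are, a short report like the following may help. It is a sketch under stated assumptions: the function name is hypothetical, and the identifier is taken to be everything between ">" and the first whitespace, as described above.

```python
from typing import Iterable, List, Tuple

MAX_ID_LEN = 78  # threshold quoted in the MAKER warning in this thread


def long_ids(lines: Iterable[str], limit: int = MAX_ID_LEN) -> List[Tuple[int, str]]:
    """Return (length, ID) pairs for FASTA identifiers longer than `limit`, longest first."""
    hits = []
    for line in lines:
        if line.startswith(">"):
            # The ID is the header text up to the first whitespace.
            fields = line[1:].split(None, 1)
            seq_id = fields[0] if fields else ""
            if len(seq_id) > limit:
                hits.append((len(seq_id), seq_id))
    return sorted(hits, reverse=True)
```

Sorting longest first puts the identifiers most likely to produce over-long temporary file names at the top of the list.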
From carsonhh at gmail.com Mon Oct 27 11:53:39 2014
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 27 Oct 2014 10:53:39 -0600
Subject: [maker-devel] MAKER error Failed while polishing proteins
In-Reply-To: <1414428598274.96959@uga.edu>
References: <1414091111842.42181@uga.edu> <, <>> <1414355724461.3883@uga.edu> <, <91055BEB-51AA-4C13-9C9A-42B8D1FAB9C7@gmail.com> <>> <1414428598274.96959@uga.edu>
Message-ID: <773F8348-EF84-48D8-866A-C98BB3DE106D@gmail.com>

Glad it's working. MAKER does some things to fix issues with long seq IDs in BLAST etc. by assigning temporary IDs and then translating back to the original ID, but because the ones in your file were so long it triggered a different issue with maximum length of file names.

--Carson

From muriel.grosb at gmail.com Tue Oct 28 05:04:26 2014
From: muriel.grosb at gmail.com (Muriel Gros-Balthazard)
Date: Tue, 28 Oct 2014 11:04:26 +0100
Subject: [maker-devel] Training Augustus
Message-ID: <544F6A2A.2050809@gmail.com>

Hello!

I want to train Augustus for a non-model organism and I have several questions about it!

I planned to follow the section "Training ab initio Gene predictors".

So first, I need to generate a gene model using EST data.
However, I was wondering how many sequences are necessary?
Indeed, my genome is 476 Mb and I have millions of RNA-seq reads, but it takes ages if I put all of them!
I tried with 1000 sequences and it takes 30 min, but is that enough? Or should I take more?

Secondly, we then obtain plenty of gff files; should we concatenate them?

And then, what to do? Indeed, the help of maker explains for Snap, but I want to use Augustus.
I found a script called autoAug.pl to train Augustus.
What do you think of it?

Should I use it that way?
|autoAug.pl --singleCPU --useexisting --genome=mygenome.fasta --species=myspeciesname --cdna=EST.fasta --trainingset=genome.gff3| where EST.fasta is the file I used earlier to generate the gene models and genome.gff3 is the result of that gene model run. However, I don't think that I obtained a GFF3 file from the first MAKER run. So should I generate GFF3 from GFF? Thanks a lot for your help, Muriel From jason.stajich at gmail.com Tue Oct 28 12:07:31 2014 From: jason.stajich at gmail.com (Jason Stajich) Date: Tue, 28 Oct 2014 10:07:31 -0700 Subject: [maker-devel] Training Augustus In-Reply-To: <544F6A2A.2050809@gmail.com> References: <544F6A2A.2050809@gmail.com> Message-ID: <577CD3DD-5832-4B49-94A2-EA0ACD65971C@gmail.com> Muriel - It would be best to take the longest models, ones that have an ATG and a STOP. My workflow is to assemble the transcripts with Trinity (genome guided), then align these transcripts to the genome with PASA and take the longest ORFs, using scripts provided with PASA, to generate the best set for gene predictions. > On Oct 28, 2014, at 3:04 AM, Muriel Gros-Balthazard wrote: > > Hello! > > I want to train Augustus for a non-model organism and I have several questions about it! > > I planned to follow the section "Training ab initio Gene predictors". > > So first, I need to generate a gene model using EST data. > However, I was wondering how many sequences are necessary? > Indeed, my genome is 476 Mb and I have millions of RNA-seq sequences, but it takes ages if I use all of them! > I tried with 1000 sequences and it takes 30 min, but is that enough? Or should I take more? > > Secondly, we then obtain plenty of GFF files; should we concatenate them? > > And then, what to do? Indeed, the MAKER help explains this for SNAP, but I want to use Augustus. > I found a script called autoAug.pl to train Augustus. > What do you think of it?
> > Should I use it that way ? > autoAug.pl --singleCPU --useexisting --genome=mygenome.fasta --species=myspeciesname --cdna=EST.fasta --trainingset=genome.gff3 > > > where EST.fasta is the file I used earlier to generate the gene model and genome.gff3 is the result of the gene model. > However, I don't think that I obtained gff3 file from the first maker run. > So should I generate gff3 from gff ??? > > Thanks a lot for your help, > > Muriel > > > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From ranjani at uga.edu Thu Oct 23 16:39:24 2014 From: ranjani at uga.edu (Sivaranjani Namasivayam) Date: Thu, 23 Oct 2014 21:39:24 +0000 Subject: [maker-devel] MAKER error Failed while polishing proteins In-Reply-To: References: <1414091111842.42181@uga.edu>, Message-ID: <1414100367968.23485@uga.edu> Attaching the error file. There are a number of errors reported about the fasta file, but I see these errors for the other scaffolds too, so I thought these errors might not be the reason.. Thanks, Ranjani ________________________________ From: Carson Holt Sent: Thursday, October 23, 2014 3:20 PM To: Sivaranjani Namasivayam Cc: maker-devel at yandell-lab.org Subject: Re: [maker-devel] MAKER error Failed while polishing proteins The causal error will be further up the error log. You may just want to send your capture STDERR file. -Carson On Oct 23, 2014, at 1:05 PM, Sivaranjani Namasivayam > wrote: Hi, I get this error when I run MAKER ERROR: Failed while polishing proteins ERROR: Chunk failed at level:10, tier_type:3 FAILED CONTIG:scaffold00006 ERROR: Chunk failed at level:4, tier_type:0 FAILED CONTIG:scaffold00006 Other scaffolds run fine, but this scaffold keeps failing. Would it mean something is wrong with the proteins in this scaffold? 
Thanks, Ranjani _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.com http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: submpi2.sh.e5771113 Type: application/octet-stream Size: 3129753 bytes Desc: submpi2.sh.e5771113 URL: From ranjani at uga.edu Fri Oct 24 11:38:35 2014 From: ranjani at uga.edu (Sivaranjani Namasivayam) Date: Fri, 24 Oct 2014 16:38:35 +0000 Subject: [maker-devel] MAKER error Failed while polishing proteins In-Reply-To: References: <1414091111842.42181@uga.edu>, Message-ID: <1414168716728.28603@uga.edu> Attaching the error file. There are a number of errors reported about the fasta file, but I see these errors for the other scaffolds too and I get predicitions and the run finished successfully , so I thought these errors might not be the reason.. Thanks, Ranjani ________________________________ From: Carson Holt Sent: Thursday, October 23, 2014 3:20 PM To: Sivaranjani Namasivayam Cc: maker-devel at yandell-lab.org Subject: Re: [maker-devel] MAKER error Failed while polishing proteins The causal error will be further up the error log. You may just want to send your capture STDERR file. -Carson On Oct 23, 2014, at 1:05 PM, Sivaranjani Namasivayam > wrote: Hi, I get this error when I run MAKER ERROR: Failed while polishing proteins ERROR: Chunk failed at level:10, tier_type:3 FAILED CONTIG:scaffold00006 ERROR: Chunk failed at level:4, tier_type:0 FAILED CONTIG:scaffold00006 Other scaffolds run fine, but this scaffold keeps failing. Would it mean something is wrong with the proteins in this scaffold? 
Thanks, Ranjani _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.com http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: submpi2.sh.e5771113.zip Type: application/zip Size: 123514 bytes Desc: submpi2.sh.e5771113.zip URL: From mike.thon at gmail.com Wed Oct 1 07:29:13 2014 From: mike.thon at gmail.com (Michael Thon) Date: Wed, 1 Oct 2014 15:29:13 +0200 Subject: [maker-devel] change log Message-ID: Hi - Is there a change log that will show me what has changed from version 2.31.5 to 2.31.6? Thanks From carson.holt at genetics.utah.edu Wed Oct 1 09:23:15 2014 From: carson.holt at genetics.utah.edu (Carson Holt) Date: Wed, 1 Oct 2014 15:23:15 +0000 Subject: [maker-devel] URGENT: Re: maker failure with example data In-Reply-To: References:

Message-ID: Here is the latest GMOD tutorial (May 2014). http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014 --Carson From: Goutham atla Date: Tuesday, September 30, 2014 at 11:58 PM To: Marc Höppner Cc: Carson Holt, "maker-devel at yandell-lab.org" Subject: Re: [maker-devel] URGENT: Re: maker failure with example data Dear All, Thank you. I figured out that the problem is with mpich2. I kept trying with mpich2 but was unsuccessful. I installed mpich v3 and it's working fine now. Thank you all. The old GMOD tutorials are a bit misleading now that new versions have come out. On Wed, Oct 1, 2014 at 11:09 AM, Marc Höppner wrote: Another possibility could be that MPICH2 wasn't built properly, no? I remember something about enabling shared libraries during the compilation of mpich, without which the error below would appear. /Marc Marc P. Hoeppner, PhD Team Leader BILS Genome Annotation Platform Department for Medical Biochemistry and Microbiology Uppsala University, Sweden marc.hoeppner at imbim.uu.se On 30 Sep 2014, at 21:33, Carson Holt wrote: The message is warning that there are multiple instances of MAKER running, but no MPI communication. When you build MAKER (perl Build.PL step when installing MAKER), you need to specify the location of 'mpicc' and 'mpi.h' to build with MPI support. Otherwise you won't be able to link against MPICH2 shared libraries. You probably need to rerun that step. --Carson From: Goutham atla Date: Tuesday, September 30, 2014 at 10:49 AM To: Carson Holt Cc: "maker-devel at yandell-lab.org" Subject: URGENT: Re: maker failure with example data Hi Carson, I figured out the problem is with the RepeatMasker installation and I fixed it. I am running maker with MPICH2 and I get the following warning when I start it: STATUS: Processing and indexing input FASTA files... WARNING: Multiple MAKER processes have been started in the same directory. I would like to know if this is common.
Regards, Goutham On Tue, Sep 30, 2014 at 12:02 PM, Goutham atla > wrote: Dear Carson, Thank you for the reply. I reinstalled the BioPerl and now I am getting the following error on test data. ERROR: RepeatMasker failed --> rank=NA, hostname=motif ERROR: Failed while doing repeat masking ERROR: Chunk failed at level:0, tier_type:1 FAILED CONTIG:contig-dpp-500-500 On Mon, Sep 29, 2014 at 8:17 PM, Carson Holt > wrote: The error is caused by the BioPerl indexer returning an empty length for the indexed fasta sequence (possibly because of a corrupt index file or other reasons). You may need to reinstall BioPerl (use the CPAN version not the BioPerl-live version), or reinstall Berkley DB (used by the BioPerl indexer), or reinstall the Perl module DB_File via CPAN (Perl's interface to Berkley DB). After reinstalling BioPerl, delete the mpi_blastdb directory for the MAKER run before retrying. Also verify that the /tmp directory on your system or the directory pointed to by TMP= in the maker_opts,ctl file is not full and that TMP= is not set to an NFS mounted location. Thanks, Carson From: Goutham atla > Date: Monday, September 29, 2014 at 6:33 AM To: > Subject: maker failure with example data Dear All, I am running maker with the demo file, i.e dip_contig.fasta by keeping all other parameters in .ctl files as default. But it do not progress and shows the following message that the length of the sequence is 0. Can anybody help me ? --Next Contig-- MAKER WARNING: All old files will be erased before continuing #--------------------------------------------------------------------- Skipping the contig because it is too short!! 
SeqID: contig-dpp-500-500 Length: 0 #--------------------------------------------------------------------- Regards, Goutham -- Goutham Atla -- Goutham Atla _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.com http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -- Goutham Atla -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Wed Oct 1 09:30:30 2014 From: carsonhh at gmail.com (Carson Holt) Date: Wed, 01 Oct 2014 09:30:30 -0600 Subject: [maker-devel] change log In-Reply-To: References: Message-ID: Since 2.31, updates are just bug fixes (no new features are expected until MAKER 3.0). 2.31.6 contains a single change, a fix for a plus strand spliced tRNA bug. Thanks, Carson On 10/1/14, 7:29 AM, "Michael Thon" wrote: >Hi - Is there a change log that will show me what has changed from >version 2.31.5 to 2.31.6? >Thanks > > >_______________________________________________ >maker-devel mailing list >maker-devel at box290.bluehost.com >http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org From carson.holt at genetics.utah.edu Wed Oct 1 10:20:33 2014 From: carson.holt at genetics.utah.edu (Carson Holt) Date: Wed, 1 Oct 2014 16:20:33 +0000 Subject: [maker-devel] diff. numbers of geneson contigs vs. scaffolded genome In-Reply-To: <542BF8EB.7090800@env.ethz.ch> References: <541BCE0A.70806@env.ethz.ch> <7A60AB257EFF2B48B1F4C814817EA0537B651ADF@mxb1.hg.genetics.utah.edu> <5421695F.5040409@env.ethz.ch> <542BF8EB.7090800@env.ethz.ch> Message-ID: >1) created a species specific repeat library, or actually several >versions (e.g., filtered for hits on known plant transposable elements >etc., or filtering out hits on proper plant proteins), and ran Maker >with it on a subset of the genome. Whatever version of repeat library I >use, I get +/- 5% the same number of Maker approved proteins. 
I get >slightly more proteins with the "best" species specific repeat library, >so I think it does make a difference, however not a big one. >Interestingly, if I turn off the repeat masking totally, I get about 20% >more Maker approved protein models. So either I am doing something >totally wrong here or the repeat masking is working quite well with the >specific repeat libraries. You should expect more proteins if you turn all repeat masking off, because transposons encode real proteins and there will be a lot of them. Some plant species, for example, have inflated gene counts because transposons were not properly removed during annotation, and removing these false models is actually a major goal of many reannotation projects. Also, because transposons can occur in the middle of a gene or in an intron, not masking them can cause the predictor to miss the surrounding genes (what you are really interested in) and instead call just a series of transposons. Try using RepeatModeler to build the repeat dataset. The goal is not so much to have only repeats from your species in the dataset as to add any novel repeats that are not in any existing dataset. For example, I normally run with all of RepBase together with the novel repeats identified by RepeatModeler. You want to find everything you can. > >2) filtered the non-overlapping ab-initio proteins with PFAM domains >according to your how-to. This works very nicely, thanks. However, I get >quite a lot of models with PFAM hits, even when stringently filtering >for e-value. For example, in the subset of the contiged genome I usually >get around 300 Maker models. And now I have an additional 180 from the >"non-overlapping-with-PFAM-domain" when filtering for e-value <1e-20. >For e-value < 1e-10 it would be 280, almost twice the number of >proteins. Extrapolating this to the full genome, this would be more than >32'000 proteins.
This seems a bit excessive and I am not sure if I am >even supposed to use such a stringent e-value filtering. One reason of >having so many additional proteins I can think of, is that augustus and >snap are predicting similar non-overlapping models for the same location >and of course they then both have a PFAM domain. I can actually see this >for some locations when I load the data in WebApollo. I can think of a >crude way to select only the "best" model for a location (while >preferably also considering the already Maker approved protein) but I >wonder if maybe there is already a solution for this in Maker? The non-overlapping ab-initio proteins are already non-redundant. They will not overlap each other or any of the genes already called by MAKER. Also make sure you have identified novel repeats for your species or these models will be full of transposons which WILL have PFAM domains. Just reading the names of identified domains lets you know if it's a repeat related protein. Also you must have your gene predictors trained on your species. You cannot use a related species as your model if trying to add genes via PFAM domain content. This is because you will get fragmented gene models from the predictors if you are using a related species, and since there is no overlapping evidence alignment to help correct for this (these are the unsupported models after all), then you will be adding very poor models back in. Thanks, Carson > >In short, I think the repeat masking seems not to be the problem (And I >think I have put quite some effort in the repeat library creation). On >the other hand, there are a lot of "good" models in the non-overlapping >proteins that could be filtered and promoted to proper models, if I only >could make the right selection. 
> >Maybe, based on these additional informations you could point out >additional tests, filtering approaches or analyses I could do to home-in >to the "good" gene models in the non-overlapping gene models (or Maker >approved gene models in general). > >Thanks again for your help! >Stefan > > > >On 25.09.14 20:17, Carson Holt wrote: >> Sorry for the slow reply. I was trying to locate a script that might be >> useful for you. >> >> I think a species specific repeat libary will be of most benefit here >> (it's surprising just how influential this step is). Also note that you >> should train SNAP and Augustus on your species and are not just using >> another related species as a stand in. >> >> With respect to PFAM domains, on some organisms you may not get a lot of >> cross species protein alignments because of divergence or assembly >>issues. >> This of course makes it harder to support these models with direct >>protein >> alignments. However you can run InterProscan over the >> non-overlapping.proteins.fasta file produced by MAKER (contains >> non-redundant rejected models). Because an HMM is used for domain >> identification, it can pick up protein domains that would not produce a >> significant BLAST alignment because of divergence. You can then add >>models >> with positive hits for protein domains back into your gene set. >> >> This ad hoc procedure usually can only increase gene counts by about 10% >> though for organisms where it's required. I've attached a script that >> makes generating results for these genes easier. >> >> 1. First you run InterProScan with just PFAM. >> 2. Then you take the IDs of all models that have a domain in the report >> and create a list (1 ID per line). >> 3. Next use the fasta_tool script that comes with MAKER together with >>the >> --select flag to separate just the positive hits (ID's in your list) >>from >> the non-overlapping.proteins.fasta and >>non-overlapping.transscripts.fasta >> files. >> 4. 
Use the attached script to separate just the positive hits (your ID >> list) from the GFF3. The script will upgrade match/match_part results to >> gene/mRNA/exon/CDS results and print them out for you. >> 5. Use the fasta_maerge and gff3_merge scripts that come with MAKER to >> merge the selected/upgraded GFF3 entries and selected FASTA entries back >> into the original MAKER results. >> >> --Carson >> >> >> >> On 9/23/14, 6:36 AM, "Stefan Zoller" wrote: >> >>> Please forgive my ignorance, I am not entirely sure if I understand >>>your >>> question correctly, but I will try to answer. >>> As evidence we use: >>> 1) our own transcriptome (trinity assembled RNAseq, filtering out the >>> very low expression transcripts). >>> 2) all swissprot plant proteins, and several protein sets from closely >>> related plant species downloaded from NCBI. >>> I am not sure if the ab-initio predictions without evidence have pfamm >>> domains. Honestly, I would not know how to tell and how to >>> include/exclude. >>> I was assuming that we should not have too many Maker approved >>> predictions without evidence anyway, because we use "keeps_preds=0". >>> The numbers of gene predictions I mentioned in my email are the >>> predictions reported by the fasta_merge/gff3_merge scripts in the >>> "*maker.proteins.fasta". There are of course many more predictions in >>> e.g., "*maker.augustus_masked.proteins.fasta" (about 68'000 in this >>>file). >>> >>> I hope I am not totally off with my answer. >>> Cheers, Stefan >>> >>> >>> >>> On 23.09.14 02:10, Mark Yandell wrote: >>>> Also are you numbers including the ab-inito predictions without >>>> evidence that have pfamm domains? >>>> >>>> cheers, >>>> >>>> >>>> --mark >>>> >>>> >>>> >>>> Mark Yandell >>>> Professor of Human Genetics >>>> H.A. 
& Edna Benning Presidential Endowed Chair >>>> Co-director USTAR Center for Genetic Discovery >>>> Eccles Institute of Human Genetics >>>> University of Utah >>>> 15 North 2030 East, Room 2100 >>>> Salt Lake City, UT 84112-5330 >>>> ph:801-587-7707 >>>> >>>> ________________________________________ >>>> From: maker-devel [maker-devel-bounces at yandell-lab.org] on behalf of >>>> Carson Holt [carson.holt at genetics.utah.edu] >>>> Sent: Monday, September 22, 2014 2:17 PM >>>> To: stefan.zoller at env.ethz.ch; maker-devel at yandell-lab.org >>>> Subject: Re: [maker-devel] diff. numbers of geneson contigs vs. >>>> scaffolded genome >>>> >>>> The contiged assembly is more likely to give spurious hits and >>>> alignments. >>>> They also can be harder to repeat mask. Also gene predictors can >>>> behave >>>> slightly different on small sequences than on longer ones. If you >>>>have >>>> fewer gene models than you expect, your first step should be to >>>>process >>>> the scaffolds with CEGMA. It will give you an estimate of the genomes >>>> "completeness". If CEGMA gives a 60% completeness value for example >>>> then >>>> you can expect to only recover 60% of the expected number of genes. >>>>Next >>>> you should run RepeatModeler of similar software to help generate a >>>> species specific repeat library. Under masked repeats can make >>>> predicting >>>> genes on longer scaffolds far more difficult for ab initio predictors. >>>> >>>> --Carson >>>> >>>> >>>> On 9/19/14, 12:32 AM, "Stefan Zoller" >>>>wrote: >>>> >>>>> Hi, >>>>> >>>>> I am working on the annotation of a plant genome (about 600MB) and we >>>>> have a reasonable draft assembly, a fairly good transcriptome and >>>>>quite >>>>> a few proteins from related species. We have also extensively trained >>>>> augustus and are also feeding genmark and snap predictions. >>>>> >>>>> Recently I noticed a behavior of Maker that seems fairly odd and >>>>>which >>>>> I >>>>> cannot explain at all. 
When I take the scaffolded genome (about 23000 >>>>> scaffolds) I get roughly 9'000 maker approved gene models. Which is >>>>> admittedly a bit on the low side and we have to work on this. >>>>>However, >>>>> when I break up the scaffolds into contigs at stretches of N longer >>>>> 500bp (about 60'000 contigs) I get about 17'000 maker gene models. >>>>>Now >>>>> obviously 17'000 is more in the range what I would expect, so I am >>>>> inclined to go with these. I have looked at both annotations and the >>>>> evidence in WebApollo and the evidence alignments are identical for >>>>> both >>>>> runs. The approved genes seem to be the same, except for the >>>>>additional >>>>> ones in the "contiged" genome version. The additional gene models are >>>>> not necessarily at the ends of the contigs, so I think it has nothing >>>>> to >>>>> do with having the stretches of Ns nearby in the scaffolded genome. >>>>>Do >>>>> you have any idea why maker comes up with the additional numbers of >>>>> gene >>>>> models and how I could "convince" maker to give me the same gene >>>>>models >>>>> for the scaffolded assembly? 
>>>>> >>>>> Cheers, >>>>> Stefan >>>>> >>>>> >>>>> >>>>> -- >>>>> Stefan Zoller, PhD >>>>> Bioinformatics >>>>> Genetic Diversity Centre >>>>> ETH Zurich CHN E55.1 >>>>> Universit?tsstrasse 16 >>>>> 8092 Zurich >>>>> Switzerland >>>>> >>>>> Phone: +41 44 632 66 85 >>>>> E-Mail: stefan.zoller at env.ethz.ch >>>>> Web: www.gdc.ethz.ch >>>>> >>>>> >>>> _______________________________________________ >>>> maker-devel mailing list >>>> maker-devel at box290.bluehost.com >>>> >>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > >-- >Stefan Zoller, PhD >Bioinformatics >Genetic Diversity Centre >ETH Zurich CHN E55.1 >Universit?tsstrasse 16 >8092 Zurich >Switzerland > >Phone: +41 44 632 66 85 >E-Mail: stefan.zoller at env.ethz.ch >Web: www.gdc.ethz.ch > From stefan.zoller at env.ethz.ch Wed Oct 1 06:51:55 2014 From: stefan.zoller at env.ethz.ch (Stefan Zoller) Date: Wed, 1 Oct 2014 14:51:55 +0200 Subject: [maker-devel] diff. numbers of geneson contigs vs. scaffolded genome In-Reply-To: References: <541BCE0A.70806@env.ethz.ch> <7A60AB257EFF2B48B1F4C814817EA0537B651ADF@mxb1.hg.genetics.utah.edu> <5421695F.5040409@env.ethz.ch> Message-ID: <542BF8EB.7090800@env.ethz.ch> Hi Carson Thanks again for your help and suggestions. They are very helpful indeed! I have now: 1) created a species specific repeat library, or actually several versions (e.g., filtered for hits on known plant transposable elements etc., or filtering out hits on proper plant proteins), and ran Maker with it on a subset of the genome. Whatever version of repeat library I use, I get +/- 5% the same number of Maker approved proteins. I get slightly more proteins with the "best" species specific repeat library, so I think it does make a difference, however not a big one. Interestingly, if I turn off the repeat masking totally, I get about 20% more Maker approved protein models. 
So either I am doing something totally wrong here or the repeat masking is working quite well with the specific repeat libraries. 2) filtered the non-overlapping ab-initio proteins with PFAM domains according to your how-to. This works very nicely, thanks. However, I get quite a lot of models with PFAM hits, even when stringently filtering for e-value. For example, in the subset of the contiged genome I usually get around 300 Maker models. And now I have an additional 180 from the "non-overlapping-with-PFAM-domain" when filtering for e-value <1e-20. For e-value < 1e-10 it would be 280, almost twice the number of proteins. Extrapolating this to the full genome, this would be more than 32'000 proteins. This seems a bit excessive and I am not sure if I am even supposed to use such a stringent e-value filtering. One reason of having so many additional proteins I can think of, is that augustus and snap are predicting similar non-overlapping models for the same location and of course they then both have a PFAM domain. I can actually see this for some locations when I load the data in WebApollo. I can think of a crude way to select only the "best" model for a location (while preferably also considering the already Maker approved protein) but I wonder if maybe there is already a solution for this in Maker? In short, I think the repeat masking seems not to be the problem (And I think I have put quite some effort in the repeat library creation). On the other hand, there are a lot of "good" models in the non-overlapping proteins that could be filtered and promoted to proper models, if I only could make the right selection. Maybe, based on these additional informations you could point out additional tests, filtering approaches or analyses I could do to home-in to the "good" gene models in the non-overlapping gene models (or Maker approved gene models in general). Thanks again for your help! Stefan On 25.09.14 20:17, Carson Holt wrote: > Sorry for the slow reply. 
> I was trying to locate a script that might be
> useful for you.
>
> I think a species specific repeat library will be of most benefit here
> (it's surprising just how influential this step is). Also note that you
> should train SNAP and Augustus on your species and not just use
> another related species as a stand-in.
>
> With respect to PFAM domains, on some organisms you may not get a lot of
> cross species protein alignments because of divergence or assembly issues.
> This of course makes it harder to support these models with direct protein
> alignments. However, you can run InterProScan over the
> non-overlapping.proteins.fasta file produced by MAKER (contains
> non-redundant rejected models). Because an HMM is used for domain
> identification, it can pick up protein domains that would not produce a
> significant BLAST alignment because of divergence. You can then add models
> with positive hits for protein domains back into your gene set.
>
> For organisms where it's required, this ad hoc procedure can usually only
> increase gene counts by about 10%. I've attached a script that
> makes generating results for these genes easier.
>
> 1. First you run InterProScan with just PFAM.
> 2. Then you take the IDs of all models that have a domain in the report
>    and create a list (1 ID per line).
> 3. Next use the fasta_tool script that comes with MAKER together with the
>    --select flag to separate just the positive hits (IDs in your list) from
>    the non-overlapping.proteins.fasta and non-overlapping.transcripts.fasta
>    files.
> 4. Use the attached script to separate just the positive hits (your ID
>    list) from the GFF3. The script will upgrade match/match_part results to
>    gene/mRNA/exon/CDS results and print them out for you.
> 5. Use the fasta_merge and gff3_merge scripts that come with MAKER to
>    merge the selected/upgraded GFF3 entries and selected FASTA entries back
>    into the original MAKER results.
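Steps 1 and 2 of the list above could be sketched roughly as follows. This is only an illustration and not the script attached to the original message. It assumes InterProScan was run with tab-separated (TSV) output, where column 1 is the sequence ID, column 4 names the analysis (e.g. "Pfam") and column 9 holds the score/e-value. The function name pfam_hit_ids and the file name interproscan.tsv are made up for the example, and the 1e-10 default is simply one of the cutoffs discussed in this thread.

```python
def pfam_hit_ids(tsv_lines, max_evalue=1e-10):
    """Return the sorted IDs of sequences with at least one Pfam hit
    whose e-value is at or below max_evalue.

    tsv_lines: any iterable of InterProScan TSV lines (assumed layout:
    column 1 = sequence ID, column 4 = analysis, column 9 = e-value).
    """
    ids = set()
    for line in tsv_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip short/malformed rows and hits from analyses other than Pfam.
        if len(fields) < 9 or fields[3] != "Pfam":
            continue
        try:
            evalue = float(fields[8])
        except ValueError:
            # The score column can hold a placeholder such as "-".
            continue
        if evalue <= max_evalue:
            ids.add(fields[0])
    return sorted(ids)


if __name__ == "__main__":
    # Hypothetical input file name; writes one ID per line to stdout.
    with open("interproscan.tsv") as handle:
        for seq_id in pfam_hit_ids(handle):
            print(seq_id)
```

The output (one ID per line) is the kind of list that step 3 then hands to fasta_tool with the --select flag. A stricter max_evalue gives fewer added models, as in the 1e-20 versus 1e-10 comparison discussed in this thread.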
> > --Carson > > > > On 9/23/14, 6:36 AM, "Stefan Zoller" wrote: > >> Please forgive my ignorance, I am not entirely sure if I understand your >> question correctly, but I will try to answer. >> As evidence we use: >> 1) our own transcriptome (trinity assembled RNAseq, filtering out the >> very low expression transcripts). >> 2) all swissprot plant proteins, and several protein sets from closely >> related plant species downloaded from NCBI. >> I am not sure if the ab-initio predictions without evidence have pfamm >> domains. Honestly, I would not know how to tell and how to >> include/exclude. >> I was assuming that we should not have too many Maker approved >> predictions without evidence anyway, because we use "keeps_preds=0". >> The numbers of gene predictions I mentioned in my email are the >> predictions reported by the fasta_merge/gff3_merge scripts in the >> "*maker.proteins.fasta". There are of course many more predictions in >> e.g., "*maker.augustus_masked.proteins.fasta" (about 68'000 in this file). >> >> I hope I am not totally off with my answer. >> Cheers, Stefan >> >> >> >> On 23.09.14 02:10, Mark Yandell wrote: >>> Also are you numbers including the ab-inito predictions without >>> evidence that have pfamm domains? >>> >>> cheers, >>> >>> >>> --mark >>> >>> >>> >>> Mark Yandell >>> Professor of Human Genetics >>> H.A. & Edna Benning Presidential Endowed Chair >>> Co-director USTAR Center for Genetic Discovery >>> Eccles Institute of Human Genetics >>> University of Utah >>> 15 North 2030 East, Room 2100 >>> Salt Lake City, UT 84112-5330 >>> ph:801-587-7707 >>> >>> ________________________________________ >>> From: maker-devel [maker-devel-bounces at yandell-lab.org] on behalf of >>> Carson Holt [carson.holt at genetics.utah.edu] >>> Sent: Monday, September 22, 2014 2:17 PM >>> To: stefan.zoller at env.ethz.ch; maker-devel at yandell-lab.org >>> Subject: Re: [maker-devel] diff. numbers of geneson contigs vs. 
>>> scaffolded genome >>> >>> The contiged assembly is more likely to give spurious hits and >>> alignments. >>> They also can be harder to repeat mask. Also gene predictors can >>> behave >>> slightly different on small sequences than on longer ones. If you have >>> fewer gene models than you expect, your first step should be to process >>> the scaffolds with CEGMA. It will give you an estimate of the genomes >>> "completeness". If CEGMA gives a 60% completeness value for example >>> then >>> you can expect to only recover 60% of the expected number of genes. Next >>> you should run RepeatModeler of similar software to help generate a >>> species specific repeat library. Under masked repeats can make >>> predicting >>> genes on longer scaffolds far more difficult for ab initio predictors. >>> >>> --Carson >>> >>> >>> On 9/19/14, 12:32 AM, "Stefan Zoller" wrote: >>> >>>> Hi, >>>> >>>> I am working on the annotation of a plant genome (about 600MB) and we >>>> have a reasonable draft assembly, a fairly good transcriptome and quite >>>> a few proteins from related species. We have also extensively trained >>>> augustus and are also feeding genmark and snap predictions. >>>> >>>> Recently I noticed a behavior of Maker that seems fairly odd and which >>>> I >>>> cannot explain at all. When I take the scaffolded genome (about 23000 >>>> scaffolds) I get roughly 9'000 maker approved gene models. Which is >>>> admittedly a bit on the low side and we have to work on this. However, >>>> when I break up the scaffolds into contigs at stretches of N longer >>>> 500bp (about 60'000 contigs) I get about 17'000 maker gene models. Now >>>> obviously 17'000 is more in the range what I would expect, so I am >>>> inclined to go with these. I have looked at both annotations and the >>>> evidence in WebApollo and the evidence alignments are identical for >>>> both >>>> runs. The approved genes seem to be the same, except for the additional >>>> ones in the "contiged" genome version. 
The additional gene models are >>>> not necessarily at the ends of the contigs, so I think it has nothing >>>> to >>>> do with having the stretches of Ns nearby in the scaffolded genome. Do >>>> you have any idea why maker comes up with the additional numbers of >>>> gene >>>> models and how I could "convince" maker to give me the same gene models >>>> for the scaffolded assembly? >>>> >>>> Cheers, >>>> Stefan >>>> >>>> >>>> >>>> -- >>>> Stefan Zoller, PhD >>>> Bioinformatics >>>> Genetic Diversity Centre >>>> ETH Zurich CHN E55.1 >>>> Universit?tsstrasse 16 >>>> 8092 Zurich >>>> Switzerland >>>> >>>> Phone: +41 44 632 66 85 >>>> E-Mail: stefan.zoller at env.ethz.ch >>>> Web: www.gdc.ethz.ch >>>> >>>> >>> _______________________________________________ >>> maker-devel mailing list >>> maker-devel at box290.bluehost.com >>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -- Stefan Zoller, PhD Bioinformatics Genetic Diversity Centre ETH Zurich CHN E55.1 Universit?tsstrasse 16 8092 Zurich Switzerland Phone: +41 44 632 66 85 E-Mail: stefan.zoller at env.ethz.ch Web: www.gdc.ethz.ch From carson.holt at genetics.utah.edu Wed Oct 1 13:18:43 2014 From: carson.holt at genetics.utah.edu (Carson Holt) Date: Wed, 1 Oct 2014 19:18:43 +0000 Subject: [maker-devel] diff. numbers of geneson contigs vs. scaffolded genome In-Reply-To: <542C40D1.3070300@env.ethz.ch> References: <541BCE0A.70806@env.ethz.ch> <7A60AB257EFF2B48B1F4C814817EA0537B651ADF@mxb1.hg.genetics.utah.edu> <5421695F.5040409@env.ethz.ch> <542BF8EB.7090800@env.ethz.ch>