From arnstrm at gmail.com Wed Jul 1 09:42:53 2015
From: arnstrm at gmail.com (Arun Seetharam)
Date: Wed, 1 Jul 2015 09:42:53 -0500
Subject: [maker-devel] gff3_merge error: FINISHED.gff does not exist
Message-ID:

Hi all,

I am unable to generate the GFF file from the MAKER output. I originally ran MAKER with several processors (using MPI) as follows:

/usr/lib64/mpich/bin/mpiexec -machinefile $PBS_NODEFILE -n 160 maker -base maker_R1 -fix_nucleotides

Once it completed, I tried to create the GFF file with the gff3_merge script:

$ gff3_merge -d maker_R1.maker.output/maker_R1_master_datastore_index.log
ERROR: The file 'maker_R1.maker.output/./FINISHED/FINISHED.gff' does not exist

I searched online and found this thread, which said to rebuild the index if you ran MPI-enabled MAKER, so I ran:

$ maker -base maker_R1 -dsindex
STATUS: Parsing control files...
STATUS: Processing and indexing input FASTA files...
STATUS: Setting up database for any GFF3 input...
A data structure will be created for you at:
/data003/TIL/maker_R1.maker.output/maker_R1_datastore

To access files for individual sequences use the datastore index:
/data003/TIL/maker_R1.maker.output/maker_R1_master_datastore_index.log

When I ran gff3_merge again:

$ gff3_merge -d maker_R1.maker.output/maker_R1_master_datastore_index.log
ERROR: The file 'maker_R1.maker.output/./FINISHED/FINISHED.gff' does not exist

I still get the same error. Please tell me what I am doing wrong. My worst fear is that if I have to re-run it, it will take another week or so. Both my genomes have the same issue.

Any help will be greatly appreciated.

Thanks,

--
Arun Seetharam

From carsonhh at gmail.com Wed Jul 1 12:11:44 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 1 Jul 2015 11:11:44 -0600
Subject: [maker-devel] gff3_merge error: FINISHED.gff does not exist
In-Reply-To:
References:
Message-ID:

Your datastore log is munged.
This sometimes happens with an I/O collision. Delete it and run maker on a single CPU using the -dsindex option. All it will do is rebuild the datastore log, which takes less than 5 minutes.

--Carson

From carsonhh at gmail.com Wed Jul 1 12:16:15 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 1 Jul 2015 11:16:15 -0600
Subject: [maker-devel] Best way to assemble RNA-seq for MAKER
In-Reply-To:
References:
Message-ID:

I would pool them. You will get better coverage of low-expression transcripts. While there may be differently spliced transcripts among the tissues, MAKER and all gene prediction programs used by MAKER are, by default, not going to try to work out alternative splicing anyway. You can tell it to (the altsplice= option), but your EST evidence has to be near perfect end-to-end for that to work.

--Carson

> On Jun 29, 2015, at 4:45 PM, John Cornelius wrote:
>
> Hi, I have a quick question. I have RNA-seq from several different tissue types and I was wondering, would it be better to pool them and assemble them as one large transcriptome? Or should I assemble each tissue separately and then use MAKER to integrate the smaller assemblies into the annotation? Thanks.
>
> --
> John Cornelius
> MCB PhD Candidate
> Arizona State University
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From eric.ganko at syngenta.com Mon Jul 6 10:37:28 2015
From: eric.ganko at syngenta.com (Ganko Eric USRE)
Date: Mon, 6 Jul 2015 15:37:28 +0000
Subject: [maker-devel] MAKER processing time in a 2Gb genome
Message-ID:

I'm hoping for some advice on an unexpectedly long processing time for a 2Gb genome.
Currently I'm using an install of MAKER-P on the iForge system @ NCSA and I've successfully run ~1Gb genomes in 2-3 hours across 20 nodes (24 Intel "Haswell" cores, 64 GB of RAM per node) via MPICH.

I recently ran some tests on 50Mb of corn that took ~2 hours on 2 nodes (48 cores). Based on that, I was surprised when the full 2Gb corn genome run timed out at >24h with 30 nodes (720 cores); in that time it hadn't processed many sequences, based on the master_datastore_index.log:

TOTAL: 25000 seqs
STARTED: 3594
FINISHED: 2979
FAILED: 10
RETRY: 9
DIED_SKIPPED_PERMANENT: 0
SKIPPED_SMALL: 7635

While I can set a longer wall clock, this is several times longer than what was reported in the MAKER-P paper, i.e. running the corn B73 genome in less than 4 hours; here it is not close to done after 24h. I don't have an enormous amount of supporting data; this trial run has ~100k transcripts and another ~100k proteins. Corn has a very high repeat content, so my suspicion is RepeatMasker I/O.

In discussions with the iForge admins I have discovered that the temp space is network attached (GPFS), and they've suggested using a RAM disk (i.e. /dev/shm) as the temp directory. In tests on smaller sequences that ran a little slower, so I'm not sure if MAKER is meant to run that way. I'd appreciate input on experience with a RAM disk approach, or any alternative thoughts or suggestions.

Thanks,
Eric

________________________________
This message may contain confidential information. If you are not the designated recipient, please notify the sender immediately, and delete the original and any copies. Any use of the message by you is prohibited.
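
A status summary like the one in Eric's message can be produced from the datastore index with a few lines of Python. This is only a sketch: it assumes each log line has three tab-separated fields (sequence name, datastore path, status), and the sample sequence names and paths are made up for illustration.

```python
from collections import Counter


def tally_datastore_index(lines):
    """Count status entries and list sequences whose last entry is STARTED.

    Assumes each line is: <sequence name> TAB <datastore path> TAB <status>.
    """
    status_counts = Counter()
    last_status = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        seq_id, _path, status = fields
        status_counts[status] += 1
        last_status[seq_id] = status  # later entries overwrite earlier ones
    unfinished = sorted(s for s, st in last_status.items() if st == "STARTED")
    return status_counts, unfinished


# Hypothetical sample entries, not taken from a real run.
sample = [
    "chr1\tdemo_datastore/chr1/\tSTARTED\n",
    "chr1\tdemo_datastore/chr1/\tFINISHED\n",
    "chr2\tdemo_datastore/chr2/\tSTARTED\n",
    "tiny1\tdemo_datastore/tiny1/\tSKIPPED_SMALL\n",
]
counts, unfinished = tally_datastore_index(sample)
print(dict(counts))
print(unfinished)
```

For a real run, the same function could be fed an open file handle on the master_datastore_index.log instead of the sample list.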

From jcornel3 at asu.edu Tue Jul 7 22:59:47 2015
From: jcornel3 at asu.edu (John Cornelius)
Date: Tue, 7 Jul 2015 20:59:47 -0700
Subject: [maker-devel] Ability to process transcriptomes from different assemblers
Message-ID:

Hello, I have several tissues that I have assembled with two different transcriptome assemblers and I was wondering, should I feed these different assemblies directly into MAKER? Or should I use something like EVidenceModeler to combine them first? Thanks.

--
John Cornelius
MCB PhD Candidate
Arizona State University

From carsonhh at gmail.com Thu Jul 9 13:55:38 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 9 Jul 2015 12:55:38 -0600
Subject: [maker-devel] Ability to process transcriptomes from different assemblers
In-Reply-To:
References:
Message-ID: <4A01AD3F-2CD6-4252-9A72-A3FA1E835CFB@gmail.com>

You should be able to just supply both.

--Carson

From carsonhh at gmail.com Thu Jul 9 14:01:13 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 9 Jul 2015 13:01:13 -0600
Subject: [maker-devel] MAKER processing time in a 2Gb genome
In-Reply-To:
References:
Message-ID: <8D11F342-8BD5-406E-B047-8CB0F2121CD3@gmail.com>

Runtimes are the result of gene density, evidence dataset size, and evidence dataset type.
For example, protein data takes ~10 times longer to process than EST data, and alt-EST data takes ~10 times longer than protein data. If you double the size of the input datasets, you double the runtime. Also, the assembly size doesn't seem to have a large effect on runtime. It tends to be gene density that has the most effect, so a 2Gb assembly runs only somewhat slower than a 300Mb assembly containing the same number of genes.

For best MPI performance, you can submit multiple jobs with 200 CPUs or fewer. Over 200 CPUs per job tends to give limited throughput increases due to MPI communication overhead.

I never use a RAM disk. In general MAKER produces too many temporary files to fit in RAM.

--Carson

From eric.ganko at syngenta.com Thu Jul 9 15:36:59 2015
From: eric.ganko at syngenta.com (Ganko Eric USRE)
Date: Thu, 9 Jul 2015 20:36:59 +0000
Subject: [maker-devel] MAKER processing time in a 2Gb genome
In-Reply-To: <8D11F342-8BD5-406E-B047-8CB0F2121CD3@gmail.com>
References: <8D11F342-8BD5-406E-B047-8CB0F2121CD3@gmail.com>
Message-ID:

On Tuesday I ran the same option files, this time with 480 cores, and the annotation completed in ~6 hours. Perhaps I'm attempting too many simultaneous writes at the higher core counts, or there is too much MPI communication, as you mentioned?

Thanks for the input on the RAM disk.

-eric

From panos.ioannidis at gmail.com Fri Jul 10 04:50:56 2015
From: panos.ioannidis at gmail.com (Panos Ioannidis)
Date: Fri, 10 Jul 2015 11:50:56 +0200
Subject: [maker-devel] Repeats...
Message-ID:

Hi guys,

I have finished running MAKER on my genome, but I get >800 genes (out of ~20,000) that have similarity to transposases. Apart from RepBase, I have also built a species-specific repeat library, so it's weird that I still have quite a few transposases in my gene set...

The repeat masking-related parameters in my maker_opts.ctl file are:

model_org=all #select a model organism for RepBase masking in RepeatMasker
rmlib=consensi.fa.classified #provide an organism specific repeat library in fasta format for RepeatMasker
repeat_protein=/Home/pioannid/Programs/maker/data/te_proteins.fasta #provide a fasta file of transposable element proteins for RepeatRunner
rm_gff= #pre-identified repeat elements from an external GFF3 file
prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)

Does anyone have an idea why I'm getting so many transposases?

Thanks,
Panos

From panos.ioannidis at gmail.com Fri Jul 10 06:45:49 2015
From: panos.ioannidis at gmail.com (Panos Ioannidis)
Date: Fri, 10 Jul 2015 13:45:49 +0200
Subject: [maker-devel] Repeats...
In-Reply-To:
References:
Message-ID:

An additional question related to the previous one.
I searched my species-specific repeat library with InterProScan and can't find a single sequence with similarity to a transposable element...

I would expect it to find at least a few transposases. Is there an explanation for this, or has something gone wrong?

Thanks,
P

From panos.ioannidis at gmail.com Mon Jul 13 04:45:08 2015
From: panos.ioannidis at gmail.com (Panos Ioannidis)
Date: Mon, 13 Jul 2015 11:45:08 +0200
Subject: [maker-devel] Repeats...
In-Reply-To:
References:
Message-ID:

Hi Daniel,

Thanks for the reply.

No, I'm not using GeneMark. I didn't check for overlap with RepeatMasker elements or transcript/protein evidence, though. But since it is such an unexpected finding, I decided to do something simpler.
So I took all 750 transposases with the same InterPro annotation (IS4 family transposases) and clustered them with CD-HIT (amino acid sequences). At a 90% similarity threshold each transposase goes into its own cluster. At 80% I get 748 clusters...

This means that even though these transposases belong to the same family, they have diverged quite a bit, so that they're no longer considered "repeat elements". And this explains why they were not filtered out by RepeatMasker and made it into the final gene set.

On Fri, Jul 10, 2015 at 5:00 PM, Daniel Ence wrote:

> Hi Panos, Without knowing how you made the species-specific repeat library, I can't speak to why it's giving hits against RepBase. As to the 800 transposases, are they overlapped by RepeatMasker elements? Are they supported by EST or protein evidence? Are you using GeneMark? That ab initio predictor runs on the unmasked genome sequence, so if the transposases are present in your evidence set, they could still show up as gene models.
>
> ~Daniel

From sjackman at gmail.com Wed Jul 15 19:36:22 2015
From: sjackman at gmail.com (Shaun Jackman)
Date: Wed, 15 Jul 2015 17:36:22 -0700
Subject: [maker-devel] Short introns
Message-ID:

Hi, Carson.

I'm using protein evidence and protein2genome alone, without ab initio gene finders, to annotate an organellar genome. MAKER annotates 16 introns. 6 introns look real (according to RNAweasel) and are all larger than 900 bp. The other 10 introns are all shorter than 250 bp and multiples of 3 bp. These short introns look like genomic insertions rather than introns to me. Is there a way to specify a minimum intron size to MAKER?

6 of these short introns do not contain a stop codon, and 4 do contain a stop codon. I suppose these 4 are pseudogenes.

Thanks,
Shaun

--
http://sjackman.ca/
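
Whether a short sequence "contains a stop codon" depends on the reading frame used, so one way to check a claim like the one above is to scan all three forward frames. The sketch below assumes the standard stop codons (TAA, TAG, TGA), which may not hold for every organellar genetic code, and uses a made-up 12 bp sequence rather than any intron from this thread.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def stop_codons_by_frame(seq):
    """Return {frame: [0-based start positions of stop codons]} for frames 0, 1, 2."""
    seq = seq.upper()
    result = {}
    for frame in range(3):
        result[frame] = [
            i
            for i in range(frame, len(seq) - 2, 3)
            if seq[i:i + 3] in STOP_CODONS
        ]
    return result


# Hypothetical toy sequence for illustration only.
toy_intron = "GCTTAAGCCGGA"
hits = stop_codons_by_frame(toy_intron)
print(len(toy_intron) % 3 == 0)  # is the length a multiple of 3?
print(hits)
```

In this toy case only frame 0 contains a stop codon, which shows how translating in the wrong frame can change the answer.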

From carsonhh at gmail.com Thu Jul 16 12:55:27 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 16 Jul 2015 11:55:27 -0600
Subject: [maker-devel] Short introns
In-Reply-To:
References:
Message-ID: <8B3943E4-1D25-4C54-8A35-E64AC4A93689@gmail.com>

Look at the region. If it's being suggested by the polished alignment, I somewhat doubt it's just an insertion, because the polished alignment will have valid splice sites. It could be an insertion, but one that perfectly maps around canonical splice sites would be quite the coincidence (because exonerate shouldn't make big gaps to force the alignment). However, if it looks more like a forced mapping around non-canonical splice sites (which shouldn't actually produce protein2genome results), then I might support the idea that it's an insertion. A 250 bp intron or even a 100 bp one doesn't really seem that short to me. The lower range seen in fungi, for example (which have very short introns), can get close to about 20 bp. I guess it's possible that the protein evidence you are using contains an intron that isn't really there, which results in an intron in your job because of protein conservation (i.e. conserved codons contain the falsely used splice site).

The minimum intron given to exonerate for polishing is 20. It's hard coded, and you would have to manually edit it.

Line 1534 in maker/lib/GI.pm --> my $min_intron = 20;

--Carson

From sjackman at gmail.com Thu Jul 16 14:11:32 2015
From: sjackman at gmail.com (Shaun Jackman)
Date: Thu, 16 Jul 2015 12:11:32 -0700
Subject: [maker-devel] Short introns
In-Reply-To: <8B3943E4-1D25-4C54-8A35-E64AC4A93689@gmail.com>
References: <8B3943E4-1D25-4C54-8A35-E64AC4A93689@gmail.com>
Message-ID:

Hi, Carson.

One of the ten questionable introns has a canonical GT-AG splice site and is 33 bp. The splice sites are GA-AG, GG-GG, GC-AG, GT-CG, GA-AT, GG-AA, GG-AG, AT-TT, GG-AT and GT-AG. The intron sizes are 33, 111, 84, 30, 219, 186, 51, 30, 45 and 33.

I was wrong about there being stop codons in the questionable introns. All ten are completely free of stop codons. Sorry for the confusion. I had extracted just the intron sequence and translated the first frame, but the intron was not aligned to a 3-nucleotide boundary.

I am convinced that these short introns are in fact genomic insertions and not introns. The root cause may be incorrectly annotated introns in the protein evidence, as you suggest.

I'll try tweaking the $min_intron parameter. Could this parameter be added to the maker_opts.ctl configuration file?

Thanks for your help, Carson. Cheers,
Shaun

--
http://sjackman.ca/
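
The splice-site pairs and intron sizes listed in Shaun's 16 July message can be tabulated mechanically. The sketch below copies the two lists from that message and assumes they are given in the same order (the message does not say so explicitly); it treats only GT-AG as canonical, as the message does.

```python
# Values copied from the message; pairing by position is an assumption.
splice_sites = ["GA-AG", "GG-GG", "GC-AG", "GT-CG", "GA-AT",
                "GG-AA", "GG-AG", "AT-TT", "GG-AT", "GT-AG"]
intron_sizes = [33, 111, 84, 30, 219, 186, 51, 30, 45, 33]


def summarize_introns(sites, sizes, canonical="GT-AG"):
    """Pair each splice site with its size and flag canonical / in-frame introns."""
    rows = []
    for site, size in zip(sites, sizes):
        rows.append({
            "site": site,
            "size": size,
            "canonical": site == canonical,
            "multiple_of_3": size % 3 == 0,
        })
    return rows


rows = summarize_introns(splice_sites, intron_sizes)
n_canonical = sum(r["canonical"] for r in rows)
n_in_frame = sum(r["multiple_of_3"] for r in rows)
print(n_canonical, n_in_frame)
```

Under that pairing assumption, one intron is GT-AG and all ten sizes are multiples of 3, which matches what the message reports.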

From carsonhh at gmail.com Thu Jul 16 15:10:09 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 16 Jul 2015 14:10:09 -0600
Subject: [maker-devel] Short introns
In-Reply-To:
References: <8B3943E4-1D25-4C54-8A35-E64AC4A93689@gmail.com>
Message-ID: <6B98C302-063B-4BAF-8F6E-2EDE9D8CBDA3@gmail.com>

I can add it to the development version.

--Carson

From sjackman at gmail.com Thu Jul 16 18:25:27 2015
From: sjackman at gmail.com (Shaun Jackman)
Date: Thu, 16 Jul 2015 16:25:27 -0700
Subject: [maker-devel] MAKER and RepeatModeler
Message-ID:

Hi, Carson.

I removed two small contaminant contigs (~7 kbp) from the assembly (~6 Mbp), and MAKER found four fewer genes, four copies of the same atp8 gene, but these genes were not in the contaminant contigs.

I figured out that it's because I'm running RepeatModeler to create the rmlib for MAKER. When I remove the contaminant contigs, RepeatModeler now identifies this gene atp8 as being a LTR/Gypsy repeat.

Any thoughts on why removing two contigs would cause RepeatModeler to identify new repeats?

Cheers,
Shaun
--
http://sjackman.ca/
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From carsonhh at gmail.com Fri Jul 17 11:40:53 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 17 Jul 2015 10:40:53 -0600
Subject: [maker-devel] MAKER and RepeatModeler
In-Reply-To:
References:
Message-ID:

That is weird.

One thought though. When you run MAKER do you supply both rmlib and model_org or just rmlib? If you are only supplying rmlib, you could try supplying both together (RepeatMasker will then run twice). That way some of the edge cases might better be identified.

--Carson

> On Jul 16, 2015, at 5:25 PM, Shaun Jackman wrote:
>
> Hi, Carson.
>
> I removed two small contaminant contigs (~7 kbp) from the assembly (~6 Mbp), and MAKER found four fewer genes, four copies of the same atp8 gene, but these genes were not in the contaminant contigs.
>
> I figured out that it's because I'm running RepeatModeler to create the rmlib for MAKER. When I remove the contaminant contigs, RepeatModeler now identifies this gene atp8 as being a LTR/Gypsy repeat.
>
> Any thoughts on why removing two contigs would cause RepeatModeler to identify new repeats?
>
> Cheers,
> Shaun
>
> --
> http://sjackman.ca/
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From sjackman at gmail.com Fri Jul 17 12:20:54 2015
From: sjackman at gmail.com (Shaun Jackman)
Date: Fri, 17 Jul 2015 10:20:54 -0700
Subject: [maker-devel] MAKER and RepeatModeler
In-Reply-To:
References:
Message-ID:

Hi, Carson.

I set model_org=picea. I see that it created a new database in the RepeatModeler folder Libraries/20140131/picea/specieslib. What is the effect of the model_org option? Does it extract sequences from RepBase that match the string picea?

Cheers,
Shaun
--
http://sjackman.ca/

On 2015-July-17 at 9:40:58, Carson Holt (carsonhh at gmail.com) wrote:

That is weird.

One thought though. When you run MAKER do you supply both rmlib and model_org or just rmlib? If you are only supplying rmlib, you could try supplying both together (RepeatMasker will then run twice). That way some of the edge cases might better be identified.

--Carson

On Jul 16, 2015, at 5:25 PM, Shaun Jackman wrote:

Hi, Carson.

I removed two small contaminant contigs (~7 kbp) from the assembly (~6 Mbp), and MAKER found four fewer genes, four copies of the same atp8 gene, but these genes were not in the contaminant contigs.

I figured out that it's because I'm running RepeatModeler to create the rmlib for MAKER. When I remove the contaminant contigs, RepeatModeler now identifies this gene atp8 as being a LTR/Gypsy repeat.

Any thoughts on why removing two contigs would cause RepeatModeler to identify new repeats?

Cheers,
Shaun
--
http://sjackman.ca/
_______________________________________________
maker-devel mailing list
maker-devel at box290.bluehost.com
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From carsonhh at gmail.com Fri Jul 17 12:24:20 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 17 Jul 2015 11:24:20 -0600
Subject: [maker-devel] MAKER and RepeatModeler
In-Reply-To:
References:
Message-ID:

Yes. It takes a subset of RepBase. If runtime isn't an issue and you really want to mask as much as possible, you can also set model_org=all. Most of whatever else is in RepBase probably won't align anywhere, but it may give you marginally better sensitivity.

--Carson

> On Jul 17, 2015, at 11:20 AM, Shaun Jackman wrote:
>
> Hi, Carson.
>
> I set model_org=picea. I see that it created a new database in the RepeatModeler folder Libraries/20140131/picea/specieslib. What is the effect of the model_org option? Does it extract sequences from RepBase that match the string picea?
>
> Cheers,
> Shaun
>
> --
> http://sjackman.ca/
> On 2015-July-17 at 9:40:58, Carson Holt (carsonhh at gmail.com) wrote:
>
>> That is weird.
>>
>> One thought though. When you run MAKER do you supply both rmlib and model_org or just rmlib? If you are only supplying rmlib, you could try supplying both together (RepeatMasker will then run twice). That way some of the edge cases might better be identified.
>>
>> --Carson
>>
>>> On Jul 16, 2015, at 5:25 PM, Shaun Jackman wrote:
>>>
>>> Hi, Carson.
>>>
>>> I removed two small contaminant contigs (~7 kbp) from the assembly (~6 Mbp), and MAKER found four fewer genes, four copies of the same atp8 gene, but these genes were not in the contaminant contigs.
>>>
>>> I figured out that it's because I'm running RepeatModeler to create the rmlib for MAKER.
>>> When I remove the contaminant contigs, RepeatModeler now identifies this gene atp8 as being a LTR/Gypsy repeat.
>>>
>>> Any thoughts on why removing two contigs would cause RepeatModeler to identify new repeats?
>>>
>>> Cheers,
>>> Shaun
>>>
>>> --
>>> http://sjackman.ca/
>>>
>>> _______________________________________________
>>> maker-devel mailing list
>>> maker-devel at box290.bluehost.com
>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From carsonhh at gmail.com Fri Jul 17 12:36:46 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 17 Jul 2015 11:36:46 -0600
Subject: [maker-devel] MAKER and RepeatModeler
In-Reply-To:
References:
Message-ID:

The subset is actually built from a taxonomy. So you can extract all repeats for a species or genus for example. If a term doesn't match the internal taxonomy, it throws an error.

--Carson

> On Jul 17, 2015, at 11:24 AM, Carson Holt wrote:
>
> Yes. It takes a subset of RepBase. If runtime isn't an issue and you really want to mask as much as possible, you can also set model_org=all. Most of whatever else is in RepBase probably won't align anywhere, but it may give you marginally better sensitivity.
>
> --Carson
>
>> On Jul 17, 2015, at 11:20 AM, Shaun Jackman wrote:
>>
>> Hi, Carson.
>>
>> I set model_org=picea. I see that it created a new database in the RepeatModeler folder Libraries/20140131/picea/specieslib. What is the effect of the model_org option? Does it extract sequences from RepBase that match the string picea?
>>
>> Cheers,
>> Shaun
>>
>> --
>> http://sjackman.ca/
>> On 2015-July-17 at 9:40:58, Carson Holt (carsonhh at gmail.com) wrote:
>>
>>> That is weird.
>>>
>>> One thought though. When you run MAKER do you supply both rmlib and model_org or just rmlib?
>>> If you are only supplying rmlib, you could try supplying both together (RepeatMasker will then run twice). That way some of the edge cases might better be identified.
>>>
>>> --Carson
>>>
>>> On Jul 16, 2015, at 5:25 PM, Shaun Jackman wrote:
>>>
>>> Hi, Carson.
>>>
>>> I removed two small contaminant contigs (~7 kbp) from the assembly (~6 Mbp), and MAKER found four fewer genes, four copies of the same atp8 gene, but these genes were not in the contaminant contigs. I figured out that it's because I'm running RepeatModeler to create the rmlib for MAKER. When I remove the contaminant contigs, RepeatModeler now identifies this gene atp8 as being a LTR/Gypsy repeat.
>>> Any thoughts on why removing two contigs would cause RepeatModeler to identify new repeats?
>>>
>>> Cheers,
>>> Shaun
>>>
>>> --
>>> http://sjackman.ca/
>>>
>>> _______________________________________________
>>> maker-devel mailing list
>>> maker-devel at box290.bluehost.com
>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From sjackman at gmail.com Fri Jul 17 15:29:49 2015
From: sjackman at gmail.com (Shaun Jackman)
Date: Fri, 17 Jul 2015 13:29:49 -0700
Subject: [maker-devel] MAKER and RepeatModeler
In-Reply-To:
References:
Message-ID:

Hi, Carson.

It seems that RepeatModeler is not deterministic. I ran it five times on the same sequence and got very different outputs. Two of these five runs mark atp8 as a repeat, which is why I have genes blinking in and out of existence.

How do folk deal with this situation? It seems absurd. What's the cause of the non-determinism? Random number generator? Threading? Can I get deterministic behaviour if I set the seed of the random number generator and use it single-threaded? I don't see how I can implement a reproducible pipeline with the situation as it is.
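
On the seed question above, a minimal sketch (not RepeatModeler's code) of why an exposed seed would make a random sampling step repeatable. The function name `sample_windows` and all of its parameters are hypothetical; the only assumption is that the sampling step draws window positions from a pseudo-random generator.

```python
import random


def sample_windows(genome_length, window, n_windows, seed):
    """Pick n_windows random window start positions, reproducibly for a given seed.

    Hypothetical helper for illustration only; it does not mirror any
    RepeatModeler internals.
    """
    rng = random.Random(seed)  # private generator, global random state untouched
    max_start = genome_length - window
    return sorted(rng.randrange(0, max_start + 1) for _ in range(n_windows))


# Same seed, same sample: two runs pick identical windows.
run1 = sample_windows(6_000_000, 40_000, 5, seed=42)
run2 = sample_windows(6_000_000, 40_000, 5, seed=42)
print(run1 == run2)
```

Whether a fixed seed alone would make a whole RepeatModeler run reproducible is not established in this thread; threading or other sources of variation could still matter.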

This has become a RepeatModeler question more than a MAKER question, but I thought I'd continue this thread that I'd started here.

n   n:1  L50  min  N80   N50    N20    E-size  max    sum    name
6   6    1    289  7667  12403  12403  9102    12403  24293  RepeatModeler1.fa
6   6    1    332  4023  14769  14769  10920   14769  21738  RepeatModeler2.fa
6   6    1    244  370   2731   2731   1765    2731   4688   RepeatModeler3.fa
10  10   1    354  2114  17134  17134  11354   17134  30782  RepeatModeler4.fa
8   8    3    538  1093  1750   2526   1706    2526   10713  RepeatModeler5.fa

My command line is

    BuildDatabase -name x -engine ncbi x.fa
    RepeatModeler -database x
    cp -a RM_*/consensi.fa.classified RepeatModeler.fa

I installed the following software using Homebrew on a Mac.

    repeatmodeler 1.0.8
    recon 1.07
    repeatmasker 4.0.5
    repeatscout 1.0.5
    rmblast 2.2.28
    trf 4.07b

Cheers,
Shaun
--
http://sjackman.ca/

On 2015-July-17 at 10:36:50, Carson Holt (carsonhh at gmail.com) wrote:

The subset is actually built from a taxonomy. So you can extract all repeats for a species or genus for example. If a term doesn't match the internal taxonomy, it throws an error.

--Carson

On Jul 17, 2015, at 11:24 AM, Carson Holt wrote:

Yes. It takes a subset of RepBase. If runtime isn't an issue and you really want to mask as much as possible, you can also set model_org=all. Most of whatever else is in RepBase probably won't align anywhere, but it may give you marginally better sensitivity.

--Carson

On Jul 17, 2015, at 11:20 AM, Shaun Jackman wrote:

Hi, Carson.

I set model_org=picea. I see that it created a new database in the RepeatModeler folder Libraries/20140131/picea/specieslib. What is the effect of the model_org option? Does it extract sequences from RepBase that match the string picea?

Cheers,
Shaun
--
http://sjackman.ca/

On 2015-July-17 at 9:40:58, Carson Holt (carsonhh at gmail.com) wrote:

That is weird.

One thought though. When you run MAKER do you supply both rmlib and model_org or just rmlib?
If you are only supplying rmlib, you could try supplying both together (RepeatMasker will then run twice). That way some of the edge cases might better be identified.

--Carson

On Jul 16, 2015, at 5:25 PM, Shaun Jackman wrote:

Hi, Carson.

I removed two small contaminant contigs (~7 kbp) from the assembly (~6 Mbp), and MAKER found four fewer genes, four copies of the same atp8 gene, but these genes were not in the contaminant contigs. I figured out that it's because I'm running RepeatModeler to create the rmlib for MAKER. When I remove the contaminant contigs, RepeatModeler now identifies this gene atp8 as being a LTR/Gypsy repeat.

Any thoughts on why removing two contigs would cause RepeatModeler to identify new repeats?

Cheers,
Shaun
--
http://sjackman.ca/
_______________________________________________
maker-devel mailing list
maker-devel at box290.bluehost.com
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From sjackman at gmail.com Tue Jul 21 14:10:11 2015
From: sjackman at gmail.com (Shaun Jackman)
Date: Tue, 21 Jul 2015 12:10:11 -0700
Subject: [maker-devel] Cryptic ACG start codon
Message-ID:

Hi, Carson.

I'm working with a plant mitochondrial genome that has a lot of C to U RNA editing. One effect of this editing is that AUG start codons can be created by editing ACG to AUG. Does MAKER have any particular support for cryptic start codons?

I'm using protein evidence (protein and protein2genome), and a number of the protein sequences that I downloaded from Genbank start with a - character, which indicates an ACG start codon. It would be fantastic if - were allowed to match either ACG or ATG in the genome.

Cheers,
Shaun
--
http://sjackman.ca/
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From carsonhh at gmail.com Tue Jul 21 17:28:09 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 21 Jul 2015 16:28:09 -0600
Subject: [maker-devel] Cryptic ACG start codon
In-Reply-To:
References:
Message-ID:

MAKER uses the is_start_codon method from Bio::Tools::CodonTable to determine if a codon is a valid start codon. Right now I don't have a way to swap out the codon table. There is a way to do it, but it's not easy. If you edit .../maker/lib/CGL/TranslationMachine.pm line 122, you can set the table id to be another one from the BioPerl docs -> http://doc.bioperl.org/releases/bioperl-1.6.1/Bio/Tools/CodonTable.html#BEGIN1

Or you can manually add your own codon table. It won't change the codon usage for aligners like BLAST and Exonerate, but it will allow you to specify another valid start codon. To do that, edit line 118 to add your own manual codon table by adding another 'M' below the position you want to make into a valid start codon.

    my $id = $self->add_table(
        'Strict',
        'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG',
        '-----------------------------------M----------------------------');
    $self->id($id);

I don't really know which string position goes with which three letter nucleotide code. You might have to reverse engineer that from the BioPerl docs in the link above.

--Carson

> On Jul 21, 2015, at 1:10 PM, Shaun Jackman wrote:
>
> Hi, Carson.
>
> I'm working with a plant mitochondrial genome that has a lot of C to U RNA editing. One effect of this editing is that AUG start codons can be created by editing ACG to AUG. Does MAKER have any particular support for cryptic start codons?
>
> I'm using protein evidence (protein and protein2genome), and a number of the protein sequences that I downloaded from Genbank start with a - character, which indicates an ACG start codon. It would be fantastic if - were allowed to match either ACG or ATG in the genome.
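
For the string-position question in the reply above, a small sketch may help. It assumes the 64-character table strings follow the NCBI genetic-code ordering (first base varies slowest, bases ordered T, C, A, G). That assumption is at least consistent with the 'Strict' table quoted above, where ATG lands on the single 'M' in both strings. The helper names `codon_index` and `add_start` are made up for illustration and are not part of BioPerl or MAKER.

```python
BASES = "TCAG"  # assumed base order of the table strings (NCBI convention)

# The two strings of the 'Strict' table quoted above.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STARTS = "-" * 35 + "M" + "-" * 28


def codon_index(codon):
    """Return the 0-based position of a codon in a 64-character table string."""
    c = codon.upper().replace("U", "T")
    if len(c) != 3 or any(b not in BASES for b in c):
        raise ValueError(f"not a plain DNA/RNA codon: {codon!r}")
    first, second, third = (BASES.index(b) for b in c)
    return 16 * first + 4 * second + third


def add_start(starts, codon):
    """Return a copy of the start string with 'M' set at the codon's position."""
    idx = codon_index(codon)
    return starts[:idx] + "M" + starts[idx + 1:]


# Sanity check of the assumed ordering: ATG must hit 'M' in both strings.
assert AMINO[codon_index("ATG")] == "M"
assert STARTS[codon_index("ATG")] == "M"

print(codon_index("ACG"), AMINO[codon_index("ACG")])
print(add_start(STARTS, "ACG"))
```

Under that ordering ACG sits four positions after ATG, on a 'T' (threonine) in the amino-acid string, so the edited start string would carry a second 'M' there. The result is worth checking against the BioPerl documentation before editing TranslationMachine.pm.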
>
> Cheers,
> Shaun
>
> --
> http://sjackman.ca/
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at yandell-lab.org
> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From felix.bemm at uni-wuerzburg.de Mon Jul 27 08:46:44 2015
From: felix.bemm at uni-wuerzburg.de (Felix Bemm)
Date: Mon, 27 Jul 2015 15:46:44 +0200
Subject: [maker-devel] Annotation of 32Mb pseudochromosome
Message-ID: <55B63644.8020509@uni-wuerzburg.de>

Hi,

I am trying to annotate a finished genome (122Mb) and maker loops forever over one of its pseudochromosomes (32Mb). One of the child processes for that chromosome dies with

    DIED RANK 13:4:0:83
    DIED COUNT 1

The rb.out file contains: RepeatMasker quit because the file pseudochr_Chr1.82.arabidopsis.rb only contains ambiguous bases, if any.

The assembly contains about 1,997,703 N's. Could it be that maker accidentally creates a 200kb chunk that consists entirely of N's and then crashes during repeat annotation? What about setting max_dna_len to something like 25000000 and then trying again?

Cheers,
Felix

--
University of Würzburg, Department Bioinformatics
Group Evolutionary Computational Biology
Biocentre, 97074 Würzburg, Germany

From dence at genetics.utah.edu Mon Jul 27 11:15:29 2015
From: dence at genetics.utah.edu (Daniel Ence)
Date: Mon, 27 Jul 2015 16:15:29 +0000
Subject: [maker-devel] Annotation of 32Mb pseudochromosome
In-Reply-To: <55B63644.8020509@uni-wuerzburg.de>
References: <55B63644.8020509@uni-wuerzburg.de>
Message-ID: <13B84042-B9A7-448F-87E3-7D8379F70C1E@genetics.utah.edu>

Hi Felix, I think that idea about the chunk sounds plausible, depending on how much of the sequence for that pseudochromosome is N's. Is there an error message besides the DIED for that pseudochromosome?

~Daniel

Daniel Ence
Graduate Student
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330

> On Jul 27, 2015, at 7:46 AM, Felix Bemm wrote:
>
> Hi,
>
> I am trying to annotate a finished genome (122Mb) and maker loops forever over one of its pseudochromosomes (32Mb). One of the child processes for that chromosome dies with
>
> DIED RANK 13:4:0:83
> DIED COUNT 1
>
> The rb.out file contains: RepeatMasker quit because the file pseudochr_Chr1.82.arabidopsis.rb only contains ambiguous bases, if any.
>
> The assembly contains about 1,997,703 N's. Could it be that maker accidentally creates a 200kb chunk that consists entirely of N's and then crashes during repeat annotation? What about setting max_dna_len to something like 25000000 and then trying again?
>
> Cheers,
> Felix
>
> --
> University of Würzburg, Department Bioinformatics
> Group Evolutionary Computational Biology
> Biocentre, 97074 Würzburg, Germany
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From carsonhh at gmail.com Mon Jul 27 12:12:58 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 27 Jul 2015 11:12:58 -0600
Subject: [maker-devel] Annotation of 32Mb pseudochromosome
In-Reply-To: <13B84042-B9A7-448F-87E3-7D8379F70C1E@genetics.utah.edu>
References: <55B63644.8020509@uni-wuerzburg.de> <13B84042-B9A7-448F-87E3-7D8379F70C1E@genetics.utah.edu>
Message-ID: <5AB595A5-FDA9-4ED1-A21E-EE9F1D196E30@gmail.com>

You can also try installing RepeatMasker with rmblast as the default aligner or hmmer as the default. That will alter its behavior. Also make sure you are using blast+ version 2.2.28.
Do not use blast+ version 2.2.29.

--Carson

> On Jul 27, 2015, at 10:15 AM, Daniel Ence wrote:
>
> Hi Felix, I think that idea about the chunk sounds plausible, depending on how much of the sequence for that pseudochromosome is N's. Is there an error message besides the DIED for that pseudochromosome?
>
> ~Daniel
>
> Daniel Ence
> Graduate Student
> Eccles Institute of Human Genetics
> University of Utah
> 15 North 2030 East, Room 2100
> Salt Lake City, UT 84112-5330
>
>> On Jul 27, 2015, at 7:46 AM, Felix Bemm wrote:
>>
>> Hi,
>>
>> I am trying to annotate a finished genome (122Mb) and maker loops forever over one of its pseudochromosomes (32Mb). One of the child processes for that chromosome dies with
>>
>> DIED RANK 13:4:0:83
>> DIED COUNT 1
>>
>> The rb.out file contains: RepeatMasker quit because the file pseudochr_Chr1.82.arabidopsis.rb only contains ambiguous bases, if any.
>>
>> The assembly contains about 1,997,703 N's. Could it be that maker accidentally creates a 200kb chunk that consists entirely of N's and then crashes during repeat annotation? What about setting max_dna_len to something like 25000000 and then trying again?
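
The all-N chunk hypothesis above can be checked directly on the sequence. This is a rough sketch that assumes fixed-size, non-overlapping chunks; how MAKER actually splits a contig (chunk size, overlap) may differ, and `ambiguous_only_chunks` is a hypothetical helper, not a MAKER function.

```python
def ambiguous_only_chunks(sequence, chunk_size):
    """Yield (start, end) for each fixed-size chunk that has no A, C, G or T.

    Coordinates are 0-based, end-exclusive. The chunking scheme is an
    assumption made for illustration.
    """
    for start in range(0, len(sequence), chunk_size):
        chunk = sequence[start:start + chunk_size].upper()
        if chunk and not any(base in "ACGT" for base in chunk):
            yield start, start + len(chunk)


# Toy sequence: 20 bp of real bases, a 25 bp run of N, then 20 bp of real bases.
toy = "ACGT" * 5 + "N" * 25 + "ACGT" * 5
print(list(ambiguous_only_chunks(toy, 10)))  # [(20, 30), (30, 40)]
```

Run on the real pseudochromosome with the chunk size in use, an empty result would argue against the hypothesis, at least under this simplified chunking.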
>>
>> Cheers,
>> Felix
>>
>> --
>> University of Würzburg, Department Bioinformatics
>> Group Evolutionary Computational Biology
>> Biocentre, 97074 Würzburg, Germany
>>
>> _______________________________________________
>> maker-devel mailing list
>> maker-devel at box290.bluehost.com
>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From cjfields at illinois.edu Thu Jul 30 11:27:45 2015
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Thu, 30 Jul 2015 16:27:45 +0000
Subject: [maker-devel] MAKER and RepeatModeler
In-Reply-To:
References:
Message-ID: <40D3E41C-0915-4B42-BF9C-DD779F2D5D06@illinois.edu>

Hi Shaun

Ever get an answer on this one from the RepeatMasker folks? I've seen (and expect) non-deterministic results from a few tools but the results shouldn't change *that* dramatically.

chris

On Jul 17, 2015, at 3:29 PM, Shaun Jackman wrote:

Hi, Carson.

It seems that RepeatModeler is not deterministic. I ran it five times on the same sequence and got very different outputs. Two of these five runs mark atp8 as a repeat, which is why I have genes blinking in and out of existence.

How do folk deal with this situation? It seems absurd. What's the cause of the non-determinism? Random number generator? Threading? Can I get deterministic behaviour if I set the seed of the random number generator and use it single-threaded? I don't see how I can implement a reproducible pipeline with the situation as it is.

This has become a RepeatModeler question more than a MAKER question, but I thought I'd continue this thread that I'd started here.

n   n:1  L50  min  N80   N50    N20    E-size  max    sum    name
6   6    1    289  7667  12403  12403  9102    12403  24293  RepeatModeler1.fa
6   6    1    332  4023  14769  14769  10920   14769  21738  RepeatModeler2.fa
6   6    1    244  370   2731   2731   1765    2731   4688   RepeatModeler3.fa
10  10   1    354  2114  17134  17134  11354   17134  30782  RepeatModeler4.fa
8   8    3    538  1093  1750   2526   1706    2526   10713  RepeatModeler5.fa

My command line is

    BuildDatabase -name x -engine ncbi x.fa
    RepeatModeler -database x
    cp -a RM_*/consensi.fa.classified RepeatModeler.fa

I installed the following software using Homebrew on a Mac.

    repeatmodeler 1.0.8
    recon 1.07
    repeatmasker 4.0.5
    repeatscout 1.0.5
    rmblast 2.2.28
    trf 4.07b

Cheers,
Shaun
--
http://sjackman.ca/

On 2015-July-17 at 10:36:50, Carson Holt (carsonhh at gmail.com) wrote:

The subset is actually built from a taxonomy. So you can extract all repeats for a species or genus for example. If a term doesn't match the internal taxonomy, it throws an error.

--Carson

On Jul 17, 2015, at 11:24 AM, Carson Holt wrote:

Yes. It takes a subset of RepBase. If runtime isn't an issue and you really want to mask as much as possible, you can also set model_org=all. Most of whatever else is in RepBase probably won't align anywhere, but it may give you marginally better sensitivity.

--Carson

On Jul 17, 2015, at 11:20 AM, Shaun Jackman wrote:

Hi, Carson.

I set model_org=picea. I see that it created a new database in the RepeatModeler folder Libraries/20140131/picea/specieslib. What is the effect of the model_org option? Does it extract sequences from RepBase that match the string picea?

Cheers,
Shaun
--
http://sjackman.ca/

On 2015-July-17 at 9:40:58, Carson Holt (carsonhh at gmail.com) wrote:

That is weird.

One thought though. When you run MAKER do you supply both rmlib and model_org or just rmlib? If you are only supplying rmlib, you could try supplying both together (RepeatMasker will then run twice). That way some of the edge cases might better be identified.

--Carson

On Jul 16, 2015, at 5:25 PM, Shaun Jackman wrote:

Hi, Carson.

I removed two small contaminant contigs (~7 kbp) from the assembly (~6 Mbp), and MAKER found four fewer genes, four copies of the same atp8 gene, but these genes were not in the contaminant contigs. I figured out that it's because I'm running RepeatModeler to create the rmlib for MAKER. When I remove the contaminant contigs, RepeatModeler now identifies this gene atp8 as being a LTR/Gypsy repeat.

Any thoughts on why removing two contigs would cause RepeatModeler to identify new repeats?

Cheers,
Shaun
--
http://sjackman.ca/
_______________________________________________
maker-devel mailing list
maker-devel at box290.bluehost.com
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
_______________________________________________
maker-devel mailing list
maker-devel at box290.bluehost.com
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From sjackman at gmail.com Thu Jul 30 13:11:48 2015
From: sjackman at gmail.com (Shaun Jackman)
Date: Thu, 30 Jul 2015 11:11:48 -0700
Subject: [maker-devel] MAKER and RepeatModeler
Message-ID:

Hi, Chris.

Yes, I did get a response from the RepeatModeler author, Robert Hubley (cc'ed). There's no public mailing list, as far as I know, so it's all in private communication. Yes, RepeatModeler is non-deterministic. I suggested that the random seed be added as a parameter to RepeatModeler, and Robert agreed.

I'm still not sure why the results were so variable (between 5 kbp and 30 kbp annotated as repeats, see table far below). Perhaps it's because my genome is much smaller (6 Mbp) than the size of the random sample (40 Mbp) that RepeatModeler uses. See immediately below. Robert?

Cheers,
Shaun

RepeatModeler Round # 1
========================
Searching for Repeats
 -- Sampling from the database...
   - Gathering up to 40000000 bp
   - Final Sample Size = 6001210 bp ( 5937815 non ambiguous )
   - Num Contigs Represented = 38

--
http://sjackman.ca/

On 2015-July-30 at 9:28:27, Fields, Christopher J (cjfields at illinois.edu) wrote:

Hi Shaun

Ever get an answer on this one from the RepeatMasker folks? I've seen (and expect) non-deterministic results from a few tools but the results shouldn't change *that* dramatically.

chris

On Jul 17, 2015, at 3:29 PM, Shaun Jackman wrote:

Hi, Carson.

It seems that RepeatModeler is not deterministic. I ran it five times on the same sequence and got very different outputs. Two of these five runs mark atp8 as a repeat, which is why I have genes blinking in and out of existence.

How do folk deal with this situation? It seems absurd. What's the cause of the non-determinism? Random number generator? Threading? Can I get deterministic behaviour if I set the seed of the random number generator and use it single-threaded? I don't see how I can implement a reproducible pipeline with the situation as it is.

This has become a RepeatModeler question more than a MAKER question, but I thought I'd continue this thread that I'd started here.

n   n:1  L50  min  N80   N50    N20    E-size  max    sum    name
6   6    1    289  7667  12403  12403  9102    12403  24293  RepeatModeler1.fa
6   6    1    332  4023  14769  14769  10920   14769  21738  RepeatModeler2.fa
6   6    1    244  370   2731   2731   1765    2731   4688   RepeatModeler3.fa
10  10   1    354  2114  17134  17134  11354   17134  30782  RepeatModeler4.fa
8   8    3    538  1093  1750   2526   1706    2526   10713  RepeatModeler5.fa

My command line is

    BuildDatabase -name x -engine ncbi x.fa
    RepeatModeler -database x
    cp -a RM_*/consensi.fa.classified RepeatModeler.fa

I installed the following software using Homebrew on a Mac.

    repeatmodeler 1.0.8
    recon 1.07
    repeatmasker 4.0.5
    repeatscout 1.0.5
    rmblast 2.2.28
    trf 4.07b

Cheers,
Shaun
--
http://sjackman.ca/

On 2015-July-17 at 10:36:50, Carson Holt (carsonhh at gmail.com) wrote:

The subset is actually built from a taxonomy.
So you can extract all repeats for a species or genus for example. If a term doesn't match the internal taxonomy, it throws an error.

--Carson

On Jul 17, 2015, at 11:24 AM, Carson Holt wrote:

Yes. It takes a subset of RepBase. If runtime isn't an issue and you really want to mask as much as possible, you can also set model_org=all. Most of whatever else is in RepBase probably won't align anywhere, but it may give you marginally better sensitivity.

--Carson

On Jul 17, 2015, at 11:20 AM, Shaun Jackman wrote:

Hi, Carson.

I set model_org=picea. I see that it created a new database in the RepeatModeler folder Libraries/20140131/picea/specieslib. What is the effect of the model_org option? Does it extract sequences from RepBase that match the string picea?

Cheers,
Shaun
--
http://sjackman.ca/

On 2015-July-17 at 9:40:58, Carson Holt (carsonhh at gmail.com) wrote:

That is weird.

One thought though. When you run MAKER do you supply both rmlib and model_org or just rmlib? If you are only supplying rmlib, you could try supplying both together (RepeatMasker will then run twice). That way some of the edge cases might better be identified.

--Carson

On Jul 16, 2015, at 5:25 PM, Shaun Jackman wrote:

Hi, Carson.

I removed two small contaminant contigs (~7 kbp) from the assembly (~6 Mbp), and MAKER found four fewer genes, four copies of the same atp8 gene, but these genes were not in the contaminant contigs. I figured out that it's because I'm running RepeatModeler to create the rmlib for MAKER. When I remove the contaminant contigs, RepeatModeler now identifies this gene atp8 as being a LTR/Gypsy repeat.

Any thoughts on why removing two contigs would cause RepeatModeler to identify new repeats?

Cheers,
Shaun
--
http://sjackman.ca/
_______________________________________________
maker-devel mailing list
maker-devel at box290.bluehost.com
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
_______________________________________________
maker-devel mailing list
maker-devel at box290.bluehost.com
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From sjackman at gmail.com Thu Jul 30 18:21:53 2015
From: sjackman at gmail.com (Shaun Jackman)
Date: Thu, 30 Jul 2015 16:21:53 -0700
Subject: [maker-devel] Short introns
In-Reply-To: <6B98C302-063B-4BAF-8F6E-2EDE9D8CBDA3@gmail.com>
References: <8B3943E4-1D25-4C54-8A35-E64AC4A93689@gmail.com> <6B98C302-063B-4BAF-8F6E-2EDE9D8CBDA3@gmail.com>
Message-ID:

Hi, Carson.

Increasing $min_intron to 250 did the trick! Thanks for the tip.

For this organellar genome, all the introns (at least that I've found) are group II self-splicing introns, so they have a much larger minimum size (about 990 bp) than introns in other genomes.
https://en.wikipedia.org/wiki/Group_II_intron

Increasing $min_intron worked perfectly for all but one gene. Decreasing --codongapopen and --codongapextend rescued that one gene, but I'll probably just leave the defaults and annotate that one gene by hand.

Note that $min_intron had no effect on protein2genome without the following patch. Will there be a stable release soon that includes this patch and the min_intron option? I'm preparing a manuscript for submission, and I'd love to be able to refer to a stable version of MAKER in the manuscript that includes this feature.

Have you considered moving the MAKER development to GitHub?

Thanks again.

Cheers,
Shaun

diff --git a/protein.pm.orig b/protein.pm
--- a/protein.pm.orig
+++ b/protein.pm
@@ -94,11 +94,11 @@ sub runExonerate {
     my $command = "$exe -q $q_file -t $t_file -Q protein -T dna ";
     $command .= "-m protein2genome --softmasktarget ";
+    $command .= " --minintron $min_intron --maxintron $max_intron --showcigar";
     $command .= " --percent $percent";
     if ($matrix) {
         $command .= " --proteinsubmat $matrix";
     }
-    $command .= " --showcigar ";
     $command .= " > $o_file";
     my $w = new Widget::exonerate::protein2genome();

--
http://sjackman.ca/

On 2015-July-16 at 13:10:13, Carson Holt (carsonhh at gmail.com) wrote:

I can add it to the development version.

--Carson

On Jul 16, 2015, at 1:11 PM, Shaun Jackman wrote:

Hi, Carson.

One of the ten questionable introns has a canonical GT-AG splice site and is 33 bp. The splice sites are GA-AG, GG-GG, GC-AG, GT-CG, GA-AT, GG-AA, GG-AG, AT-TT, GG-AT and GT-AG. The intron sizes are 33, 111, 84, 30, 219, 186, 51, 30, 45 and 33.

I was wrong about there being stop codons in the questionable introns. All ten are completely free of stop codons. Sorry for the confusion. I had extracted just the intron sequence and translated the first frame, but the intron was not aligned to a 3-nucleotide boundary.

I am convinced that these short introns are in fact genomic insertions and not introns. The root cause may be incorrectly annotated introns in the protein evidence, as you suggest.

I'll try tweaking the $min_intron parameter. Could this parameter be added to the maker_opts.ctl configuration file?

Thanks for your help, Carson.

Cheers,
Shaun
--
http://sjackman.ca/

On 2015-July-16 at 10:55:32, Carson Holt (carsonhh at gmail.com) wrote:

Look at the region. If it's being suggested by the polished alignment, I somewhat doubt it's just an insertion because the polished alignment will have valid splice sites.
?It could be an insertion, but one that perfectly maps around canonical splice sites would be quite the coincidence (because exonerate shouldn't make big gaps to force the alignment). However if it looks more like a forced mapping around non-canonical spice sites (which shouldn?t actually produce protein2genome results) then I might support the idea that it?s an insertion. ?A 250bp intron or even a 100 bp doesn?t really seem that short to me. The lower range seen in fungi for example (which have very short introns) can get close to about 20bp. ?I guess it?s possible that the protein evidence you are using contains an intron that isn?t really there, that results in an intron in your job because of protein conservation (i.e. conserved codons contain the falsely used splice site). The minimum intron given to exonerate for polishing is 20. ?It?s hard coded, and you would have to manually edit it. Line 1534 in maker/lib/GI.pm ?> my?$min_intron?= 20;? ?Carson On Jul 15, 2015, at 6:36 PM, Shaun Jackman wrote: Hi, Carson. I?m using?protein?evidence and?protein2genome?alone without?ab initio?gene finders to annotate an organellar genome. MAKER annotates 16 introns. 6 introns look real (according to RNAweasel) and are all larger than 900 bp. The other 10 introns are all shorter than 250 bp and multiples of 3 bp. These short introns look like genomic insertions rather than introns to me. Is there a way to specify a minimum intron size to MAKER? 6 of these short introns do not contain a stop codon, and 4 do contain a stop codon. I suppose these 4 are pseudogenes. Thanks, Shaun --? http://sjackman.ca/ _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.com http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... 
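For readers who do not read Perl, the command construction in the patch from this message can be sketched in Python. This is only an illustration under stated assumptions: the function name, the file names, and the `percent` and `max_intron` values are hypothetical, and only the exonerate flags themselves are taken from the diff.

```python
import shlex


def build_exonerate_command(exe, q_file, t_file, o_file, percent,
                            min_intron, max_intron, matrix=None):
    """Build a protein2genome command line in the same flag order as the patch.

    Hypothetical helper: MAKER itself assembles this string in Perl.
    """
    args = [
        exe, "-q", q_file, "-t", t_file, "-Q", "protein", "-T", "dna",
        "-m", "protein2genome", "--softmasktarget",
        # The patch adds these two intron bounds and moves --showcigar here.
        "--minintron", str(min_intron),
        "--maxintron", str(max_intron),
        "--showcigar",
        "--percent", str(percent),
    ]
    if matrix:
        args += ["--proteinsubmat", matrix]
    # Quote each argument, then redirect standard output to the result file.
    return " ".join(shlex.quote(a) for a in args) + " > " + shlex.quote(o_file)


# Example values are assumptions, apart from min_intron=250 from the thread.
cmd = build_exonerate_command("exonerate", "proteins.fa", "genome.fa",
                              "hits.out", percent=20,
                              min_intron=250, max_intron=10000)
print(cmd)
```

Passing the intron bounds as explicit arguments makes it easy to see that, before the patch, nothing in the command carried `$min_intron` through to exonerate.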
URL:

From carsonhh at gmail.com Wed Jul 1 11:16:15 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 1 Jul 2015 11:16:15 -0600
Subject: [maker-devel] Best way to assemble RNA-seq for MAKER
In-Reply-To:
References:
Message-ID:

I would pool them. You will get better coverage of low expression transcripts. While there may be differently spliced transcripts among the tissues, MAKER and all gene prediction programs used by MAKER by default are not going to try and work out alternate splicing anyways. You can tell it to (altsplice= option), but your EST evidence has to be near perfect end-to-end for that to work.

--Carson

> On Jun 29, 2015, at 4:45 PM, John Cornelius wrote:
>
> Hi I have a quick question, I have RNA-seq from several different tissue types and I was wondering, would it be better to pool them and assemble them as one large transcriptome? Or, should I assemble each tissue separately and then use MAKER to integrate the smaller assemblies into the annotation? Thanks.
>
> --
> John Cornelius
> MCB PhD Candidate
> Arizona State University
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From eric.ganko at syngenta.com Mon Jul 6 09:37:28 2015
From: eric.ganko at syngenta.com (Ganko Eric USRE)
Date: Mon, 6 Jul 2015 15:37:28 +0000
Subject: [maker-devel] MAKER processing time in a 2Gb genome
Message-ID:

I'm hoping for some advice on an unexpectedly long process time for a 2Gb genome.
Currently I'm using an install of MAKER-P on the iForge system @ NCSA and I've successfully run ~1Gb genomes in 2-3 hours across 20 nodes (24 Intel "Haswell" cores, 64 GB of RAM per node) via MPICH.

I recently ran some tests on 50Mb of corn that took ~2 hours on 2 nodes (48 cores). Based on that I was surprised when the full 2Gb corn genome run timed out at >24h with 30 nodes (720 cores); in that time it hadn't processed many sequences based on the master_datastore_index.log:

TOTAL: 25000 seqs
STARTED: 3594
FINISHED: 2979
FAILED: 10
RETRY: 9
DIED_SKIPPED_PERMANENT: 0
SKIPPED_SMALL: 7635

While I can set a longer wall clock, these results are several times longer than what was reported in the MAKER-P paper, i.e. running the corn B73 genome in less than 4 hours; here it is not close to done after 24h. I don't have an enormous amount of supporting data; this trial run has ~100k transcripts and another ~100k proteins. Corn has a very high repeat content, so my suspicion is RepeatMasker IO. In discussions with the iForge admins I have discovered that the temp space is network attached (GPFS), and they've suggested using a RAM disk (i.e. /dev/shm) as the temp directory. In tests on smaller sequences that ran a little slower, so I'm not sure if MAKER is meant to run that way. I'd appreciate input on experience with a RAM disk approach, or any alternative thoughts or suggestions.

Thanks,
Eric

________________________________
This message may contain confidential information. If you are not the designated recipient, please notify the sender immediately, and delete the original and any copies. Any use of the message by you is prohibited.
-------------- next part --------------
An HTML attachment was scrubbed...
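A status summary like the one in this message can be tallied with a few lines of Python. This is a minimal sketch, not part of MAKER: it assumes the index log is tab-separated with the status in the third column (sequence name, datastore path, status), and the function name and sample records are hypothetical.

```python
from collections import Counter


def tally_datastore_status(lines):
    """Count log lines per status, assuming the status is the third column."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or truncated (possibly corrupted) lines
        counts[fields[2]] += 1
    return counts


# Hypothetical records; a real run would iterate over the open log file.
sample = [
    "contig1\tdatastore/contig1/\tSTARTED\n",
    "contig1\tdatastore/contig1/\tFINISHED\n",
    "contig2\tdatastore/contig2/\tSTARTED\n",
    "truncated line\n",
]
counts = tally_datastore_status(sample)
for status, n in sorted(counts.items()):
    print(f"{status}: {n}")
```

Comparing the STARTED and FINISHED totals over time gives a rough throughput figure, which is easier to reason about than wall-clock time alone.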
URL:

From jcornel3 at asu.edu Tue Jul 7 21:59:47 2015
From: jcornel3 at asu.edu (John Cornelius)
Date: Tue, 7 Jul 2015 20:59:47 -0700
Subject: [maker-devel] Ability to process transcriptomes from different assemblers
Message-ID:

Hello, I have several tissues that I have assembled with two different transcriptome assemblers and I was wondering, should I feed these different assemblies directly into MAKER? Or should I use something like evidence modeler to combine them first? Thanks.

--
John Cornelius
MCB PhD Candidate
Arizona State University
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From carsonhh at gmail.com Thu Jul 9 12:55:38 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 9 Jul 2015 12:55:38 -0600
Subject: [maker-devel] Ability to process transcriptomes from different assemblers
In-Reply-To:
References:
Message-ID: <4A01AD3F-2CD6-4252-9A72-A3FA1E835CFB@gmail.com>

You should be able to just supply both.

--Carson

> On Jul 7, 2015, at 9:59 PM, John Cornelius wrote:
>
> Hello, I have several tissues that I have assembled with two different transcriptome assemblers and I was wondering, should I feed these different assemblies directly into MAKER? Or should I use something like evidence modeler to combine them first? Thanks.
>
> --
> John Cornelius
> MCB PhD Candidate
> Arizona State University
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From carsonhh at gmail.com Thu Jul 9 13:01:13 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 9 Jul 2015 13:01:13 -0600
Subject: [maker-devel] MAKER processing time in a 2Gb genome
In-Reply-To:
References:
Message-ID: <8D11F342-8BD5-406E-B047-8CB0F2121CD3@gmail.com>

Runtimes are the result of gene density, evidence dataset size, and evidence dataset type. For example, protein data takes ~10 times longer to process than EST data, and alt-EST data takes ~10 times longer than protein data. If you double the size of input datasets, then you double runtime. Also, the assembly size doesn't seem to have a large effect on runtime. It tends to be gene density that has the most effect, so a 2Gb assembly runs only somewhat slower than a 300Mb assembly containing the same number of genes.

For best MPI performance, you can submit multiple jobs with 200 CPUs or less. Over 200 CPUs per job tends to get limited throughput increases due to MPI communication overhead.

I never use RAM disk. In general MAKER produces too many temporary files to fit in RAM.

--Carson

> On Jul 6, 2015, at 9:37 AM, Ganko Eric USRE wrote:
>
> I'm hoping for some advice on an unexpectedly long process time for a 2Gb genome. Currently I'm using an install of MAKER-P on the iForge system @ NCSA and I've successfully run ~1Gb genomes in 2-3 hours across 20 nodes (24 Intel "Haswell" cores, 64 GB of RAM per node) via MPICH.
>
> I recently ran some tests on 50Mb of corn that took ~2 hours on 2 nodes (48 cores). Based on that I was surprised when the full 2Gb corn genome run timed out at >24h with 30 nodes (720 cores); in that time it hadn't processed many sequences based on the master_datastore_index.log :
>
> TOTAL: 25000 seqs
> STARTED: 3594
> FINISHED: 2979
> FAILED: 10
> RETRY: 9
> DIED_SKIPPED_PERMANENT: 0
> SKIPPED_SMALL: 7635
>
> While I can set a longer wall clock, these results are several times longer than what was reported in the MAKER-P paper, i.e. running the corn B73 genome in less than 4 hours; here it is not close to done after 24h. I don't have an enormous amount of supporting data; this trial run has ~100k transcripts and another ~100k proteins. Corn has a very high repeat content, so my suspicion is Repeatmasker IO.
In discussions with the iForge admins I have discovered that the temp space is network attached (GPFS), and they?ve suggested using a RAM disk (i.e /dev/shm) as the temp directory. In tests on smaller sequence that ran a little slower so I?m not sure if MAKER is meant to run that way. I?d appreciate input on experience with a RAM disk approach, or if anyone has alternative thoughts or suggestions? > > Thanks, > Eric > > > This message may contain confidential information. If you are not the designated recipient, please notify the sender immediately, and delete the original and any copies. Any use of the message by you is prohibited. _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From eric.ganko at syngenta.com Thu Jul 9 14:36:59 2015 From: eric.ganko at syngenta.com (Ganko Eric USRE) Date: Thu, 9 Jul 2015 20:36:59 +0000 Subject: [maker-devel] MAKER processing time in a 2Gb genome In-Reply-To: <8D11F342-8BD5-406E-B047-8CB0F2121CD3@gmail.com> References: <8D11F342-8BD5-406E-B047-8CB0F2121CD3@gmail.com> Message-ID: Tuesday I ran the same option files, this time with 480 cores, and the annotation completed in ~6 hours. Perhaps I?m trying too many simultaneous writes at higher levels, or there is too much MPI communication as you mentioned? Thanks for the input on the RAM disk. -eric From: Carson Holt [mailto:carsonhh at gmail.com] Sent: Thursday, July 09, 2015 3:01 PM To: Ganko Eric USRE Cc: maker-devel at yandell-lab.org Subject: Re: [maker-devel] MAKER processing time in a 2Gb genome Runtimes are the result of gene density, evidence dataset size, ans evidence dataset type. For example protein data takes ~10 times longer to process than EST data, and alt-EST data takes ~10 times longer than protein data. 
If you double the size of input datasets, then you double runtime. Also the assembly size doesn?t seem to have a large effect on runtime. It tends to be gene density that has the most effect, so a 2Gb assembly runs only somewhat slower than a 300Mb assembly containing the same number of genes. For best MPI performance, you can submit multiple jobs with 200 CPUs or less. Over 200 CPUs per job tens to get limited throughput increases due to MPI communication overhead. I never use RAM disk. In general MAKER produces too many temporary files to fit in RAM. ?Carson On Jul 6, 2015, at 9:37 AM, Ganko Eric USRE > wrote: I?m hoping for some advice on an unexpectedly long process time for a 2Gb genome. Currently I?m using an install of MAKER-P on the iForge system @ NCSA and I?ve successfully run ~1Gb genomes in 2-3 hours across 20 nodes (24 Intel "Haswell" cores, 64 GB of RAM per node) via MPICH. I recently ran some tests on 50Mb of corn that took ~2 hours on 2 nodes (48 cores). Based on that I was surprised when the full 2Gb corn genome run timed out at >24h with 30 nodes (720 cores); in that time it hadn?t processed many sequences based on the master_datastore_index.log : TOTAL: 25000 seqs STARTED: 3594 FINISHED: 2979 FAILED: 10 RETRY: 9 DIED_SKIPPED_PERMANENT: 0 SKIPPED_SMALL: 7635 While I can set a longer wall clock, these results are several times longer than what was reported in the MAKER-P paper, i.e. running the corn B73 genome in less than 4 hours; here it is not close to done after 24h. I don?t have an enormous amount of supporting data? this trial run has ~100k transcripts and another ~100k proteins. Corn has a very high repeat content, so my suspicion is Repeatmasker IO. In discussions with the iForge admins I have discovered that the temp space is network attached (GPFS), and they?ve suggested using a RAM disk (i.e /dev/shm) as the temp directory. In tests on smaller sequence that ran a little slower so I?m not sure if MAKER is meant to run that way. 
I?d appreciate input on experience with a RAM disk approach, or if anyone has alternative thoughts or suggestions? Thanks, Eric ________________________________ This message may contain confidential information. If you are not the designated recipient, please notify the sender immediately, and delete the original and any copies. Any use of the message by you is prohibited. -------------- next part -------------- An HTML attachment was scrubbed... URL: From panos.ioannidis at gmail.com Fri Jul 10 03:50:56 2015 From: panos.ioannidis at gmail.com (Panos Ioannidis) Date: Fri, 10 Jul 2015 11:50:56 +0200 Subject: [maker-devel] Repeats... Message-ID: Hi guys, I have finished running Maker on my genome, but get >800 genes (out of ~20,000) that have similarity to transposases. Except from RepBase, have also built a species-specific repeat library, so it's weird that I still have quite a few transposases in my gene set... The repeat masking-related parameters in my maker-opts.ctl file are: model_org=all #select a model organism for RepBase masking in RepeatMasker rmlib=consensi.fa.classified #provide an organism specific repeat library in fasta format for RepeatMasker repeat_protein=/Home/pioannid/Programs/maker/data/te_proteins.fasta #provide a fasta file of transposable element proteins for RepeatRunner rm_gff= #pre-identified repeat elements from an external GFF3 file prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering) Does anyone have an idea why I'm getting so many transposases? Thanks, Panos -------------- next part -------------- An HTML attachment was scrubbed... URL: From panos.ioannidis at gmail.com Fri Jul 10 05:45:49 2015 From: panos.ioannidis at gmail.com (Panos Ioannidis) Date: Fri, 10 Jul 2015 13:45:49 +0200 Subject: [maker-devel] Repeats... In-Reply-To: References: Message-ID: An additional question related to the previous. 
I searched my species-specific repeat library with InterProScan and can't find a single sequence with similarity to a transposable element... I would expect it to find at least a few transposases. Is there an explanation for this, or has something gone wrong? Thanks, P On Fri, Jul 10, 2015 at 11:50 AM, Panos Ioannidis wrote: > Hi guys, > > I have finished running Maker on my genome, but get >800 genes (out of > ~20,000) that have similarity to transposases. Except from RepBase, have > also built a species-specific repeat library, so it's weird that I still > have quite a few transposases in my gene set... > > The repeat masking-related parameters in my maker-opts.ctl file are: > > model_org=all #select a model organism for RepBase masking in RepeatMasker > rmlib=consensi.fa.classified #provide an organism specific repeat library > in fasta format for RepeatMasker > repeat_protein=/Home/pioannid/Programs/maker/data/te_proteins.fasta > #provide a fasta file of transposable element proteins for RepeatRunner > rm_gff= #pre-identified repeat elements from an external GFF3 file > prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change > this), 1 = yes, 0 = no > softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg > and dust filtering) > > Does anyone have an idea why I'm getting so many transposases? > > Thanks, > Panos > -------------- next part -------------- An HTML attachment was scrubbed... URL: From panos.ioannidis at gmail.com Mon Jul 13 03:45:08 2015 From: panos.ioannidis at gmail.com (Panos Ioannidis) Date: Mon, 13 Jul 2015 11:45:08 +0200 Subject: [maker-devel] Repeats... In-Reply-To: References: Message-ID: Hi Daniel, Thanks for the reply. No, I'm not using Genemark. I didn't check for overlap with RepeatMasker elements or transcript/protein evidence though. But since it is such an unexpected finding, I decided to do something simpler. 
So I took all 750 transposases with the same InterPro annotation (IS4 family transposases) and clustered them with CD-HIT (amino acid sequences). At 90% similarity threshold each transposase goes to its own cluster. At 80% I get 748 clusters... This means that even though these transposases belong to the same family, they have diverged quite a bit, so that they're no longer considered "repeat elements". And this explains why they were not filtered out by RepeatMasker and made it to the final gene set. On Fri, Jul 10, 2015 at 5:00 PM, Daniel Ence wrote: > Hi Panos, Without knowing how you made the species-specific repeat > library, I can't speak to why it's giving hits against repbase. As to the > 800 transposases, are they overlapped by repeat masker elements? Are they > supported by EST or protein evidence? Are you using Genemark? That > ab-initio predictor runs on the unmasked genome sequence, so if the > transposases are present in your evidence set, they could still show up as > gene models. > > ~Daniel > > Sent from my iPhone > > On Jul 10, 2015, at 5:45 AM, Panos Ioannidis > wrote: > > An additional question related to the previous. > > I searched my species-specific repeat library with InterProScan and can't > find a single sequence with similarity to a transposable element... > > I would expect it to find at least a few transposases. Is there an > explanation for this, or has something gone wrong? > > Thanks, > P > > > On Fri, Jul 10, 2015 at 11:50 AM, Panos Ioannidis < > panos.ioannidis at gmail.com> wrote: > >> Hi guys, >> >> I have finished running Maker on my genome, but get >800 genes (out of >> ~20,000) that have similarity to transposases. Except from RepBase, have >> also built a species-specific repeat library, so it's weird that I still >> have quite a few transposases in my gene set... 
>> >> The repeat masking-related parameters in my maker-opts.ctl file are: >> >> model_org=all #select a model organism for RepBase masking in RepeatMasker >> rmlib=consensi.fa.classified #provide an organism specific repeat library >> in fasta format for RepeatMasker >> repeat_protein=/Home/pioannid/Programs/maker/data/te_proteins.fasta >> #provide a fasta file of transposable element proteins for RepeatRunner >> rm_gff= #pre-identified repeat elements from an external GFF3 file >> prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change >> this), 1 = yes, 0 = no >> softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg >> and dust filtering) >> >> Does anyone have an idea why I'm getting so many transposases? >> >> Thanks, >> Panos >> > > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sjackman at gmail.com Wed Jul 15 18:36:22 2015 From: sjackman at gmail.com (Shaun Jackman) Date: Wed, 15 Jul 2015 17:36:22 -0700 Subject: [maker-devel] Short introns Message-ID: Hi, Carson. I?m using protein evidence and protein2genome alone without ab initio gene finders to annotate an organellar genome. MAKER annotates 16 introns. 6 introns look real (according to RNAweasel) and are all larger than 900 bp. The other 10 introns are all shorter than 250 bp and multiples of 3 bp. These short introns look like genomic insertions rather than introns to me. Is there a way to specify a minimum intron size to MAKER? 6 of these short introns do not contain a stop codon, and 4 do contain a stop codon. I suppose these 4 are pseudogenes. Thanks, Shaun --? http://sjackman.ca/ -------------- next part -------------- An HTML attachment was scrubbed... 
URL:

From carsonhh at gmail.com Thu Jul 16 11:55:27 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 16 Jul 2015 11:55:27 -0600
Subject: [maker-devel] Short introns
In-Reply-To:
References:
Message-ID: <8B3943E4-1D25-4C54-8A35-E64AC4A93689@gmail.com>

Look at the region. If it's being suggested by the polished alignment, I somewhat doubt it's just an insertion, because the polished alignment will have valid splice sites. It could be an insertion, but one that perfectly maps around canonical splice sites would be quite the coincidence (because exonerate shouldn't make big gaps to force the alignment). However, if it looks more like a forced mapping around non-canonical splice sites (which shouldn't actually produce protein2genome results), then I might support the idea that it's an insertion. A 250 bp intron, or even a 100 bp one, doesn't really seem that short to me. The lower range seen in fungi, for example (which have very short introns), can get close to about 20 bp. I guess it's possible that the protein evidence you are using contains an intron that isn't really there, which results in an intron in your job because of protein conservation (i.e. conserved codons contain the falsely used splice site).

The minimum intron given to exonerate for polishing is 20. It's hard coded, and you would have to manually edit it.

Line 1534 in maker/lib/GI.pm --> my $min_intron = 20;

--Carson

> On Jul 15, 2015, at 6:36 PM, Shaun Jackman wrote:
>
> Hi, Carson.
>
> I'm using protein evidence and protein2genome alone without ab initio gene finders to annotate an organellar genome. MAKER annotates 16 introns. 6 introns look real (according to RNAweasel) and are all larger than 900 bp. The other 10 introns are all shorter than 250 bp and multiples of 3 bp. These short introns look like genomic insertions rather than introns to me. Is there a way to specify a minimum intron size to MAKER?
>
> 6 of these short introns do not contain a stop codon, and 4 do contain a stop codon. I suppose these 4 are pseudogenes.
>
> Thanks,
> Shaun
>
> --
> http://sjackman.ca/
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From sjackman at gmail.com Thu Jul 16 13:11:32 2015
From: sjackman at gmail.com (Shaun Jackman)
Date: Thu, 16 Jul 2015 12:11:32 -0700
Subject: [maker-devel] Short introns
In-Reply-To: <8B3943E4-1D25-4C54-8A35-E64AC4A93689@gmail.com>
References: <8B3943E4-1D25-4C54-8A35-E64AC4A93689@gmail.com>
Message-ID:

Hi, Carson.

One of the ten questionable introns has a canonical GT-AG splice site and is 33 bp. The splice sites are GA-AG, GG-GG, GC-AG, GT-CG, GA-AT, GG-AA, GG-AG, AT-TT, GG-AT and GT-AG. The intron sizes are 33, 111, 84, 30, 219, 186, 51, 30, 45 and 33.

I was wrong about there being stop codons in the questionable introns. All ten are completely free of stop codons. Sorry for the confusion. I had extracted just the intron sequence and translated the first frame, but the intron was not aligned to a 3-nucleotide boundary.

I am convinced that these short introns are in fact genomic insertions and not introns. The root cause may be incorrectly annotated introns in the protein evidence, as you suggest.

I'll try tweaking the $min_intron parameter. Could this parameter be added to the maker_opts.ctl configuration file?

Thanks for your help, Carson. Cheers,
Shaun

--
http://sjackman.ca/

On 2015-July-16 at 10:55:32, Carson Holt (carsonhh at gmail.com) wrote:

Look at the region. If it's being suggested by the polished alignment, I somewhat doubt it's just an insertion, because the polished alignment will have valid splice sites. It could be an insertion, but one that perfectly maps around canonical splice sites would be quite the coincidence (because exonerate shouldn't make big gaps to force the alignment). However, if it looks more like a forced mapping around non-canonical splice sites (which shouldn't actually produce protein2genome results), then I might support the idea that it's an insertion. A 250 bp intron, or even a 100 bp one, doesn't really seem that short to me. The lower range seen in fungi, for example (which have very short introns), can get close to about 20 bp. I guess it's possible that the protein evidence you are using contains an intron that isn't really there, which results in an intron in your job because of protein conservation (i.e. conserved codons contain the falsely used splice site).

The minimum intron given to exonerate for polishing is 20. It's hard coded, and you would have to manually edit it.

Line 1534 in maker/lib/GI.pm --> my $min_intron = 20;

--Carson

On Jul 15, 2015, at 6:36 PM, Shaun Jackman wrote:

Hi, Carson.

I'm using protein evidence and protein2genome alone without ab initio gene finders to annotate an organellar genome. MAKER annotates 16 introns. 6 introns look real (according to RNAweasel) and are all larger than 900 bp. The other 10 introns are all shorter than 250 bp and multiples of 3 bp. These short introns look like genomic insertions rather than introns to me. Is there a way to specify a minimum intron size to MAKER?

6 of these short introns do not contain a stop codon, and 4 do contain a stop codon. I suppose these 4 are pseudogenes.

Thanks,
Shaun

--
http://sjackman.ca/
_______________________________________________
maker-devel mailing list
maker-devel at box290.bluehost.com
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
-------------- next part --------------
An HTML attachment was scrubbed...
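The figures in Shaun's 16 July message can be checked mechanically. The sketch below is an illustration only: it assumes the splice sites and sizes are listed in the same order, so that they can be paired by position (the message does not say so explicitly), and the function name and the 250 bp threshold argument are hypothetical.

```python
# Values copied from the message; pairing them by position is an assumption.
SPLICE_SITES = ["GA-AG", "GG-GG", "GC-AG", "GT-CG", "GA-AT",
                "GG-AA", "GG-AG", "AT-TT", "GG-AT", "GT-AG"]
SIZES_BP = [33, 111, 84, 30, 219, 186, 51, 30, 45, 33]


def summarize_introns(sites, sizes, min_intron=250):
    """Flag each putative intron as canonical, in frame, and below the cutoff."""
    rows = []
    for site, size in zip(sites, sizes):
        rows.append({
            "site": site,
            "size": size,
            "canonical": site == "GT-AG",    # canonical donor-acceptor pair
            "in_frame": size % 3 == 0,       # a multiple of 3 keeps the frame
            "below_min": size < min_intron,  # would be rejected at this cutoff
        })
    return rows


rows = summarize_introns(SPLICE_SITES, SIZES_BP)
print(sum(r["canonical"] for r in rows), "canonical of", len(rows))
print("all in frame:", all(r["in_frame"] for r in rows))
print("all below cutoff:", all(r["below_min"] for r in rows))
```

An in-frame length combined with a non-canonical splice site is the pattern expected of an insertion relative to the protein evidence, which is the interpretation Shaun settles on.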
From carsonhh at gmail.com  Thu Jul 16 14:10:09 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 16 Jul 2015 14:10:09 -0600
Subject: [maker-devel] Short introns
In-Reply-To:
References: <8B3943E4-1D25-4C54-8A35-E64AC4A93689@gmail.com>
Message-ID: <6B98C302-063B-4BAF-8F6E-2EDE9D8CBDA3@gmail.com>

I can add it to the development version.

--Carson

> On Jul 16, 2015, at 1:11 PM, Shaun Jackman wrote:
>
> Hi, Carson.
>
> One of the ten questionable introns has a canonical GT-AG splice site and is 33 bp. The splice sites are GA-AG, GG-GG, GC-AG, GT-CG, GA-AT, GG-AA, GG-AG, AT-TT, GG-AT and GT-AG. The intron sizes are 33, 111, 84, 30, 219, 186, 51, 30, 45 and 33.
>
> I was wrong about there being stop codons in the questionable introns. All ten are completely free of stop codons. Sorry for the confusion. I had extracted just the intron sequence and translated the first frame, but the intron was not aligned to a 3-nucleotide boundary.
>
> I am convinced that these short introns are in fact genomic insertions and not introns. The root cause may be incorrectly annotated introns in the protein evidence, as you suggest.
>
> I'll try tweaking the $min_intron parameter. Could this parameter be added to the maker_opts.ctl configuration file?
>
> Thanks for your help, Carson. Cheers,
> Shaun
>
> --
> http://sjackman.ca/

From sjackman at gmail.com  Thu Jul 16 17:25:27 2015
From: sjackman at gmail.com (Shaun Jackman)
Date: Thu, 16 Jul 2015 16:25:27 -0700
Subject: [maker-devel] MAKER and RepeatModeler
Message-ID:

Hi, Carson.

I removed two small contaminant contigs (~7 kbp) from the assembly (~6 Mbp), and MAKER found four fewer genes, four copies of the same atp8 gene, but these genes were not in the contaminant contigs.

I figured out that it's because I'm running RepeatModeler to create the rmlib for MAKER. When I remove the contaminant contigs, RepeatModeler now identifies this gene atp8 as being an LTR/Gypsy repeat.

Any thoughts on why removing two contigs would cause RepeatModeler to identify new repeats?

Cheers,
Shaun

--
http://sjackman.ca/

From carsonhh at gmail.com  Fri Jul 17 10:40:53 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 17 Jul 2015 10:40:53 -0600
Subject: [maker-devel] MAKER and RepeatModeler
In-Reply-To:
References:
Message-ID:

That is weird.

One thought though. When you run MAKER, do you supply both rmlib and model_org, or just rmlib? If you are only supplying rmlib, you could try supplying both together (RepeatMasker will then run twice). That way some of the edge cases might be better identified.

--Carson
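
Carson's suggestion to supply both options corresponds to two entries in the repeat-masking section of maker_opts.ctl. A sketch, assuming the custom library was saved as RepeatModeler.fa (a placeholder file name) and using model_org=all as a stand-in for whatever taxonomy term suits the genome:

```
#-----Repeat Masking (fragment of maker_opts.ctl, values are placeholders)
model_org=all            #taxonomy term selecting the RepBase subset
rmlib=RepeatModeler.fa   #custom repeat library in FASTA format
```

With both entries set, RepeatMasker runs once per library, as the reply above notes.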
From sjackman at gmail.com  Fri Jul 17 11:20:54 2015
From: sjackman at gmail.com (Shaun Jackman)
Date: Fri, 17 Jul 2015 10:20:54 -0700
Subject: [maker-devel] MAKER and RepeatModeler
In-Reply-To:
References:
Message-ID:

Hi, Carson.

I set model_org=picea. I see that it created a new database in the RepeatModeler folder Libraries/20140131/picea/specieslib. What is the effect of the model_org option? Does it extract sequences from RepBase that match the string picea?

Cheers,
Shaun

--
http://sjackman.ca/
From carsonhh at gmail.com  Fri Jul 17 11:24:20 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 17 Jul 2015 11:24:20 -0600
Subject: [maker-devel] MAKER and RepeatModeler
In-Reply-To:
References:
Message-ID:

Yes. It takes a subset of RepBase. If runtime isn't an issue and you really want to mask as much as possible, you can also set model_org=all. Most of whatever else is in RepBase probably won't align anywhere, but it may give you marginally better sensitivity.

--Carson

From carsonhh at gmail.com  Fri Jul 17 11:36:46 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 17 Jul 2015 11:36:46 -0600
Subject: [maker-devel] MAKER and RepeatModeler
In-Reply-To:
References:
Message-ID:

The subset is actually built from a taxonomy, so you can extract all repeats for a species or genus, for example. If a term doesn't match the internal taxonomy, it throws an error.

--Carson

From sjackman at gmail.com  Fri Jul 17 14:29:49 2015
From: sjackman at gmail.com (Shaun Jackman)
Date: Fri, 17 Jul 2015 13:29:49 -0700
Subject: [maker-devel] MAKER and RepeatModeler
In-Reply-To:
References:
Message-ID:

Hi, Carson.

It seems that RepeatModeler is not deterministic. I run it five times on the same sequence and get very different outputs. Two of these five runs mark atp8 as a repeat, which is why I have genes blinking in and out of existence.

How do folk deal with this situation? It seems absurd. What's the cause of the non-determinism? Random number generator? Threading? Can I get deterministic behaviour if I set the seed of the random number generator and use it single-threaded? I don't see how I can implement a reproducible pipeline with the situation as it is.

This has become a RepeatModeler question more than a MAKER question, but I thought I'd continue this thread that I'd started here.

n   n:1  L50  min  N80   N50    N20    E-size  max    sum    name
6   6    1    289  7667  12403  12403  9102    12403  24293  RepeatModeler1.fa
6   6    1    332  4023  14769  14769  10920   14769  21738  RepeatModeler2.fa
6   6    1    244  370   2731   2731   1765    2731   4688   RepeatModeler3.fa
10  10   1    354  2114  17134  17134  11354   17134  30782  RepeatModeler4.fa
8   8    3    538  1093  1750   2526   1706    2526   10713  RepeatModeler5.fa

My command line is

BuildDatabase -name x -engine ncbi x.fa
RepeatModeler -database x
cp -a RM_*/consensi.fa.classified RepeatModeler.fa

I installed the following software using Homebrew on a Mac.

repeatmodeler 1.0.8
recon 1.07
repeatmasker 4.0.5
repeatscout 1.0.5
rmblast 2.2.28
trf 4.07b

Cheers,
Shaun

--
http://sjackman.ca/

From sjackman at gmail.com  Tue Jul 21 13:10:11 2015
From: sjackman at gmail.com (Shaun Jackman)
Date: Tue, 21 Jul 2015 12:10:11 -0700
Subject: [maker-devel] Cryptic ACG start codon
Message-ID:

Hi, Carson.

I'm working with a plant mitochondrial genome that has a lot of C to U RNA editing. One effect of this editing is that AUG start codons can be created by editing ACG to AUG. Does MAKER have any particular support for cryptic start codons?

I'm using protein evidence (protein and protein2genome), and a number of the protein sequences that I downloaded from GenBank start with a - character, which indicates an ACG start codon. It would be fantastic if - were allowed to match either ACG or ATG in the genome.

Cheers,
Shaun

--
http://sjackman.ca/
From carsonhh at gmail.com  Tue Jul 21 16:28:09 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 21 Jul 2015 16:28:09 -0600
Subject: [maker-devel] Cryptic ACG start codon
In-Reply-To:
References:
Message-ID:

MAKER uses the is_start_codon method from Bio::Tools::CodonTable to determine if a codon is a valid start codon. Right now I don't have a way to swap out the codon table. There is a way to do it, but it's not easy. If you edit .../maker/lib/CGL/TranslationMachine.pm line 122, you can set the table id to be another one from the BioPerl docs --> http://doc.bioperl.org/releases/bioperl-1.6.1/Bio/Tools/CodonTable.html#BEGIN1

Or you can manually add your own codon table. It won't change the codon usage for aligners like BLAST and Exonerate, but it will allow you to specify another valid start codon. To do that, edit line 118 to add your own manual codon table by adding another 'M' below the position you want to make into a valid start codon.

    my $id = $self->add_table( 'Strict',
                               'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG',
                               '-----------------------------------M----------------------------');
    $self->id($id);

I don't really know which string position goes with which three letter nucleotide code. You might have to reverse engineer that from the BioPerl docs in the link above.

--Carson
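
The open question in the reply above (which string position belongs to which codon) can be worked out under one assumption: that the table strings follow the NCBI genetic-code convention, in which codons are enumerated with the bases in the order T, C, A, G. That assumption is consistent with the quoted strings, where the single 'M' in the start line sits at the position of ATG. A minimal Python sketch follows; the function names are illustrative and are not part of MAKER or BioPerl, and whether the edited table makes MAKER accept ACG has not been tested here.

```python
# Assumed base ordering (NCBI genetic-code convention): T, C, A, G.
BASES = "TCAG"

# Amino-acid string quoted in the message above.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Start-codon string from the message: 35 dashes, 'M', 28 dashes.
START = "-" * 35 + "M" + "-" * 28


def codon_index(codon):
    """Return the 0-based position of a codon in a 64-character table string."""
    b1, b2, b3 = (BASES.index(b) for b in codon.upper().replace("U", "T"))
    return 16 * b1 + 4 * b2 + b3


def add_start(starts, codon):
    """Return a copy of the start string with 'M' placed under the given codon."""
    i = codon_index(codon)
    return starts[:i] + "M" + starts[i + 1:]


if __name__ == "__main__":
    print(codon_index("ATG"), AMINO[codon_index("ATG")])  # 35 M
    print(codon_index("ACG"), AMINO[codon_index("ACG")])  # 39 T
    print(add_start(START, "ACG"))
```

Under this ordering ACG is the 40th character (index 39), so the extra 'M' would go four positions to the right of the existing one.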
From felix.bemm at uni-wuerzburg.de  Mon Jul 27 07:46:44 2015
From: felix.bemm at uni-wuerzburg.de (Felix Bemm)
Date: Mon, 27 Jul 2015 15:46:44 +0200
Subject: [maker-devel] Annotation of 32Mb pseudochromosome
Message-ID: <55B63644.8020509@uni-wuerzburg.de>

Hi,

I am trying to annotate a finished genome (122Mb) and maker loops forever over one of its pseudochromosomes (32Mb). One of the child processes for that chromosome dies with

DIED RANK 13:4:0:83
DIED COUNT 1

The rb.out file contains: RepeatMasker quit because the file pseudochr_Chr1.82.arabidopsis.rb only contains ambiguous bases, if any.

The assembly contains about 1,997,703 N's. Could it be that maker accidentally creates a 200kb chunk that consists entirely of N's and then crashes during repeat annotation? What about setting max_dna_len to something like 25000000 and then trying again?

Cheers,
Felix

--
University of Würzburg, Department Bioinformatics
Group Evolutionary Computational Biology
Biocentre, 97074 Würzburg, Germany

From dence at genetics.utah.edu  Mon Jul 27 10:15:29 2015
From: dence at genetics.utah.edu (Daniel Ence)
Date: Mon, 27 Jul 2015 16:15:29 +0000
Subject: [maker-devel] Annotation of 32Mb pseudochromosome
In-Reply-To: <55B63644.8020509@uni-wuerzburg.de>
References: <55B63644.8020509@uni-wuerzburg.de>
Message-ID: <13B84042-B9A7-448F-87E3-7D8379F70C1E@genetics.utah.edu>

Hi Felix, I think that idea about the chunk sounds plausible, depending on how much of the sequence for that pseudochromosome is N's. Is there an error message besides the DIED for that pseudochromosome?
~Daniel

Daniel Ence
Graduate Student
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330

From carsonhh at gmail.com  Mon Jul 27 11:12:58 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 27 Jul 2015 11:12:58 -0600
Subject: [maker-devel] Annotation of 32Mb pseudochromosome
In-Reply-To: <13B84042-B9A7-448F-87E3-7D8379F70C1E@genetics.utah.edu>
References: <55B63644.8020509@uni-wuerzburg.de> <13B84042-B9A7-448F-87E3-7D8379F70C1E@genetics.utah.edu>
Message-ID: <5AB595A5-FDA9-4ED1-A21E-EE9F1D196E30@gmail.com>

You can also try installing RepeatMasker with rmblast as the default aligner or hmmer as the default. That will alter its behavior. Also make sure you are using blast+ version 2.2.28.
Do not use blast+ version 2.2.29.

--Carson
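
The all-N chunk hypothesis raised in this thread can be checked before rerunning MAKER by cutting the pseudochromosome into fixed-size pieces and looking for pieces with no unambiguous base. A minimal Python sketch, assuming simple non-overlapping chunks: MAKER's own chunking may differ, so the coordinates are only a guide, and the function name is made up for illustration.

```python
def all_n_chunks(sequence, chunk_size):
    """Return (start, end) for every chunk that contains no A, C, G or T."""
    hits = []
    for start in range(0, len(sequence), chunk_size):
        chunk = sequence[start:start + chunk_size].upper()
        # A chunk is a hit when none of its characters is an unambiguous base.
        if chunk and not any(base in "ACGT" for base in chunk):
            hits.append((start, start + len(chunk)))
    return hits


if __name__ == "__main__":
    # Toy sequence: 20 real bases, a 20 bp gap of N's, 20 real bases.
    toy = "ACGT" * 5 + "N" * 20 + "ACGT" * 5
    print(all_n_chunks(toy, 10))  # [(20, 30), (30, 40)]
```

On a real assembly the sequence would be read from the FASTA file and chunk_size set to the max_dna_len value in use.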
From cjfields at illinois.edu  Thu Jul 30 10:27:45 2015
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Thu, 30 Jul 2015 16:27:45 +0000
Subject: [maker-devel] MAKER and RepeatModeler
In-Reply-To:
References:
Message-ID: <40D3E41C-0915-4B42-BF9C-DD779F2D5D06@illinois.edu>

Hi Shaun

Ever get an answer on this one from the RepeatMasker folks? I've seen (and expect) non-deterministic results from a few tools but the results shouldn't change *that* dramatically.

chris

From sjackman at gmail.com  Thu Jul 30 12:11:48 2015
From: sjackman at gmail.com (Shaun Jackman)
Date: Thu, 30 Jul 2015 11:11:48 -0700
Subject: [maker-devel] MAKER and RepeatModeler
Message-ID:

Hi, Chris.

Yes, I did get a response from the RepeatModeler author, Robert Hubley (cc'ed). There's no public mailing list, as far as I know, so it's all in private communication. Yes, RepeatModeler is non-deterministic. I suggested that the random seed be added as a parameter to RepeatModeler, and Robert agreed.

I'm still not sure why the results were so variable (between 5 kbp and 30 kbp annotated as repeats; see the table in my earlier message). Perhaps it's because my genome is much smaller (6 Mbp) than the size of the random sample (40 Mbp) that RepeatModeler uses. See immediately below. Robert?

Cheers,
Shaun

RepeatModeler Round # 1
========================
Searching for Repeats
 -- Sampling from the database...
   - Gathering up to 40000000 bp
   - Final Sample Size = 6001210 bp ( 5937815 non ambiguous )
   - Num Contigs Represented = 38

--
http://sjackman.ca/
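
The seed parameter proposed in this thread would make the sampling step repeatable. The sketch below is not RepeatModeler code; it only illustrates, with Python's random module and hypothetical names, that drawing sample windows from a generator with a fixed seed returns the same windows on every run, while an unseeded generator gives no such guarantee.

```python
import random


def sample_windows(genome_length, window, count, seed=None):
    """Draw `count` window start positions; a fixed seed makes the draw repeatable."""
    rng = random.Random(seed)  # seed=None means the generator is seeded unpredictably
    last_start = genome_length - window
    return sorted(rng.randint(0, last_start) for _ in range(count))


if __name__ == "__main__":
    # Same seed, same windows, on every run.
    first = sample_windows(6_000_000, 40_000, 5, seed=42)
    second = sample_windows(6_000_000, 40_000, 5, seed=42)
    print(first == second)  # True
```

A pipeline that records the seed alongside its other parameters can then reproduce the sampled regions exactly.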
From sjackman at gmail.com  Thu Jul 30 17:21:53 2015
From: sjackman at gmail.com (Shaun Jackman)
Date: Thu, 30 Jul 2015 16:21:53 -0700
Subject: [maker-devel] Short introns
In-Reply-To: <6B98C302-063B-4BAF-8F6E-2EDE9D8CBDA3@gmail.com>
References: <8B3943E4-1D25-4C54-8A35-E64AC4A93689@gmail.com> <6B98C302-063B-4BAF-8F6E-2EDE9D8CBDA3@gmail.com>
Message-ID:

Hi, Carson.

Increasing $min_intron to 250 did the trick! Thanks for the tip. For this organellar genome, all the introns (at least that I've found) are group II self-splicing introns, so they have a much larger minimum size (about 990 bp) than introns in other genomes.

https://en.wikipedia.org/wiki/Group_II_intron

Increasing $min_intron worked perfectly for all but one gene. Decreasing --codongapopen and --codongapextend rescued that one gene, but I'll probably just leave the defaults and annotate that one gene by hand.

Note that $min_intron had no effect on protein2genome without the following patch. Will there be a stable release soon that includes this patch and the min_intron option? I'm preparing a manuscript for submission, and I'd love to be able to refer to a stable version of MAKER in the manuscript that includes this feature.

Have you considered moving the MAKER development to GitHub?

Thanks again.

Cheers,
Shaun

diff --git a/protein.pm.orig b/protein.pm
--- a/protein.pm.orig
+++ b/protein.pm
@@ -94,11 +94,11 @@ sub runExonerate {
 
     my $command  = "$exe -q $q_file -t $t_file -Q protein -T dna ";
     $command .= "-m protein2genome --softmasktarget ";
+    $command .= " --minintron $min_intron --maxintron $max_intron --showcigar";
     $command .= " --percent $percent";
     if ($matrix) {
         $command .= " --proteinsubmat $matrix";
     }
-    $command .= " --showcigar ";
     $command .= " > $o_file";
 
     my $w = new Widget::exonerate::protein2genome();

--
http://sjackman.ca/

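The patched command assembly in the diff above can be mirrored outside of Perl, for instance to inspect which flags exonerate would receive. A minimal sketch in Python, assuming only the flags that appear in the diff; the function name, the file names and the max_intron and percent values in the example are hypothetical placeholders, not MAKER's API or defaults.

```python
import shlex


def exonerate_p2g_command(exe, q_file, t_file, o_file, min_intron,
                          max_intron, percent, matrix=None):
    """Return the protein2genome command line as a single shell string.

    Mirrors the patched runExonerate shown in the diff; this is an
    illustrative helper, not part of MAKER.
    """
    args = [exe, "-q", q_file, "-t", t_file, "-Q", "protein", "-T", "dna",
            "-m", "protein2genome", "--softmasktarget",
            "--minintron", str(min_intron),
            "--maxintron", str(max_intron),
            "--showcigar",
            "--percent", str(percent)]
    if matrix:
        # The substitution matrix is optional, as in the Perl code.
        args += ["--proteinsubmat", matrix]
    # Quote each argument, then redirect standard output to the result file.
    return " ".join(shlex.quote(a) for a in args) + " > " + shlex.quote(o_file)


if __name__ == "__main__":
    # Hypothetical file names and limits, for illustration only.
    print(exonerate_p2g_command("exonerate", "proteins.fasta", "genome.fasta",
                                "out.exonerate", 250, 20000, 20))
```

Building the argument list first and quoting it afterwards keeps the intron limits next to the alignment model, which is the point of the patch: the limits only take effect if they are actually passed to the protein2genome run.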
From arnstrm at gmail.com Wed Jul 1 08:42:53 2015
From: arnstrm at gmail.com (Arun Seetharam)
Date: Wed, 1 Jul 2015 09:42:53 -0500
Subject: [maker-devel] gff3_merge error: FINISHED.gff does not exist
Message-ID:

Hi all,

I am unable to generate the GFF file from the MAKER output. I originally ran MAKER with several processors (using MPI) as follows:

/usr/lib64/mpich/bin/mpiexec -machinefile $PBS_NODEFILE -n 160 maker -base maker_R1 -fix_nucleotides

Once it completed, I tried to create the GFF file with the gff3_merge script:

$ gff3_merge -d maker_R1.maker.output/maker_R1_master_datastore_index.log
ERROR: The file 'maker_R1.maker.output/./FINISHED/FINISHED.gff' does not exist

I searched online and found this thread, which said to build the index again if you ran MPI-enabled MAKER, so I ran it as:

$ maker -base maker_R1 -dsindex
STATUS: Parsing control files...
STATUS: Processing and indexing input FASTA files...
STATUS: Setting up database for any GFF3 input...
A data structure will be created for you at:
/data003/TIL/maker_R1.maker.output/maker_R1_datastore

To access files for individual sequences use the datastore index:
/data003/TIL/maker_R1.maker.output/maker_R1_master_datastore_index.log

When I tried again to run gff3_merge:

$ gff3_merge -d maker_R1.maker.output/maker_R1_master_datastore_index.log
ERROR: The file 'maker_R1.maker.output/./FINISHED/FINISHED.gff' does not exist

I still get the same error. Please tell me what I am doing wrong. My worst fear is that if I have to re-run it, it will take another week or so. Both my genomes have the same issue.

Any help will be greatly appreciated.

Thanks,

--
Arun Seetharam

From carsonhh at gmail.com Wed Jul 1 11:11:44 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 1 Jul 2015 11:11:44 -0600
Subject: [maker-devel] gff3_merge error: FINISHED.gff does not exist
In-Reply-To:
References:
Message-ID:

Your datastore log is munged.
This sometimes happens with an IO collision. Delete it and run maker on a single CPU using the -dsindex option. All it will do is rebuild the datastore log. It takes less than 5 minutes.

--Carson

From carsonhh at gmail.com Wed Jul 1 11:16:15 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 1 Jul 2015 11:16:15 -0600
Subject: [maker-devel] Best way to assemble RNA-seq for MAKER
In-Reply-To:
References:
Message-ID:

I would pool them. You will get better coverage of low-expression transcripts. While there may be differently spliced transcripts among the tissues, MAKER and all gene prediction programs used by MAKER are by default not going to try to work out alternate splicing anyway. You can tell it to (altsplice= option), but your EST evidence has to be near perfect end-to-end for that to work.

--Carson

> On Jun 29, 2015, at 4:45 PM, John Cornelius wrote:
>
> Hi I have a quick question, I have RNA-seq from several different tissue types and I was wondering, would it be better to pool them and assemble them as one large transcriptome? Or, should I assemble each tissue separately and then use MAKER to integrate the smaller assemblies into the annotation? Thanks.
>
> --
> John Cornelius
> MCB PhD Candidate
> Arizona State University

From eric.ganko at syngenta.com Mon Jul 6 09:37:28 2015
From: eric.ganko at syngenta.com (Ganko Eric USRE)
Date: Mon, 6 Jul 2015 15:37:28 +0000
Subject: [maker-devel] MAKER processing time in a 2Gb genome
Message-ID:

I'm hoping for some advice on an unexpectedly long process time for a 2Gb genome.
Currently I'm using an install of MAKER-P on the iForge system @ NCSA and I've successfully run ~1Gb genomes in 2-3 hours across 20 nodes (24 Intel "Haswell" cores, 64 GB of RAM per node) via MPICH.

I recently ran some tests on 50Mb of corn that took ~2 hours on 2 nodes (48 cores). Based on that I was surprised when the full 2Gb corn genome run timed out at >24h with 30 nodes (720 cores); in that time it hadn't processed many sequences based on the master_datastore_index.log:

TOTAL: 25000 seqs
STARTED: 3594
FINISHED: 2979
FAILED: 10
RETRY: 9
DIED_SKIPPED_PERMANENT: 0
SKIPPED_SMALL: 7635

While I can set a longer wall clock, these results are several times longer than what was reported in the MAKER-P paper, i.e. running the corn B73 genome in less than 4 hours; here it is not close to done after 24h. I don't have an enormous amount of supporting data; this trial run has ~100k transcripts and another ~100k proteins. Corn has a very high repeat content, so my suspicion is RepeatMasker IO. In discussions with the iForge admins I have discovered that the temp space is network attached (GPFS), and they've suggested using a RAM disk (i.e. /dev/shm) as the temp directory. In tests on smaller sequences that ran a little slower, so I'm not sure if MAKER is meant to run that way.

I'd appreciate input on experience with a RAM disk approach, or if anyone has alternative thoughts or suggestions?

Thanks,
Eric

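A status summary like the one in the message above can be tallied from the index log with a few lines of Python. This is a minimal sketch, assuming each line of master_datastore_index.log holds three tab-separated fields (sequence name, datastore path, status) and that the last status written for a sequence is its current one. Both the layout and the function name are assumptions made for illustration, so check them against your own log before relying on the numbers.

```python
from collections import Counter


def tally_statuses(lines):
    """Count sequences by their most recent status.

    Assumes tab-separated 'name<TAB>path<TAB>STATUS' lines (an assumption,
    not taken from the thread); later lines override earlier ones.
    """
    latest = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, status = fields[0], fields[-1]
        latest[name] = status
    return Counter(latest.values())


if __name__ == "__main__":
    # Hypothetical log lines, for illustration only.
    example = [
        "contig1\tdatastore/contig1/\tSTARTED\n",
        "contig1\tdatastore/contig1/\tFINISHED\n",
        "contig2\tdatastore/contig2/\tSTARTED\n",
    ]
    for status, count in sorted(tally_statuses(example).items()):
        print(status, count)
```

Counting the latest status per sequence shows how many sequences currently sit in each state, which may differ from a raw count of log lines when a sequence has been started more than once.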
From jcornel3 at asu.edu Tue Jul 7 21:59:47 2015
From: jcornel3 at asu.edu (John Cornelius)
Date: Tue, 7 Jul 2015 20:59:47 -0700
Subject: [maker-devel] Ability to process transcriptomes from different assemblers
Message-ID:

Hello, I have several tissues that I have assembled with two different transcriptome assemblers and I was wondering, should I feed these different assemblies directly into MAKER? Or should I use something like EVidenceModeler to combine them first? Thanks.

--
John Cornelius
MCB PhD Candidate
Arizona State University

From carsonhh at gmail.com Thu Jul 9 12:55:38 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 9 Jul 2015 12:55:38 -0600
Subject: [maker-devel] Ability to process transcriptomes from different assemblers
In-Reply-To:
References:
Message-ID: <4A01AD3F-2CD6-4252-9A72-A3FA1E835CFB@gmail.com>

You should be able to just supply both.

--Carson

From carsonhh at gmail.com Thu Jul 9 13:01:13 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 9 Jul 2015 13:01:13 -0600
Subject: [maker-devel] MAKER processing time in a 2Gb genome
In-Reply-To:
References:
Message-ID: <8D11F342-8BD5-406E-B047-8CB0F2121CD3@gmail.com>

Runtimes are the result of gene density, evidence dataset size, and evidence dataset type.
For example, protein data takes ~10 times longer to process than EST data, and alt-EST data takes ~10 times longer than protein data. If you double the size of the input datasets, then you double the runtime. Also, the assembly size doesn't seem to have a large effect on runtime. It tends to be gene density that has the most effect, so a 2Gb assembly runs only somewhat slower than a 300Mb assembly containing the same number of genes.

For best MPI performance, you can submit multiple jobs with 200 CPUs or less. Over 200 CPUs per job tends to get limited throughput increases due to MPI communication overhead.

I never use RAM disk. In general MAKER produces too many temporary files to fit in RAM.

--Carson

From eric.ganko at syngenta.com Thu Jul 9 14:36:59 2015
From: eric.ganko at syngenta.com (Ganko Eric USRE)
Date: Thu, 9 Jul 2015 20:36:59 +0000
Subject: [maker-devel] MAKER processing time in a 2Gb genome
In-Reply-To: <8D11F342-8BD5-406E-B047-8CB0F2121CD3@gmail.com>
References: <8D11F342-8BD5-406E-B047-8CB0F2121CD3@gmail.com>
Message-ID:

Tuesday I ran the same option files, this time with 480 cores, and the annotation completed in ~6 hours. Perhaps I'm trying too many simultaneous writes at higher levels, or there is too much MPI communication as you mentioned?

Thanks for the input on the RAM disk.

-eric

From panos.ioannidis at gmail.com Fri Jul 10 03:50:56 2015
From: panos.ioannidis at gmail.com (Panos Ioannidis)
Date: Fri, 10 Jul 2015 11:50:56 +0200
Subject: [maker-devel] Repeats...
Message-ID:

Hi guys,

I have finished running MAKER on my genome, but get >800 genes (out of ~20,000) that have similarity to transposases. Apart from RepBase, I have also built a species-specific repeat library, so it's weird that I still have quite a few transposases in my gene set...

The repeat masking-related parameters in my maker_opts.ctl file are:

model_org=all #select a model organism for RepBase masking in RepeatMasker
rmlib=consensi.fa.classified #provide an organism specific repeat library in fasta format for RepeatMasker
repeat_protein=/Home/pioannid/Programs/maker/data/te_proteins.fasta #provide a fasta file of transposable element proteins for RepeatRunner
rm_gff= #pre-identified repeat elements from an external GFF3 file
prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)

Does anyone have an idea why I'm getting so many transposases?

Thanks,
Panos

From panos.ioannidis at gmail.com Fri Jul 10 05:45:49 2015
From: panos.ioannidis at gmail.com (Panos Ioannidis)
Date: Fri, 10 Jul 2015 13:45:49 +0200
Subject: [maker-devel] Repeats...
In-Reply-To:
References:
Message-ID:

An additional question related to the previous.
I searched my species-specific repeat library with InterProScan and can't find a single sequence with similarity to a transposable element...

I would expect it to find at least a few transposases. Is there an explanation for this, or has something gone wrong?

Thanks,
P

From panos.ioannidis at gmail.com Mon Jul 13 03:45:08 2015
From: panos.ioannidis at gmail.com (Panos Ioannidis)
Date: Mon, 13 Jul 2015 11:45:08 +0200
Subject: [maker-devel] Repeats...
In-Reply-To:
References:
Message-ID:

Hi Daniel,

Thanks for the reply. No, I'm not using GeneMark. I didn't check for overlap with RepeatMasker elements or transcript/protein evidence, though. But since it is such an unexpected finding, I decided to do something simpler.

So I took all 750 transposases with the same InterPro annotation (IS4 family transposases) and clustered them with CD-HIT (amino acid sequences). At a 90% similarity threshold each transposase goes to its own cluster. At 80% I get 748 clusters...

This means that even though these transposases belong to the same family, they have diverged quite a bit, so that they're no longer considered "repeat elements". And this explains why they were not filtered out by RepeatMasker and made it to the final gene set.

On Fri, Jul 10, 2015 at 5:00 PM, Daniel Ence wrote:

> Hi Panos, Without knowing how you made the species-specific repeat library, I can't speak to why it's giving hits against RepBase. As to the 800 transposases, are they overlapped by RepeatMasker elements? Are they supported by EST or protein evidence? Are you using GeneMark? That ab-initio predictor runs on the unmasked genome sequence, so if the transposases are present in your evidence set, they could still show up as gene models.
>
> ~Daniel
>
> Sent from my iPhone

From sjackman at gmail.com Wed Jul 15 18:36:22 2015
From: sjackman at gmail.com (Shaun Jackman)
Date: Wed, 15 Jul 2015 17:36:22 -0700
Subject: [maker-devel] Short introns
Message-ID:

Hi, Carson.

I'm using protein evidence and protein2genome alone without ab initio gene finders to annotate an organellar genome. MAKER annotates 16 introns. 6 introns look real (according to RNAweasel) and are all larger than 900 bp. The other 10 introns are all shorter than 250 bp and multiples of 3 bp. These short introns look like genomic insertions rather than introns to me. Is there a way to specify a minimum intron size to MAKER?

6 of these short introns do not contain a stop codon, and 4 do contain a stop codon. I suppose these 4 are pseudogenes.

Thanks,
Shaun
--
http://sjackman.ca/

From carsonhh at gmail.com Thu Jul 16 11:55:27 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 16 Jul 2015 11:55:27 -0600
Subject: [maker-devel] Short introns
In-Reply-To:
References:
Message-ID: <8B3943E4-1D25-4C54-8A35-E64AC4A93689@gmail.com>

Look at the region. If it's being suggested by the polished alignment, I somewhat doubt it's just an insertion because the polished alignment will have valid splice sites. It could be an insertion, but one that perfectly maps around canonical splice sites would be quite the coincidence (because exonerate shouldn't make big gaps to force the alignment). However, if it looks more like a forced mapping around non-canonical splice sites (which shouldn't actually produce protein2genome results) then I might support the idea that it's an insertion. A 250 bp intron or even a 100 bp one doesn't really seem that short to me. The lower range seen in fungi, for example (which have very short introns), can get close to about 20 bp. I guess it's possible that the protein evidence you are using contains an intron that isn't really there, which results in an intron in your job because of protein conservation (i.e. conserved codons contain the falsely used splice site).

The minimum intron given to exonerate for polishing is 20. It's hard coded, and you would have to manually edit it.

Line 1534 in maker/lib/GI.pm --> my $min_intron = 20;

--Carson

From sjackman at gmail.com Thu Jul 16 13:11:32 2015
From: sjackman at gmail.com (Shaun Jackman)
Date: Thu, 16 Jul 2015 12:11:32 -0700
Subject: [maker-devel] Short introns
In-Reply-To: <8B3943E4-1D25-4C54-8A35-E64AC4A93689@gmail.com>
References: <8B3943E4-1D25-4C54-8A35-E64AC4A93689@gmail.com>
Message-ID:

Hi, Carson.

One of the ten questionable introns has a canonical GT-AG splice site and is 33 bp. The splice sites are GA-AG, GG-GG, GC-AG, GT-CG, GA-AT, GG-AA, GG-AG, AT-TT, GG-AT and GT-AG. The intron sizes are 33, 111, 84, 30, 219, 186, 51, 30, 45 and 33.

I was wrong about there being stop codons in the questionable introns. All ten are completely free of stop codons. Sorry for the confusion. I had extracted just the intron sequence and translated the first frame, but the intron was not aligned to a 3-nucleotide boundary.

I am convinced that these short introns are in fact genomic insertions and not introns. The root cause may be incorrectly annotated introns in the protein evidence, as you suggest.

I'll try tweaking the $min_intron parameter. Could this parameter be added to the maker_opts.ctl configuration file?

Thanks for your help, Carson. Cheers,
Shaun
--
http://sjackman.ca/

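The intron sizes listed in the message above can be checked against the two properties discussed in this thread: whether each length is a multiple of 3 (so that the insertion keeps the reading frame) and whether it falls below a chosen minimum intron size. A minimal sketch in Python; the sizes are the ten values from the message, the 250 bp cutoff is used only as an illustrative threshold, and the function name is hypothetical.

```python
def classify_introns(sizes, min_intron):
    """Return (size, in_frame, below_min) for each intron length in bp."""
    return [(s, s % 3 == 0, s < min_intron) for s in sizes]


if __name__ == "__main__":
    # Intron sizes (bp) as reported in the thread.
    sizes = [33, 111, 84, 30, 219, 186, 51, 30, 45, 33]
    for size, in_frame, below_min in classify_introns(sizes, min_intron=250):
        print(size,
              "in-frame" if in_frame else "frameshift",
              "below cutoff" if below_min else "kept")
```

An in-frame length is what one would expect from an insertion inside a coding region, which is consistent with the interpretation in the message, though it does not prove it.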
From carsonhh at gmail.com  Thu Jul 16 14:10:09 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 16 Jul 2015 14:10:09 -0600
Subject: [maker-devel] Short introns
In-Reply-To:
References: <8B3943E4-1D25-4C54-8A35-E64AC4A93689@gmail.com>
Message-ID: <6B98C302-063B-4BAF-8F6E-2EDE9D8CBDA3@gmail.com>

I can add it to the development version.

--Carson

From sjackman at gmail.com  Thu Jul 16 17:25:27 2015
From: sjackman at gmail.com (Shaun Jackman)
Date: Thu, 16 Jul 2015 16:25:27 -0700
Subject: [maker-devel] MAKER and RepeatModeler
Message-ID:

Hi, Carson.
I removed two small contaminant contigs (~7 kbp) from the assembly (~6 Mbp), and MAKER found four fewer genes, four copies of the same atp8 gene, but these genes were not in the contaminant contigs.

I figured out that it's because I'm running RepeatModeler to create the rmlib for MAKER. When I remove the contaminant contigs, RepeatModeler now identifies this gene atp8 as being a LTR/Gypsy repeat.

Any thoughts on why removing two contigs would cause RepeatModeler to identify new repeats?

Cheers,
Shaun

--
http://sjackman.ca/

From carsonhh at gmail.com  Fri Jul 17 10:40:53 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 17 Jul 2015 10:40:53 -0600
Subject: [maker-devel] MAKER and RepeatModeler
In-Reply-To:
References:
Message-ID:

That is weird.

One thought though. When you run MAKER, do you supply both rmlib and model_org or just rmlib? If you are only supplying rmlib, you could try supplying both together (RepeatMasker will then run twice). That way some of the edge cases might be better identified.

--Carson

From sjackman at gmail.com  Fri Jul 17 11:20:54 2015
From: sjackman at gmail.com (Shaun Jackman)
Date: Fri, 17 Jul 2015 10:20:54 -0700
Subject: [maker-devel] MAKER and RepeatModeler
In-Reply-To:
References:
Message-ID:

Hi, Carson.

I set model_org=picea. I see that it created a new database in the RepeatModeler folder Libraries/20140131/picea/specieslib. What is the effect of the model_org option? Does it extract sequences from RepBase that match the string picea?

Cheers,
Shaun

--
http://sjackman.ca/

From carsonhh at gmail.com  Fri Jul 17 11:24:20 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 17 Jul 2015 11:24:20 -0600
Subject: [maker-devel] MAKER and RepeatModeler
In-Reply-To:
References:
Message-ID:

Yes. It takes a subset of RepBase. If runtime isn't an issue and you really want to mask as much as possible, you can also set model_org=all. Most of whatever else is in RepBase probably won't align anywhere, but it may give you marginally better sensitivity.

--Carson

From carsonhh at gmail.com  Fri Jul 17 11:36:46 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 17 Jul 2015 11:36:46 -0600
Subject: [maker-devel] MAKER and RepeatModeler
In-Reply-To:
References:
Message-ID:

The subset is actually built off of a taxonomy. So you can extract all repeats for a species or genus, for example. If a term doesn't match the internal taxonomy, it throws an error.

--Carson

From sjackman at gmail.com  Fri Jul 17 14:29:49 2015
From: sjackman at gmail.com (Shaun Jackman)
Date: Fri, 17 Jul 2015 13:29:49 -0700
Subject: [maker-devel] MAKER and RepeatModeler
In-Reply-To:
References:
Message-ID:

Hi, Carson.

It seems that RepeatModeler is not deterministic. I ran it five times on the same sequence and got very different outputs. Two of these five runs mark atp8 as a repeat, which is why I have genes blinking in and out of existence. How do folk deal with this situation? It seems absurd. What's the cause of the non-determinism? Random number generator? Threading? Can I get deterministic behaviour if I set the seed of the random number generator and use it single-threaded? I don't see how I can implement a reproducible pipeline with the situation as it is.
This has become a RepeatModeler question more than a MAKER question, but I thought I'd continue this thread that I'd started here.

n   n:1  L50  min  N80   N50    N20    E-size  max    sum    name
6   6    1    289  7667  12403  12403  9102    12403  24293  RepeatModeler1.fa
6   6    1    332  4023  14769  14769  10920   14769  21738  RepeatModeler2.fa
6   6    1    244  370   2731   2731   1765    2731   4688   RepeatModeler3.fa
10  10   1    354  2114  17134  17134  11354   17134  30782  RepeatModeler4.fa
8   8    3    538  1093  1750   2526   1706    2526   10713  RepeatModeler5.fa

My command line is

BuildDatabase -name x -engine ncbi x.fa
RepeatModeler -database x
cp -a RM_*/consensi.fa.classified RepeatModeler.fa

I installed the following software using Homebrew on a Mac.

repeatmodeler 1.0.8
recon 1.07
repeatmasker 4.0.5
repeatscout 1.0.5
rmblast 2.2.28
trf 4.07b

Cheers,
Shaun

--
http://sjackman.ca/

From sjackman at gmail.com  Tue Jul 21 13:10:11 2015
From: sjackman at gmail.com (Shaun Jackman)
Date: Tue, 21 Jul 2015 12:10:11 -0700
Subject: [maker-devel] Cryptic ACG start codon
Message-ID:

Hi, Carson.

I'm working with a plant mitochondrial genome that has a lot of C to U RNA editing. One effect of this editing is that AUG start codons can be created by editing ACG to AUG. Does MAKER have any particular support for cryptic start codons?

I'm using protein evidence (protein and protein2genome), and a number of the protein sequences that I downloaded from GenBank start with a - character, which indicates an ACG start codon. It would be fantastic if - were allowed to match either ACG or ATG in the genome.

Cheers,
Shaun

--
http://sjackman.ca/
From carsonhh at gmail.com  Tue Jul 21 16:28:09 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 21 Jul 2015 16:28:09 -0600
Subject: [maker-devel] Cryptic ACG start codon
In-Reply-To:
References:
Message-ID:

MAKER uses the is_start_codon method from Bio::Tools::CodonTable to determine if a codon is a valid start codon. Right now I don't have a way to swap out the codon table. There is a way to do it, but it's not easy. If you edit .../maker/lib/CGL/TranslationMachine.pm line 122, you can set the table id to be another one from the BioPerl docs -> http://doc.bioperl.org/releases/bioperl-1.6.1/Bio/Tools/CodonTable.html#BEGIN1

Or you can manually add your own codon table. It won't change the codon usage for aligners like BLAST and Exonerate, but it will allow you to specify another valid start codon. To do that, edit line 118 to add your own manual codon table by adding another "M" below the position you want to make into a valid start codon.

my $id = $self->add_table(
    'Strict',
    'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG',
    '-----------------------------------M----------------------------');
$self->id($id);

I don't really know which string position goes with which three-letter nucleotide code. You might have to reverse engineer that from the BioPerl docs in the link above.

--Carson
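On the question of which string position goes with which codon: the 64-character table quoted in this thread reads as the usual NCBI-style layout, where the three bases cycle in the order T, C, A, G with the first base changing slowest. Under that assumption (it matches the string itself, e.g. the single M lining up with ATG), the position can be computed directly. This is a small Python sketch for working out the position, not MAKER or BioPerl code; codon_index and add_start are hypothetical helper names.

```python
BASES = "TCAG"  # assumed base ordering of the 64-character table strings

# Amino acid string as quoted in the thread; the start string has a single M for ATG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STARTS = "-" * 35 + "M" + "-" * 28


def codon_index(codon):
    """Return the 0-based position of a codon in a TCAG-ordered 64-character table."""
    codon = codon.upper().replace("U", "T")
    if len(codon) != 3:
        raise ValueError("codon must have exactly three bases")
    first, second, third = (BASES.index(base) for base in codon)
    return 16 * first + 4 * second + third


def add_start(starts, codon):
    """Return a copy of the start string with an M placed at the codon's position."""
    pos = codon_index(codon)
    return starts[:pos] + "M" + starts[pos + 1:]
```

Under this layout ATG sits at position 35 (0-based), where both strings already have an M, and ACG sits at position 39, which is a T (threonine) in the amino acid string. Marking ACG as an additional start would therefore mean putting an M at the 40th character of the second string; it is worth confirming that against the BioPerl docs before editing the module.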
From felix.bemm at uni-wuerzburg.de  Mon Jul 27 07:46:44 2015
From: felix.bemm at uni-wuerzburg.de (Felix Bemm)
Date: Mon, 27 Jul 2015 15:46:44 +0200
Subject: [maker-devel] Annotation of 32Mb pseudochromosome
Message-ID: <55B63644.8020509@uni-wuerzburg.de>

Hi,

I am trying to annotate a finished genome (122Mb) and maker loops forever over one of its pseudochromosomes (32Mb). One of the child processes for that chromosome dies with

DIED RANK 13:4:0:83
DIED COUNT 1

The rb.out file contains: RepeatMasker quit because the file pseudochr_Chr1.82.arabidopsis.rb only contains ambiguous bases, if any.

The assembly contains about 1,997,703 N's. Could it be that maker accidentally creates a 200kb chunk that consists entirely of N's and then crashes during repeat annotation? What about setting max_dna_len to something like 25000000 and then trying again?

Cheers,
Felix

--
University of Würzburg, Department Bioinformatics
Group Evolutionary Computational Biology
Biocentre, 97074 Würzburg, Germany

From dence at genetics.utah.edu  Mon Jul 27 10:15:29 2015
From: dence at genetics.utah.edu (Daniel Ence)
Date: Mon, 27 Jul 2015 16:15:29 +0000
Subject: [maker-devel] Annotation of 32Mb pseudochromosome
In-Reply-To: <55B63644.8020509@uni-wuerzburg.de>
References: <55B63644.8020509@uni-wuerzburg.de>
Message-ID: <13B84042-B9A7-448F-87E3-7D8379F70C1E@genetics.utah.edu>

Hi Felix, I think that idea about the chunk sounds plausible, depending on how much of the sequence for that pseudochromosome is N's. Is there an error message besides the DIED for that pseudochromosome?
~Daniel

Daniel Ence
Graduate Student
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330

From carsonhh at gmail.com  Mon Jul 27 11:12:58 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 27 Jul 2015 11:12:58 -0600
Subject: [maker-devel] Annotation of 32Mb pseudochromosome
In-Reply-To: <13B84042-B9A7-448F-87E3-7D8379F70C1E@genetics.utah.edu>
References: <55B63644.8020509@uni-wuerzburg.de> <13B84042-B9A7-448F-87E3-7D8379F70C1E@genetics.utah.edu>
Message-ID: <5AB595A5-FDA9-4ED1-A21E-EE9F1D196E30@gmail.com>

You can also try installing RepeatMasker with rmblast as the default aligner or hmmer as the default. That will alter its behavior. Also make sure you are using blast+ version 2.2.28.
Do not use blast+ version 2.2.29.

--Carson
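The all-N chunk hypothesis raised in this thread can be tested before changing max_dna_len. What follows is a minimal Python sketch, not MAKER's own code. It assumes the sequence is cut into consecutive, non-overlapping windows of a fixed size, which is a simplification of however MAKER actually chunks a contig, and the function name all_n_chunks is made up for illustration.

```python
def all_n_chunks(seq, chunk_size):
    """Return (start, end) pairs for windows that contain no A, C, G or T.

    Windows are consecutive and non-overlapping; this is an assumed chunking
    scheme used only to illustrate the idea.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    hits = []
    for start in range(0, len(seq), chunk_size):
        chunk = seq[start:start + chunk_size]
        # A window counts as ambiguous-only if none of its bases is unambiguous.
        if not any(base in "ACGTacgt" for base in chunk):
            hits.append((start, start + len(chunk)))
    return hits
```

Run over the pseudochromosome sequence with the chunk size in use, a non-empty result would support the idea that some chunk handed to RepeatMasker is nothing but N's. Repeating the check with a larger chunk size shows whether raising max_dna_len would be expected to avoid such chunks.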
>> >> Cheers, >> Felix >> >> -- >> University of W?rzburg, Department Bioinformatics >> Group Evolutionary Computational Biology >> Biocentre, 97074 W?rzburg, Germany >> >> _______________________________________________ >> maker-devel mailing list >> maker-devel at box290.bluehost.com >> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org From cjfields at illinois.edu Thu Jul 30 10:27:45 2015 From: cjfields at illinois.edu (Fields, Christopher J) Date: Thu, 30 Jul 2015 16:27:45 +0000 Subject: [maker-devel] MAKER and RepeatModeler In-Reply-To: References: Message-ID: <40D3E41C-0915-4B42-BF9C-DD779F2D5D06@illinois.edu> Hi Shaun Ever get an answer on this one from the RepeatMasker folks? I?ve seen (and expect) non-deterministic results from a few tools but the results shouldn?t change *that* dramatically. chris On Jul 17, 2015, at 3:29 PM, Shaun Jackman > wrote: Hi, Carson. It seems that RepeatModeler is not deterministic. I run it fives times on the same sequence and get very different outputs. Two of these five runs mark atp8 as a repeat, which is why I have genes blinking in and out of existence. How do folk deal with this situation? It seems absurd. What?s the cause of the non-determinism? Random number generator? Threading? Can I get deterministic behaviour if I set the seed of the random number generator and use it single-threaded? I don?t see how I can implement a reproducible pipeline with the situation as it is. This has become a RepeatModeler question more than a MAKER question, but I thought I?d continue this thread that I?d started here. 
n n:1 L50 min N80 N50 N20 E-size max sum name 6 6 1 289 7667 12403 12403 9102 12403 24293 RepeatModeler1.fa 6 6 1 332 4023 14769 14769 10920 14769 21738 RepeatModeler2.fa 6 6 1 244 370 2731 2731 1765 2731 4688 RepeatModeler3.fa 10 10 1 354 2114 17134 17134 11354 17134 30782 RepeatModeler4.fa 8 8 3 538 1093 1750 2526 1706 2526 10713 RepeatModeler5.fa My command line is BuildDatabase -name x -engine ncbi x.fa RepeatModeler -database x cp -a RM_*/consensi.fa.classified RepeatModeler.fa I installed the following software using Homebrew on a Mac. repeatmodeler 1.0.8 recon 1.07 repeatmasker 4.0.5 repeatscout 1.0.5 rmblast 2.2.28 trf 4.07b Cheers, Shaun -- http://sjackman.ca/ On 2015-July-17 at 10:36:50 , Carson Holt (carsonhh at gmail.com) wrote: The subset is actually built of a built of a taxonomy. So you can extract all repeats for a species or genus for example. If a term doesn?t match the internal taxonomy, it throughs an error. ?Carson On Jul 17, 2015, at 11:24 AM, Carson Holt > wrote: Yes. It takes a the subset of RepBase. If runtime isn?t an issue and you really want to mask as much as possible, you can also set model_org=all. Most of whatever else is in RepBase probably won?t align anywhere, but it may give you marginally better sensitivity. ?Carson On Jul 17, 2015, at 11:20 AM, Shaun Jackman > wrote: Hi, Carson. I set model_org=picea. I see that it created a new data base in the RepeatModeler folder Libraries/20140131/picea/specieslib. What is the effect of the model_org option? Does it extract sequences from RepBase that match the string picea? Cheers, Shaun -- http://sjackman.ca/ On 2015-July-17 at 9:40:58 , Carson Holt (carsonhh at gmail.com) wrote: That is weird. One thought though. When you run MAKER do you supply both rmlib and model_org or just rmlib? If you are only supplying rmlib, you could try supplying both together (RepeatMasker will then run twice). That way some of the edge cases might better be identified. 
?Carson On Jul 16, 2015, at 5:25 PM, Shaun Jackman > wrote: Hi, Carson. I removed two small contaminant contigs (~7 kbp) from the assembly (~6 Mbp), and MAKER found four fewer genes, four copies of the same atp8 gene, but these genes were not in the contaminant contigs.I figured out that it?s because I?m running RepeatModeler to create the rmlib for MAKER. When I remove the contaminant contigs, RepeatModeler now identifies this gene atp8 as being a LTR/Gypsy repeat. Any thoughts on why removing two contigs would cause RepeatModeler to identify new repeats? Cheers, Shaun -- http://sjackman.ca/ _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.com http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.com http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From sjackman at gmail.com Thu Jul 30 12:11:48 2015 From: sjackman at gmail.com (Shaun Jackman) Date: Thu, 30 Jul 2015 11:11:48 -0700 Subject: [maker-devel] MAKER and RepeatModeler Message-ID: Hi, Chris. Yes, I did get a response from the RepeatModeler author, Robert Hubley (cc?ed). There?s no public mailing list, as far as I know, so it?s all in private communication. Yes, RepeatModeler is non-deterministic. I suggested that the random seed be added as a parameter to RepeatModeler, and Robert agreed. I?m still not sure why the results were so variable (between 5 kbp and 30 kbp annotated as repeats, see table far below). Perhaps it?s because my genome is much smaller (6 Mbp) than the size of the random sample (40 Mbp) that RepeatModeler uses. See immediately below. Robert? Cheers, Shaun RepeatModeler Round # 1 ======================== Searching for Repeats -- Sampling from the database... 
- Gathering up to 40000000 bp - Final Sample Size = 6001210 bp ( 5937815 non ambiguous ) - Num Contigs Represented = 38 --? http://sjackman.ca/ On 2015-July-30 at 9:28:27 , Fields, Christopher J (cjfields at illinois.edu) wrote: Hi Shaun Ever get an answer on this one from the RepeatMasker folks? ?I?ve seen (and expect) non-deterministic results from a few tools but the results shouldn?t change *that* dramatically. chris On Jul 17, 2015, at 3:29 PM, Shaun Jackman wrote: Hi, Carson. It seems that RepeatModeler is not deterministic. I run it fives times on the same sequence and get very different outputs. Two of these five runs mark?atp8?as a repeat, which is why I have genes blinking in and out of existence. How do folk deal with this situation? It seems absurd. What?s the cause of the non-determinism? Random number generator? Threading? Can I get deterministic behaviour if I set the seed of the random number generator and use it single-threaded? I don?t see how I can implement a reproducible pipeline with the situation as it is. This has become a RepeatModeler question more than a MAKER question, but I thought I?d continue this thread that I?d started here. n n:1 L50 min N80 N50 N20 E-size max sum name 6 6 1 289 7667 12403 12403 9102 12403 24293 RepeatModeler1.fa 6 6 1 332 4023 14769 14769 10920 14769 21738 RepeatModeler2.fa 6 6 1 244 370 2731 2731 1765 2731 4688 RepeatModeler3.fa 10 10 1 354 2114 17134 17134 11354 17134 30782 RepeatModeler4.fa 8 8 3 538 1093 1750 2526 1706 2526 10713 RepeatModeler5.fa My command line is BuildDatabase -name x -engine ncbi x.fa RepeatModeler -database x cp -a RM_*/consensi.fa.classified RepeatModeler.fa I installed the following software using Homebrew on a Mac. repeatmodeler 1.0.8 recon 1.07 repeatmasker 4.0.5 repeatscout 1.0.5 rmblast 2.2.28 trf 4.07b Cheers, Shaun --? http://sjackman.ca/ On 2015-July-17 at 10:36:50 , Carson Holt (carsonhh at gmail.com) wrote: The subset is actually built of a built of a taxonomy. 
So you can extract all repeats for a species or genus for example. If a term doesn?t match the internal taxonomy, it throughs an error. ?Carson On Jul 17, 2015, at 11:24 AM, Carson Holt wrote: Yes. It takes a the subset of RepBase. If runtime isn?t an issue and you really want to mask as much as possible, you can also set?model_org=all. ?Most of whatever else is in RepBase probably won?t align anywhere, but it may give you marginally better sensitivity. ?Carson On Jul 17, 2015, at 11:20 AM, Shaun Jackman wrote: Hi, Carson. I set?model_org=picea. I see that it created a new data base in the RepeatModeler folder?Libraries/20140131/picea/specieslib. What is the effect of the?model_org?option? Does it extract sequences from RepBase that match the string?picea? Cheers, Shaun --? http://sjackman.ca/ On 2015-July-17 at 9:40:58 , Carson Holt (carsonhh at gmail.com) wrote: That is weird. One thought though. ?When you run MAKER do you supply both rmlib and model_org or just rmlib? If you are only supplying rmlib, you could try supplying both together (RepeatMasker will then run twice). ?That way some of the edge cases might better be identified. ?Carson On Jul 16, 2015, at 5:25 PM, Shaun Jackman wrote: Hi, Carson. I removed two small contaminant contigs (~7 kbp) from the assembly (~6 Mbp), and MAKER found four fewer genes, four copies of the same?atp8?gene, but these genes were not in the contaminant contigs.I figured out that it?s because I?m running RepeatModeler to create the?rmlib?for MAKER. When I remove the contaminant contigs, RepeatModeler now identifies this gene?atp8?as being a LTR/Gypsy repeat. Any thoughts on why removing two contigs would cause RepeatModeler to identify new repeats? Cheers, Shaun --? 
http://sjackman.ca/

_______________________________________________
maker-devel mailing list
maker-devel at box290.bluehost.com
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From sjackman at gmail.com Thu Jul 30 17:21:53 2015
From: sjackman at gmail.com (Shaun Jackman)
Date: Thu, 30 Jul 2015 16:21:53 -0700
Subject: [maker-devel] Short introns
In-Reply-To: <6B98C302-063B-4BAF-8F6E-2EDE9D8CBDA3@gmail.com>
References: <8B3943E4-1D25-4C54-8A35-E64AC4A93689@gmail.com> <6B98C302-063B-4BAF-8F6E-2EDE9D8CBDA3@gmail.com>
Message-ID:

Hi, Carson.

Increasing $min_intron to 250 did the trick! Thanks for the tip. For this organellar genome, all the introns (at least that I've found) are group II self-splicing introns, so they have a much larger minimum size (about 990 bp) than introns in other genomes.

https://en.wikipedia.org/wiki/Group_II_intron

Increasing $min_intron worked perfectly for all but one gene. Decreasing --codongapopen and --codongapextend rescued that one gene, but I'll probably just leave the defaults and annotate that one gene by hand.

Note that $min_intron had no effect on protein2genome without the following patch. Will there be a stable release soon that includes this patch and the min_intron option? I'm preparing a manuscript for submission, and I'd love to be able to refer to a stable version of MAKER in the manuscript that includes this feature.

Have you considered moving the MAKER development to GitHub?

Thanks again. Cheers,
Shaun

diff --git a/protein.pm.orig b/protein.pm
--- a/protein.pm.orig
+++ b/protein.pm
@@ -94,11 +94,11 @@ sub runExonerate {
 	my $command = "$exe -q $q_file -t $t_file -Q protein -T dna ";
 	$command .= "-m protein2genome --softmasktarget ";
+	$command .= " --minintron $min_intron --maxintron $max_intron --showcigar";
 	$command .= " --percent $percent";
 	if ($matrix) {
 	    $command .= " --proteinsubmat $matrix";
 	}
-	$command .= " --showcigar ";
 	$command .= " > $o_file";

 	my $w = new Widget::exonerate::protein2genome();

--
http://sjackman.ca/

On 2015-July-16 at 13:10:13, Carson Holt (carsonhh at gmail.com) wrote:

I can add it to the development version.

--Carson

On Jul 16, 2015, at 1:11 PM, Shaun Jackman wrote:

Hi, Carson.

One of the ten questionable introns has a canonical GT-AG splice site and is 33 bp. The splice sites are GA-AG, GG-GG, GC-AG, GT-CG, GA-AT, GG-AA, GG-AG, AT-TT, GG-AT and GT-AG. The intron sizes are 33, 111, 84, 30, 219, 186, 51, 30, 45 and 33.

I was wrong about there being stop codons in the questionable introns. All ten are completely free of stop codons. Sorry for the confusion. I had extracted just the intron sequence and translated the first frame, but the intron was not aligned to a 3-nucleotide boundary.

I am convinced that these short introns are in fact genomic insertions and not introns. The root cause may be incorrectly annotated introns in the protein evidence, as you suggest.

I'll try tweaking the $min_intron parameter. Could this parameter be added to the maker_opts.ctl configuration file?

Thanks for your help, Carson. Cheers,
Shaun

--
http://sjackman.ca/

On 2015-July-16 at 10:55:32, Carson Holt (carsonhh at gmail.com) wrote:

Look at the region. If it's being suggested by the polished alignment, I somewhat doubt it's just an insertion because the polished alignment will have valid splice sites. It could be an insertion, but one that perfectly maps around canonical splice sites would be quite the coincidence (because exonerate shouldn't make big gaps to force the alignment). However, if it looks more like a forced mapping around non-canonical splice sites (which shouldn't actually produce protein2genome results) then I might support the idea that it's an insertion.

A 250 bp intron, or even a 100 bp one, doesn't really seem that short to me. The lower range seen in fungi for example (which have very short introns) can get close to about 20 bp. I guess it's possible that the protein evidence you are using contains an intron that isn't really there, which results in an intron in your job because of protein conservation (i.e. conserved codons contain the falsely used splice site).

The minimum intron given to exonerate for polishing is 20. It's hard coded, and you would have to manually edit it.

Line 1534 in maker/lib/GI.pm --> my $min_intron = 20;

--Carson

On Jul 15, 2015, at 6:36 PM, Shaun Jackman wrote:

Hi, Carson.

I'm using protein evidence and protein2genome alone without ab initio gene finders to annotate an organellar genome. MAKER annotates 16 introns. 6 introns look real (according to RNAweasel) and are all larger than 900 bp. The other 10 introns are all shorter than 250 bp and multiples of 3 bp. These short introns look like genomic insertions rather than introns to me. Is there a way to specify a minimum intron size to MAKER?

6 of these short introns do not contain a stop codon, and 4 do contain a stop codon. I suppose these 4 are pseudogenes.

Thanks,
Shaun

--
http://sjackman.ca/

_______________________________________________
maker-devel mailing list
maker-devel at box290.bluehost.com
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

-------------- next part --------------
An HTML attachment was scrubbed...
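
[Editor's sketch, not part of the original thread] The checks discussed in this thread (intron length a multiple of 3, canonical GT-AG splice site, length below the chosen minimum) can be reproduced with a few lines of Python. This is a minimal sketch under stated assumptions: the splice sites and sizes are copied from Shaun's message and are assumed to be listed in the same order, and the threshold of 250 is the value he reports using for $min_intron. The function name `summarize` is hypothetical and not part of MAKER.

```python
# Splice-site pairs and intron sizes as listed in the thread.
# Assumption: the two lists are given in the same order.
SPLICE_SITES = ["GA-AG", "GG-GG", "GC-AG", "GT-CG", "GA-AT",
                "GG-AA", "GG-AG", "AT-TT", "GG-AT", "GT-AG"]
SIZES = [33, 111, 84, 30, 219, 186, 51, 30, 45, 33]
MIN_INTRON = 250  # value reported for $min_intron in the thread


def summarize(splice_sites, sizes, min_intron):
    """Group introns by the three criteria discussed in the thread."""
    introns = list(zip(splice_sites, sizes))
    return {
        # Length is a multiple of 3, so removing it keeps the reading frame.
        "in_frame": [size for _, size in introns if size % 3 == 0],
        # Canonical GT-AG donor/acceptor pair.
        "canonical": [(ss, size) for ss, size in introns if ss == "GT-AG"],
        # Shorter than the minimum intron length.
        "below_min": [size for _, size in introns if size < min_intron],
    }


summary = summarize(SPLICE_SITES, SIZES, MIN_INTRON)
print(len(summary["in_frame"]), "of", len(SIZES), "are multiples of 3")
print(len(summary["canonical"]), "has a canonical GT-AG splice site")
print(len(summary["below_min"]), "are shorter than", MIN_INTRON, "bp")
```

All ten candidates are in-frame and below the threshold, and only one is canonical, which is consistent with the reading that they are insertions carried over from the protein evidence rather than real introns.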
URL: From arnstrm at gmail.com Wed Jul 1 08:42:53 2015 From: arnstrm at gmail.com (Arun Seetharam) Date: Wed, 1 Jul 2015 09:42:53 -0500 Subject: [maker-devel] gff3_merge error: FINISHED.gff does not exist Message-ID: Hi all, I am unable to generate the GFF file from the maker output. I originally ran maker with several processors (using mpi) as follows: /usr/lib64/mpich/bin/mpiexec -machinefile $PBS_NODEFILE -n 160 maker -base maker_R1 -fix_nucleotides Once completed, I tried to create GFF file with the gff3_merge script: $ gff3_merge -d maker_R1.maker.output/maker_R1_master_datastore_index.log ERROR: The file 'maker_R1.maker.output/./FINISHED/FINISHED.gff' does not exist I searched online to find this thread which asked to build the index again if you ran mpi enalbled maker. so I ran it as, $ maker -base maker_R1 -dsindex STATUS: Parsing control files... STATUS: Processing and indexing input FASTA files... STATUS: Setting up database for any GFF3 input... A data structure will be created for you at: /data003/TIL/maker_R1.maker.output/maker_R1_datastore To access files for individual sequences use the datastore index: /data003/TIL/maker_R1.maker.output/maker_R1_master_datastore_index.log When I tried again to run the gff3_merge: $ gff3_merge -d maker_R1.maker.output/maker_R1_master_datastore_index.log ERROR: The file 'maker_R1.maker.output/./FINISHED/FINISHED.gff' does not exist I still get the same error. Please tell me what am I doing wrong. My worst fear is that If I have to re-run it again, I will take another week or so. Both my genomes have the same issue. Any help will greatly be appreciated. Thanks, -- Arun Seetharam -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Wed Jul 1 11:11:44 2015 From: carsonhh at gmail.com (Carson Holt) Date: Wed, 1 Jul 2015 11:11:44 -0600 Subject: [maker-devel] gff3_merge error: FINISHED.gff does not exist In-Reply-To: References: Message-ID: Your datastore log is munged. 
Sometimes happen with an IO collision. Delete it and run maker on a single CPU using the -dsindex option. All it will do is rebuild the datastore log. Takes less than 5 minutes. ?Carson > On Jul 1, 2015, at 8:42 AM, Arun Seetharam wrote: > > Hi all, > > I am unable to generate the GFF file from the maker output. I originally ran maker with several processors (using mpi) as follows: > > /usr/lib64/mpich/bin/mpiexec -machinefile $PBS_NODEFILE -n 160 maker -base maker_R1 -fix_nucleotides > > Once completed, I tried to create GFF file with the gff3_merge script: > > $ gff3_merge -d maker_R1.maker.output/maker_R1_master_datastore_index.log > ERROR: The file 'maker_R1.maker.output/./FINISHED/FINISHED.gff' does not exist > > I searched online to find this thread which asked to build the index again if you ran mpi enalbled maker. so I ran it as, > > $ maker -base maker_R1 -dsindex > STATUS: Parsing control files... > STATUS: Processing and indexing input FASTA files... > STATUS: Setting up database for any GFF3 input... > A data structure will be created for you at: > /data003/TIL/maker_R1.maker.output/maker_R1_datastore > > To access files for individual sequences use the datastore index: > /data003/TIL/maker_R1.maker.output/maker_R1_master_datastore_index.log > > When I tried again to run the gff3_merge: > > $ gff3_merge -d maker_R1.maker.output/maker_R1_master_datastore_index.log > ERROR: The file 'maker_R1.maker.output/./FINISHED/FINISHED.gff' does not exist > > I still get the same error. Please tell me what am I doing wrong. My worst fear is that If I have to re-run it again, I will take another week or so. Both my genomes have the same issue. > > Any help will greatly be appreciated. 
> > Thanks, > > -- > Arun Seetharam > > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Wed Jul 1 11:16:15 2015 From: carsonhh at gmail.com (Carson Holt) Date: Wed, 1 Jul 2015 11:16:15 -0600 Subject: [maker-devel] Best way to assemble RNA-seq for MAKER In-Reply-To: References: Message-ID: I would pool them. You will get better coverage of low expression transcripts. While there may be differently spliced transcripts among the tissues, MAKER and all gene prediction programs used by MAKER by default are not going to try and work out alternate splicing anyways. You can tell it to (altsplice= option), but your EST evidence has to be near perfect end-to-end for that to work. ?Carson > On Jun 29, 2015, at 4:45 PM, John Cornelius wrote: > > Hi I have a quick question, I have RNA-seq from several different tissue types and I was wondering, would it be better to pool them and assemble them as one large transcriptome? Or, should I assemble each tissue separately and then use MAKER to integrate the smaller assemblies into the annotation? Thanks. > > -- > John Cornelius > MCB PhD Candidate > Arizona State University > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org From eric.ganko at syngenta.com Mon Jul 6 09:37:28 2015 From: eric.ganko at syngenta.com (Ganko Eric USRE) Date: Mon, 6 Jul 2015 15:37:28 +0000 Subject: [maker-devel] MAKER processing time in a 2Gb genome Message-ID: I'm hoping for some advice on an unexpectedly long process time for a 2Gb genome. 
Currently I'm using an install of MAKER-P on the iForge system @ NCSA and I've successfully run ~1Gb genomes in 2-3 hours across 20 nodes (24 Intel "Haswell" cores, 64 GB of RAM per node) via MPICH. I recently ran some tests on 50Mb of corn that took ~2 hours on 2 nodes (48 cores). Based on that I was surprised when the full 2Gb corn genome run timed out at >24h with 30 nodes (720 cores); in that time it hadn't processed many sequences based on the master_datastore_index.log : TOTAL: 25000 seqs STARTED: 3594 FINISHED: 2979 FAILED: 10 RETRY: 9 DIED_SKIPPED_PERMANENT: 0 SKIPPED_SMALL: 7635 While I can set a longer wall clock, these results are several times longer than what was reported in the MAKER-P paper, i.e. running the corn B73 genome in less than 4 hours; here it is not close to done after 24h. I don't have an enormous amount of supporting data- this trial run has ~100k transcripts and another ~100k proteins. Corn has a very high repeat content, so my suspicion is Repeatmasker IO. In discussions with the iForge admins I have discovered that the temp space is network attached (GPFS), and they've suggested using a RAM disk (i.e /dev/shm) as the temp directory. In tests on smaller sequence that ran a little slower so I'm not sure if MAKER is meant to run that way. I'd appreciate input on experience with a RAM disk approach, or if anyone has alternative thoughts or suggestions? Thanks, Eric ________________________________ This message may contain confidential information. If you are not the designated recipient, please notify the sender immediately, and delete the original and any copies. Any use of the message by you is prohibited. -------------- next part -------------- An HTML attachment was scrubbed... 
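
[Editor's sketch, not part of the original thread] Status counts like the ones quoted above can be tallied from the datastore index with a short script. This is a minimal sketch that assumes each line of master_datastore_index.log is tab-separated with the sequence ID, its datastore path, and a status in the third column; check that against your own log before relying on it. The function names and the sample lines are hypothetical.

```python
from collections import Counter


def tally_statuses(lines):
    """Count how often each status appears in the third tab-separated column."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts


def latest_status(lines):
    """Return the last status recorded for each sequence ID."""
    latest = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            latest[fields[0]] = fields[2]
    return latest


# Made-up example lines in the assumed layout.
example = [
    "contig1\tdatastore/AA/BB/contig1/\tSTARTED\n",
    "contig1\tdatastore/AA/BB/contig1/\tFINISHED\n",
    "contig2\tdatastore/CC/DD/contig2/\tSTARTED\n",
    "contig3\tdatastore/EE/FF/contig3/\tSKIPPED_SMALL\n",
]
print(dict(tally_statuses(example)))
print(latest_status(example))
```

With a real log, pass an open file handle instead of the example list. Sequences whose latest status is still STARTED are the ones that were in progress or interrupted when the job hit its wall clock.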
URL:

From jcornel3 at asu.edu Tue Jul 7 21:59:47 2015
From: jcornel3 at asu.edu (John Cornelius)
Date: Tue, 7 Jul 2015 20:59:47 -0700
Subject: [maker-devel] Ability to process transcriptomes from different assemblers
Message-ID:

Hello, I have several tissues that I have assembled with two different transcriptome assemblers and I was wondering, should I feed these different assemblies directly into MAKER? Or should I use something like evidence modeler to combine them first? Thanks.

--
John Cornelius
MCB PhD Candidate
Arizona State University

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From carsonhh at gmail.com Thu Jul 9 12:55:38 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 9 Jul 2015 12:55:38 -0600
Subject: [maker-devel] Ability to process transcriptomes from different assemblers
In-Reply-To:
References:
Message-ID: <4A01AD3F-2CD6-4252-9A72-A3FA1E835CFB@gmail.com>

You should be able to just supply both.

--Carson

> On Jul 7, 2015, at 9:59 PM, John Cornelius wrote:
>
> Hello, I have several tissues that I have assembled with two different transcriptome assemblers and I was wondering, should I feed these different assemblies directly into MAKER? Or should I use something like evidence modeler to combine them first? Thanks.
>
> --
> John Cornelius
> MCB PhD Candidate
> Arizona State University
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From carsonhh at gmail.com Thu Jul 9 13:01:13 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 9 Jul 2015 13:01:13 -0600
Subject: [maker-devel] MAKER processing time in a 2Gb genome
In-Reply-To:
References:
Message-ID: <8D11F342-8BD5-406E-B047-8CB0F2121CD3@gmail.com>

Runtimes are the result of gene density, evidence dataset size, and evidence dataset type. For example, protein data takes ~10 times longer to process than EST data, and alt-EST data takes ~10 times longer than protein data. If you double the size of the input datasets, then you double the runtime. Also, the assembly size doesn't seem to have a large effect on runtime. It tends to be gene density that has the most effect, so a 2Gb assembly runs only somewhat slower than a 300Mb assembly containing the same number of genes.

For best MPI performance, you can submit multiple jobs with 200 CPUs or less. Going over 200 CPUs per job tends to give limited throughput increases due to MPI communication overhead.

I never use RAM disk. In general MAKER produces too many temporary files to fit in RAM.

--Carson

> On Jul 6, 2015, at 9:37 AM, Ganko Eric USRE wrote:
>
> I'm hoping for some advice on an unexpectedly long process time for a 2Gb genome. Currently I'm using an install of MAKER-P on the iForge system @ NCSA and I've successfully run ~1Gb genomes in 2-3 hours across 20 nodes (24 Intel "Haswell" cores, 64 GB of RAM per node) via MPICH.
>
> I recently ran some tests on 50Mb of corn that took ~2 hours on 2 nodes (48 cores). Based on that I was surprised when the full 2Gb corn genome run timed out at >24h with 30 nodes (720 cores); in that time it hadn't processed many sequences based on the master_datastore_index.log :
>
> TOTAL: 25000 seqs
> STARTED: 3594
> FINISHED: 2979
> FAILED: 10
> RETRY: 9
> DIED_SKIPPED_PERMANENT: 0
> SKIPPED_SMALL: 7635
>
> While I can set a longer wall clock, these results are several times longer than what was reported in the MAKER-P paper, i.e. running the corn B73 genome in less than 4 hours; here it is not close to done after 24h. I don't have an enormous amount of supporting data; this trial run has ~100k transcripts and another ~100k proteins. Corn has a very high repeat content, so my suspicion is Repeatmasker IO.
In discussions with the iForge admins I have discovered that the temp space is network attached (GPFS), and they?ve suggested using a RAM disk (i.e /dev/shm) as the temp directory. In tests on smaller sequence that ran a little slower so I?m not sure if MAKER is meant to run that way. I?d appreciate input on experience with a RAM disk approach, or if anyone has alternative thoughts or suggestions? > > Thanks, > Eric > > > This message may contain confidential information. If you are not the designated recipient, please notify the sender immediately, and delete the original and any copies. Any use of the message by you is prohibited. _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From eric.ganko at syngenta.com Thu Jul 9 14:36:59 2015 From: eric.ganko at syngenta.com (Ganko Eric USRE) Date: Thu, 9 Jul 2015 20:36:59 +0000 Subject: [maker-devel] MAKER processing time in a 2Gb genome In-Reply-To: <8D11F342-8BD5-406E-B047-8CB0F2121CD3@gmail.com> References: <8D11F342-8BD5-406E-B047-8CB0F2121CD3@gmail.com> Message-ID: Tuesday I ran the same option files, this time with 480 cores, and the annotation completed in ~6 hours. Perhaps I?m trying too many simultaneous writes at higher levels, or there is too much MPI communication as you mentioned? Thanks for the input on the RAM disk. -eric From: Carson Holt [mailto:carsonhh at gmail.com] Sent: Thursday, July 09, 2015 3:01 PM To: Ganko Eric USRE Cc: maker-devel at yandell-lab.org Subject: Re: [maker-devel] MAKER processing time in a 2Gb genome Runtimes are the result of gene density, evidence dataset size, ans evidence dataset type. For example protein data takes ~10 times longer to process than EST data, and alt-EST data takes ~10 times longer than protein data. 
If you double the size of input datasets, then you double runtime. Also the assembly size doesn?t seem to have a large effect on runtime. It tends to be gene density that has the most effect, so a 2Gb assembly runs only somewhat slower than a 300Mb assembly containing the same number of genes. For best MPI performance, you can submit multiple jobs with 200 CPUs or less. Over 200 CPUs per job tens to get limited throughput increases due to MPI communication overhead. I never use RAM disk. In general MAKER produces too many temporary files to fit in RAM. ?Carson On Jul 6, 2015, at 9:37 AM, Ganko Eric USRE > wrote: I?m hoping for some advice on an unexpectedly long process time for a 2Gb genome. Currently I?m using an install of MAKER-P on the iForge system @ NCSA and I?ve successfully run ~1Gb genomes in 2-3 hours across 20 nodes (24 Intel "Haswell" cores, 64 GB of RAM per node) via MPICH. I recently ran some tests on 50Mb of corn that took ~2 hours on 2 nodes (48 cores). Based on that I was surprised when the full 2Gb corn genome run timed out at >24h with 30 nodes (720 cores); in that time it hadn?t processed many sequences based on the master_datastore_index.log : TOTAL: 25000 seqs STARTED: 3594 FINISHED: 2979 FAILED: 10 RETRY: 9 DIED_SKIPPED_PERMANENT: 0 SKIPPED_SMALL: 7635 While I can set a longer wall clock, these results are several times longer than what was reported in the MAKER-P paper, i.e. running the corn B73 genome in less than 4 hours; here it is not close to done after 24h. I don?t have an enormous amount of supporting data? this trial run has ~100k transcripts and another ~100k proteins. Corn has a very high repeat content, so my suspicion is Repeatmasker IO. In discussions with the iForge admins I have discovered that the temp space is network attached (GPFS), and they?ve suggested using a RAM disk (i.e /dev/shm) as the temp directory. In tests on smaller sequence that ran a little slower so I?m not sure if MAKER is meant to run that way. 
I?d appreciate input on experience with a RAM disk approach, or if anyone has alternative thoughts or suggestions? Thanks, Eric ________________________________ This message may contain confidential information. If you are not the designated recipient, please notify the sender immediately, and delete the original and any copies. Any use of the message by you is prohibited. -------------- next part -------------- An HTML attachment was scrubbed... URL: From panos.ioannidis at gmail.com Fri Jul 10 03:50:56 2015 From: panos.ioannidis at gmail.com (Panos Ioannidis) Date: Fri, 10 Jul 2015 11:50:56 +0200 Subject: [maker-devel] Repeats... Message-ID: Hi guys, I have finished running Maker on my genome, but get >800 genes (out of ~20,000) that have similarity to transposases. Except from RepBase, have also built a species-specific repeat library, so it's weird that I still have quite a few transposases in my gene set... The repeat masking-related parameters in my maker-opts.ctl file are: model_org=all #select a model organism for RepBase masking in RepeatMasker rmlib=consensi.fa.classified #provide an organism specific repeat library in fasta format for RepeatMasker repeat_protein=/Home/pioannid/Programs/maker/data/te_proteins.fasta #provide a fasta file of transposable element proteins for RepeatRunner rm_gff= #pre-identified repeat elements from an external GFF3 file prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering) Does anyone have an idea why I'm getting so many transposases? Thanks, Panos -------------- next part -------------- An HTML attachment was scrubbed... URL: From panos.ioannidis at gmail.com Fri Jul 10 05:45:49 2015 From: panos.ioannidis at gmail.com (Panos Ioannidis) Date: Fri, 10 Jul 2015 13:45:49 +0200 Subject: [maker-devel] Repeats... In-Reply-To: References: Message-ID: An additional question related to the previous. 
I searched my species-specific repeat library with InterProScan and can't find a single sequence with similarity to a transposable element... I would expect it to find at least a few transposases. Is there an explanation for this, or has something gone wrong? Thanks, P On Fri, Jul 10, 2015 at 11:50 AM, Panos Ioannidis wrote: > Hi guys, > > I have finished running Maker on my genome, but get >800 genes (out of > ~20,000) that have similarity to transposases. Except from RepBase, have > also built a species-specific repeat library, so it's weird that I still > have quite a few transposases in my gene set... > > The repeat masking-related parameters in my maker-opts.ctl file are: > > model_org=all #select a model organism for RepBase masking in RepeatMasker > rmlib=consensi.fa.classified #provide an organism specific repeat library > in fasta format for RepeatMasker > repeat_protein=/Home/pioannid/Programs/maker/data/te_proteins.fasta > #provide a fasta file of transposable element proteins for RepeatRunner > rm_gff= #pre-identified repeat elements from an external GFF3 file > prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change > this), 1 = yes, 0 = no > softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg > and dust filtering) > > Does anyone have an idea why I'm getting so many transposases? > > Thanks, > Panos > -------------- next part -------------- An HTML attachment was scrubbed... URL: From panos.ioannidis at gmail.com Mon Jul 13 03:45:08 2015 From: panos.ioannidis at gmail.com (Panos Ioannidis) Date: Mon, 13 Jul 2015 11:45:08 +0200 Subject: [maker-devel] Repeats... In-Reply-To: References: Message-ID: Hi Daniel, Thanks for the reply. No, I'm not using Genemark. I didn't check for overlap with RepeatMasker elements or transcript/protein evidence though. But since it is such an unexpected finding, I decided to do something simpler. 
So I took all 750 transposases with the same InterPro annotation (IS4 family transposases) and clustered them with CD-HIT (amino acid sequences). At 90% similarity threshold each transposase goes to its own cluster. At 80% I get 748 clusters... This means that even though these transposases belong to the same family, they have diverged quite a bit, so that they're no longer considered "repeat elements". And this explains why they were not filtered out by RepeatMasker and made it to the final gene set. On Fri, Jul 10, 2015 at 5:00 PM, Daniel Ence wrote: > Hi Panos, Without knowing how you made the species-specific repeat > library, I can't speak to why it's giving hits against repbase. As to the > 800 transposases, are they overlapped by repeat masker elements? Are they > supported by EST or protein evidence? Are you using Genemark? That > ab-initio predictor runs on the unmasked genome sequence, so if the > transposases are present in your evidence set, they could still show up as > gene models. > > ~Daniel > > Sent from my iPhone > > On Jul 10, 2015, at 5:45 AM, Panos Ioannidis > wrote: > > An additional question related to the previous. > > I searched my species-specific repeat library with InterProScan and can't > find a single sequence with similarity to a transposable element... > > I would expect it to find at least a few transposases. Is there an > explanation for this, or has something gone wrong? > > Thanks, > P > > > On Fri, Jul 10, 2015 at 11:50 AM, Panos Ioannidis < > panos.ioannidis at gmail.com> wrote: > >> Hi guys, >> >> I have finished running Maker on my genome, but get >800 genes (out of >> ~20,000) that have similarity to transposases. Except from RepBase, have >> also built a species-specific repeat library, so it's weird that I still >> have quite a few transposases in my gene set... 
>> >> The repeat masking-related parameters in my maker-opts.ctl file are: >> >> model_org=all #select a model organism for RepBase masking in RepeatMasker >> rmlib=consensi.fa.classified #provide an organism specific repeat library >> in fasta format for RepeatMasker >> repeat_protein=/Home/pioannid/Programs/maker/data/te_proteins.fasta >> #provide a fasta file of transposable element proteins for RepeatRunner >> rm_gff= #pre-identified repeat elements from an external GFF3 file >> prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change >> this), 1 = yes, 0 = no >> softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg >> and dust filtering) >> >> Does anyone have an idea why I'm getting so many transposases? >> >> Thanks, >> Panos >> > > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sjackman at gmail.com Wed Jul 15 18:36:22 2015 From: sjackman at gmail.com (Shaun Jackman) Date: Wed, 15 Jul 2015 17:36:22 -0700 Subject: [maker-devel] Short introns Message-ID: Hi, Carson. I?m using protein evidence and protein2genome alone without ab initio gene finders to annotate an organellar genome. MAKER annotates 16 introns. 6 introns look real (according to RNAweasel) and are all larger than 900 bp. The other 10 introns are all shorter than 250 bp and multiples of 3 bp. These short introns look like genomic insertions rather than introns to me. Is there a way to specify a minimum intron size to MAKER? 6 of these short introns do not contain a stop codon, and 4 do contain a stop codon. I suppose these 4 are pseudogenes. Thanks, Shaun --? http://sjackman.ca/ -------------- next part -------------- An HTML attachment was scrubbed... 
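
[Editor's sketch, not part of the original thread] The CD-HIT check described in this thread (how many clusters the transposase proteins fall into at a given identity threshold) can be read off the .clstr output with a short script. This is a minimal sketch assuming the usual .clstr layout, where each cluster starts with a line beginning ">Cluster" followed by one line per member. The sample text and sequence names are made up for illustration.

```python
def cluster_sizes(clstr_lines):
    """Return the number of member sequences in each CD-HIT cluster."""
    sizes = []
    for line in clstr_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            sizes.append(0)   # a new cluster begins
        elif sizes:
            sizes[-1] += 1    # a member line of the current cluster
    return sizes


# Made-up example in the assumed .clstr layout.
example = """>Cluster 0
0	412aa, >gene_0001... *
1	398aa, >gene_0417... at 83.20%
>Cluster 1
0	405aa, >gene_0002... *
>Cluster 2
0	377aa, >gene_0003... *
""".splitlines()

sizes = cluster_sizes(example)
singletons = sum(1 for n in sizes if n == 1)
print(len(sizes), "clusters,", singletons, "singletons,", sum(sizes), "sequences")
```

If nearly every sequence ends up as a singleton, as reported above (748 clusters from 750 proteins at 80%), the copies have diverged too far for a consensus-based repeat library to capture them as one family.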
From carsonhh at gmail.com Thu Jul 16 11:55:27 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 16 Jul 2015 11:55:27 -0600
Subject: [maker-devel] Short introns
Message-ID: <8B3943E4-1D25-4C54-8A35-E64AC4A93689@gmail.com>

Look at the region. If it's being suggested by the polished alignment, I somewhat doubt it's just an insertion, because the polished alignment will have valid splice sites. It could be an insertion, but one that perfectly maps around canonical splice sites would be quite the coincidence (because exonerate shouldn't make big gaps to force the alignment). However, if it looks more like a forced mapping around non-canonical splice sites (which shouldn't actually produce protein2genome results), then I might support the idea that it's an insertion.

A 250 bp intron or even a 100 bp one doesn't really seem that short to me. The lower range seen in fungi, for example (which have very short introns), can get close to about 20 bp. I guess it's possible that the protein evidence you are using contains an intron that isn't really there, which results in an intron in your job because of protein conservation (i.e. conserved codons contain the falsely used splice site).

The minimum intron given to exonerate for polishing is 20. It's hard coded, and you would have to manually edit it.

Line 1534 in maker/lib/GI.pm --> my $min_intron = 20;

--Carson

> On Jul 15, 2015, at 6:36 PM, Shaun Jackman wrote:
>
> Hi, Carson.
>
> I'm using protein evidence and protein2genome alone without ab initio gene finders to annotate an organellar genome. MAKER annotates 16 introns. 6 introns look real (according to RNAweasel) and are all larger than 900 bp. The other 10 introns are all shorter than 250 bp and multiples of 3 bp. These short introns look like genomic insertions rather than introns to me. Is there a way to specify a minimum intron size to MAKER?
>
> 6 of these short introns do not contain a stop codon, and 4 do contain a stop codon. I suppose these 4 are pseudogenes.
>
> Thanks,
> Shaun
>
> --
> http://sjackman.ca/
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From sjackman at gmail.com Thu Jul 16 13:11:32 2015
From: sjackman at gmail.com (Shaun Jackman)
Date: Thu, 16 Jul 2015 12:11:32 -0700
Subject: [maker-devel] Short introns
In-Reply-To: <8B3943E4-1D25-4C54-8A35-E64AC4A93689@gmail.com>
References: <8B3943E4-1D25-4C54-8A35-E64AC4A93689@gmail.com>

Hi, Carson.

One of the ten questionable introns has a canonical GT-AG splice site and is 33 bp. The splice sites are GA-AG, GG-GG, GC-AG, GT-CG, GA-AT, GG-AA, GG-AG, AT-TT, GG-AT and GT-AG. The intron sizes are 33, 111, 84, 30, 219, 186, 51, 30, 45 and 33.

I was wrong about there being stop codons in the questionable introns. All ten are completely free of stop codons. Sorry for the confusion. I had extracted just the intron sequence and translated the first frame, but the intron was not aligned to a 3-nucleotide boundary.

I am convinced that these short introns are in fact genomic insertions and not introns. The root cause may be incorrectly annotated introns in the protein evidence, as you suggest.

I'll try tweaking the $min_intron parameter. Could this parameter be added to the maker_opts.ctl configuration file?

Thanks for your help, Carson. Cheers,
Shaun

--
http://sjackman.ca/

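As a rough illustration of the checks discussed in this exchange, the Python sketch below tallies the splice sites and intron sizes quoted in Shaun's message and shows why a stop-codon search depends on the reading frame. It assumes the two lists are given in the same order (the message does not say so explicitly) and that the standard stop codons TAA, TAG and TGA apply; the function names and the toy sequence are made up for the example and are not part of MAKER.

```python
# Splice-site pairs and intron lengths (bp) as listed in the message above.
# Assumption: entry i of each list describes the same intron.
splice_sites = ["GA-AG", "GG-GG", "GC-AG", "GT-CG", "GA-AT",
                "GG-AA", "GG-AG", "AT-TT", "GG-AT", "GT-AG"]
intron_sizes = [33, 111, 84, 30, 219, 186, 51, 30, 45, 33]

CANONICAL = "GT-AG"
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def summarize(sites, sizes):
    """Return (canonical introns, whether every length is a multiple of 3)."""
    canonical = [(site, size) for site, size in zip(sites, sizes)
                 if site == CANONICAL]
    all_multiples_of_3 = all(size % 3 == 0 for size in sizes)
    return canonical, all_multiples_of_3


def stops_in_frame(seq, offset):
    """List stop codons read from `offset` (0, 1 or 2) in steps of three."""
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(offset, len(seq) - 2, 3))
    return [codon for codon in codons if codon in STOP_CODONS]


canonical, all_multiples_of_3 = summarize(splice_sites, intron_sizes)
print("canonical GT-AG introns:", canonical)
print("all lengths are multiples of 3:", all_multiples_of_3)

# Hypothetical toy sequence: the same bases show a stop codon in one frame only.
toy = "ATAAGG"
print("frame 0 stops:", stops_in_frame(toy, 0))
print("frame 1 stops:", stops_in_frame(toy, 1))
```

The second function mirrors the mistake described above: translating an extracted intron from its first base reports stop codons that vanish once the offset matches the surrounding coding frame.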
From carsonhh at gmail.com Thu Jul 16 14:10:09 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 16 Jul 2015 14:10:09 -0600
Subject: [maker-devel] Short introns
References: <8B3943E4-1D25-4C54-8A35-E64AC4A93689@gmail.com>
Message-ID: <6B98C302-063B-4BAF-8F6E-2EDE9D8CBDA3@gmail.com>

I can add it to the development version.

--Carson

From sjackman at gmail.com Thu Jul 16 17:25:27 2015
From: sjackman at gmail.com (Shaun Jackman)
Date: Thu, 16 Jul 2015 16:25:27 -0700
Subject: [maker-devel] MAKER and RepeatModeler

Hi, Carson.

I removed two small contaminant contigs (~7 kbp) from the assembly (~6 Mbp), and MAKER found four fewer genes, four copies of the same atp8 gene, but these genes were not in the contaminant contigs.

I figured out that it's because I'm running RepeatModeler to create the rmlib for MAKER. When I remove the contaminant contigs, RepeatModeler now identifies this gene atp8 as being a LTR/Gypsy repeat.

Any thoughts on why removing two contigs would cause RepeatModeler to identify new repeats?

Cheers,
Shaun

--
http://sjackman.ca/

From carsonhh at gmail.com Fri Jul 17 10:40:53 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 17 Jul 2015 10:40:53 -0600
Subject: [maker-devel] MAKER and RepeatModeler

That is weird.

One thought though. When you run MAKER, do you supply both rmlib and model_org or just rmlib? If you are only supplying rmlib, you could try supplying both together (RepeatMasker will then run twice). That way some of the edge cases might be better identified.

--Carson

From sjackman at gmail.com Fri Jul 17 11:20:54 2015
From: sjackman at gmail.com (Shaun Jackman)
Date: Fri, 17 Jul 2015 10:20:54 -0700
Subject: [maker-devel] MAKER and RepeatModeler

Hi, Carson.

I set model_org=picea. I see that it created a new database in the RepeatModeler folder Libraries/20140131/picea/specieslib. What is the effect of the model_org option? Does it extract sequences from RepBase that match the string picea?

Cheers,
Shaun

--
http://sjackman.ca/

From carsonhh at gmail.com Fri Jul 17 11:24:20 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 17 Jul 2015 11:24:20 -0600
Subject: [maker-devel] MAKER and RepeatModeler

Yes. It takes a subset of RepBase. If runtime isn't an issue and you really want to mask as much as possible, you can also set model_org=all. Most of whatever else is in RepBase probably won't align anywhere, but it may give you marginally better sensitivity.

--Carson

From carsonhh at gmail.com Fri Jul 17 11:36:46 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 17 Jul 2015 11:36:46 -0600
Subject: [maker-devel] MAKER and RepeatModeler

The subset is actually built from a taxonomy. So you can extract all repeats for a species or genus, for example. If a term doesn't match the internal taxonomy, it throws an error.

--Carson

From sjackman at gmail.com Fri Jul 17 14:29:49 2015
From: sjackman at gmail.com (Shaun Jackman)
Date: Fri, 17 Jul 2015 13:29:49 -0700
Subject: [maker-devel] MAKER and RepeatModeler

Hi, Carson.

It seems that RepeatModeler is not deterministic. I ran it five times on the same sequence and got very different outputs. Two of these five runs mark atp8 as a repeat, which is why I have genes blinking in and out of existence.

How do folk deal with this situation? It seems absurd. What's the cause of the non-determinism? Random number generator? Threading? Can I get deterministic behaviour if I set the seed of the random number generator and use it single-threaded? I don't see how I can implement a reproducible pipeline with the situation as it is.

This has become a RepeatModeler question more than a MAKER question, but I thought I'd continue this thread that I'd started here.

n   n:1  L50  min  N80   N50    N20    E-size  max    sum    name
6   6    1    289  7667  12403  12403  9102    12403  24293  RepeatModeler1.fa
6   6    1    332  4023  14769  14769  10920   14769  21738  RepeatModeler2.fa
6   6    1    244  370   2731   2731   1765    2731   4688   RepeatModeler3.fa
10  10   1    354  2114  17134  17134  11354   17134  30782  RepeatModeler4.fa
8   8    3    538  1093  1750   2526   1706    2526   10713  RepeatModeler5.fa

My command line is

BuildDatabase -name x -engine ncbi x.fa
RepeatModeler -database x
cp -a RM_*/consensi.fa.classified RepeatModeler.fa

I installed the following software using Homebrew on a Mac.

repeatmodeler 1.0.8
recon 1.07
repeatmasker 4.0.5
repeatscout 1.0.5
rmblast 2.2.28
trf 4.07b

Cheers,
Shaun

--
http://sjackman.ca/

From sjackman at gmail.com Tue Jul 21 13:10:11 2015
From: sjackman at gmail.com (Shaun Jackman)
Date: Tue, 21 Jul 2015 12:10:11 -0700
Subject: [maker-devel] Cryptic ACG start codon

Hi, Carson.

I'm working with a plant mitochondrial genome that has a lot of C to U RNA editing. One effect of this editing is that AUG start codons can be created by editing ACG to AUG. Does MAKER have any particular support for cryptic start codons?

I'm using protein evidence (protein and protein2genome), and a number of the protein sequences that I downloaded from GenBank start with a - character, which indicates an ACG start codon. It would be fantastic if - were allowed to match either ACG or ATG in the genome.

Cheers,
Shaun

--
http://sjackman.ca/

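To make the ACG question concrete, here is a minimal Python sketch of where ATG and ACG fall in an NCBI-format codon table, in which the 64 codons are listed with each base cycling in the order T, C, A, G. To my understanding this is also the layout used by the table strings in Bio::Tools::CodonTable, but that is an assumption worth checking against the BioPerl documentation. The helper names are invented for the example; nothing here is MAKER code.

```python
# NCBI-format codon tables order the 64 codons with each base cycling T, C, A, G.
BASES = "TCAG"

# Amino acids of the standard genetic code (NCBI translation table 1) in that order.
STANDARD_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


def codon_index(codon):
    """Return the 0-based position of a DNA codon in a 64-character table string."""
    first, second, third = (BASES.index(base) for base in codon.upper())
    return 16 * first + 4 * second + third


def starts_string(start_codons):
    """Build a 64-character string with 'M' at each start codon and '-' elsewhere."""
    marks = ["-"] * 64
    for codon in start_codons:
        marks[codon_index(codon)] = "M"
    return "".join(marks)


for codon in ("ATG", "ACG"):
    position = codon_index(codon)
    print(codon, position, STANDARD_AAS[position])

# A starts string that accepts both the ordinary ATG and the cryptic ACG start.
print(starts_string(["ATG", "ACG"]))
```

Under these assumptions ATG sits at position 35 (Met) and ACG at position 39 (Thr), so a start-codon mask that tolerates the unedited ACG would carry an M at both positions.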
From carsonhh at gmail.com Tue Jul 21 16:28:09 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 21 Jul 2015 16:28:09 -0600
Subject: [maker-devel] Cryptic ACG start codon

MAKER uses the is_start_codon method from Bio::Tools::CodonTable to determine if a codon is a valid start codon. Right now I don't have a way to swap out the codon table.

There is a way to do it, but it's not easy. If you edit .../maker/lib/CGL/TranslationMachine.pm line 122, you can set the table id to be another one from the BioPerl docs --> http://doc.bioperl.org/releases/bioperl-1.6.1/Bio/Tools/CodonTable.html#BEGIN1

Or you can manually add your own codon table. It won't change the codon usage for aligners like BLAST and Exonerate, but it will allow you to specify another valid start codon. To do that, edit line 118 to add your own manual codon table by adding another 'M' below the position you want to make into a valid start codon.

my $id = $self->add_table(
    'Strict',
    'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG',
    '-----------------------------------M----------------------------');
$self->id($id);

I don't really know which string position goes with which three-letter nucleotide code. You might have to reverse engineer that from the BioPerl docs in the link above.

--Carson

From felix.bemm at uni-wuerzburg.de Mon Jul 27 07:46:44 2015
From: felix.bemm at uni-wuerzburg.de (Felix Bemm)
Date: Mon, 27 Jul 2015 15:46:44 +0200
Subject: [maker-devel] Annotation of 32Mb pseudochromosome
Message-ID: <55B63644.8020509@uni-wuerzburg.de>

Hi,

I am trying to annotate a finished genome (122 Mb) and MAKER loops forever over one of its pseudochromosomes (32 Mb). One of the child processes for that chromosome dies with

DIED RANK 13:4:0:83
DIED COUNT 1

The rb.out file contains: RepeatMasker quit because the file pseudochr_Chr1.82.arabidopsis.rb only contains ambiguous bases, if any.

The assembly contains about 1,997,703 N's. Could it be that MAKER accidentally creates a 200 kb chunk that consists entirely of N's and then crashes during repeat annotation? What about setting max_dna_len to something like 25000000 and then trying again?

Cheers,
Felix

--
University of Würzburg, Department Bioinformatics
Group Evolutionary Computational Biology
Biocentre, 97074 Würzburg, Germany

From dence at genetics.utah.edu Mon Jul 27 10:15:29 2015
From: dence at genetics.utah.edu (Daniel Ence)
Date: Mon, 27 Jul 2015 16:15:29 +0000
Subject: [maker-devel] Annotation of 32Mb pseudochromosome
In-Reply-To: <55B63644.8020509@uni-wuerzburg.de>
References: <55B63644.8020509@uni-wuerzburg.de>
Message-ID: <13B84042-B9A7-448F-87E3-7D8379F70C1E@genetics.utah.edu>

Hi Felix, I think that idea about the chunk sounds plausible, depending on how much of the sequence for that pseudochromosome is N's. Is there an error message besides the DIED for that pseudochromosome?

~Daniel

Daniel Ence
Graduate Student
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330

From carsonhh at gmail.com Mon Jul 27 11:12:58 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 27 Jul 2015 11:12:58 -0600
Subject: [maker-devel] Annotation of 32Mb pseudochromosome
In-Reply-To: <13B84042-B9A7-448F-87E3-7D8379F70C1E@genetics.utah.edu>
References: <55B63644.8020509@uni-wuerzburg.de> <13B84042-B9A7-448F-87E3-7D8379F70C1E@genetics.utah.edu>
Message-ID: <5AB595A5-FDA9-4ED1-A21E-EE9F1D196E30@gmail.com>

You can also try installing RepeatMasker with rmblast as the default aligner or hmmer as the default. That will alter its behavior. Also make sure you are using blast+ version 2.2.28. Do not use blast+ version 2.2.29.

--Carson

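One way to test the hypothesis raised in this thread, that a chunk handed to RepeatMasker consists only of N's, is to scan the pseudochromosome in fixed-size windows. The Python sketch below is only an approximation: the 200 kb figure comes from Felix's message, but how MAKER actually places chunk boundaries is not described here, so non-overlapping windows starting at position 0 are an assumption, and the function name is made up for the example.

```python
def all_n_windows(sequence, window=200_000):
    """Yield (start, end) for each fixed-size window made up only of N/n.

    Assumption: windows are non-overlapping and start at position 0, which
    may not match how MAKER really splits a contig into chunks.
    """
    for start in range(0, len(sequence), window):
        chunk = sequence[start:start + window]
        if chunk and set(chunk.upper()) <= {"N"}:
            yield start, start + len(chunk)


# Toy example with a tiny window; a real run would first read the
# pseudochromosome sequence from the genome FASTA file.
toy = "ACGT" * 2 + "N" * 8 + "ACGTNNNN"
hits = list(all_n_windows(toy, window=8))
print(hits)
```

If a scan like this reports a window on the affected pseudochromosome, that would be consistent with the "only contains ambiguous bases" message from RepeatMasker, though it would not by itself explain why MAKER keeps retrying.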
From cjfields at illinois.edu Thu Jul 30 10:27:45 2015
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Thu, 30 Jul 2015 16:27:45 +0000
Subject: [maker-devel] MAKER and RepeatModeler
Message-ID: <40D3E41C-0915-4B42-BF9C-DD779F2D5D06@illinois.edu>

Hi Shaun

Ever get an answer on this one from the RepeatMasker folks? I've seen (and expect) non-deterministic results from a few tools, but the results shouldn't change *that* dramatically.

chris

From sjackman at gmail.com Thu Jul 30 12:11:48 2015
From: sjackman at gmail.com (Shaun Jackman)
Date: Thu, 30 Jul 2015 11:11:48 -0700
Subject: [maker-devel] MAKER and RepeatModeler

Hi, Chris.

Yes, I did get a response from the RepeatModeler author, Robert Hubley (cc'ed). There's no public mailing list, as far as I know, so it's all in private communication. Yes, RepeatModeler is non-deterministic. I suggested that the random seed be added as a parameter to RepeatModeler, and Robert agreed.

I'm still not sure why the results were so variable (between 5 kbp and 30 kbp annotated as repeats; see the table in my message of Jul 17). Perhaps it's because my genome is much smaller (6 Mbp) than the size of the random sample (40 Mbp) that RepeatModeler uses. See immediately below. Robert?

Cheers,
Shaun

RepeatModeler Round # 1
========================
Searching for Repeats
 -- Sampling from the database...
   - Gathering up to 40000000 bp
   - Final Sample Size = 6001210 bp ( 5937815 non ambiguous )
   - Num Contigs Represented = 38

--
http://sjackman.ca/

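The variability described in this thread can be put into numbers with a short Python sketch. It uses only figures quoted in the messages: the "sum" column of the Jul 17 table (taken here to be the total size in bp of each run's repeat library, which is an interpretation rather than something stated outright) and the two sample sizes from the RepeatModeler log. The variable names are invented for the example.

```python
# "sum" column (bp) for the five RepeatModeler runs in the Jul 17 table.
library_bp = {
    "RepeatModeler1.fa": 24293,
    "RepeatModeler2.fa": 21738,
    "RepeatModeler3.fa": 4688,
    "RepeatModeler4.fa": 30782,
    "RepeatModeler5.fa": 10713,
}

smallest = min(library_bp, key=library_bp.get)
largest = max(library_bp, key=library_bp.get)
fold_difference = library_bp[largest] / library_bp[smallest]

# Figures from the log excerpt: requested sample size versus what was gathered.
sample_target_bp = 40_000_000
final_sample_bp = 6_001_210
fraction_of_target = final_sample_bp / sample_target_bp

print(f"smallest library: {smallest} ({library_bp[smallest]} bp)")
print(f"largest library:  {largest} ({library_bp[largest]} bp)")
print(f"fold difference:  {fold_difference:.1f}")
print(f"sample gathered / sample requested: {fraction_of_target:.2f}")
```

The spread between the smallest and largest runs matches the "between 5 kbp and 30 kbp" range mentioned above, and the sample ratio shows that the whole 6 Mbp assembly fits well inside the 40 Mbp sampling target. Whether that is what drives the run-to-run differences is left open in the thread.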
So you can extract all repeats for a species or genus, for example. If a term doesn't match the internal taxonomy, it throws an error.

--Carson

On Jul 17, 2015, at 11:24 AM, Carson Holt wrote:

Yes. It takes a subset of RepBase. If runtime isn't an issue and you really want to mask as much as possible, you can also set model_org=all. Most of whatever else is in RepBase probably won't align anywhere, but it may give you marginally better sensitivity.

--Carson

On Jul 17, 2015, at 11:20 AM, Shaun Jackman wrote:

Hi, Carson.

I set model_org=picea. I see that it created a new database in the RepeatModeler folder Libraries/20140131/picea/specieslib. What is the effect of the model_org option? Does it extract sequences from RepBase that match the string picea?

Cheers,
Shaun
--
http://sjackman.ca/

On 2015-July-17 at 9:40:58, Carson Holt (carsonhh at gmail.com) wrote:

That is weird. One thought though. When you run MAKER, do you supply both rmlib and model_org or just rmlib? If you are only supplying rmlib, you could try supplying both together (RepeatMasker will then run twice). That way some of the edge cases might be better identified.

--Carson

On Jul 16, 2015, at 5:25 PM, Shaun Jackman wrote:

Hi, Carson.

I removed two small contaminant contigs (~7 kbp) from the assembly (~6 Mbp), and MAKER found four fewer genes, four copies of the same atp8 gene, but these genes were not in the contaminant contigs. I figured out that it's because I'm running RepeatModeler to create the rmlib for MAKER. When I remove the contaminant contigs, RepeatModeler now identifies this gene atp8 as being an LTR/Gypsy repeat. Any thoughts on why removing two contigs would cause RepeatModeler to identify new repeats?

Cheers,
Shaun
--
http://sjackman.ca/

From sjackman at gmail.com Thu Jul 30 17:21:53 2015
From: sjackman at gmail.com (Shaun Jackman)
Date: Thu, 30 Jul 2015 16:21:53 -0700
Subject: [maker-devel] Short introns
In-Reply-To: <6B98C302-063B-4BAF-8F6E-2EDE9D8CBDA3@gmail.com>
References: <8B3943E4-1D25-4C54-8A35-E64AC4A93689@gmail.com> <6B98C302-063B-4BAF-8F6E-2EDE9D8CBDA3@gmail.com>
Message-ID:

Hi, Carson.

Increasing $min_intron to 250 did the trick! Thanks for the tip. For this organellar genome, all the introns (at least that I've found) are group II self-splicing introns, so they have a much larger minimum size (about 990 bp) than introns in other genomes.

https://en.wikipedia.org/wiki/Group_II_intron

Increasing $min_intron worked perfectly for all but one gene. Decreasing --codongapopen and --codongapextend rescued that one gene, but I'll probably just leave the defaults and annotate that one gene by hand.

Note that $min_intron had no effect on protein2genome without the following patch. Will there be a stable release soon that includes this patch and the min_intron option? I'm preparing a manuscript for submission, and I'd love to be able to refer to a stable version of MAKER in the manuscript that includes this feature.

Have you considered moving the MAKER development to GitHub?

Thanks again.
Cheers,
Shaun

diff --git a/protein.pm.orig b/protein.pm
--- a/protein.pm.orig
+++ b/protein.pm
@@ -94,11 +94,11 @@ sub runExonerate {
 
     my $command = "$exe -q $q_file -t $t_file -Q protein -T dna ";
     $command .= "-m protein2genome --softmasktarget ";
+    $command .= " --minintron $min_intron --maxintron $max_intron --showcigar";
     $command .= " --percent $percent";
     if ($matrix) {
         $command .= " --proteinsubmat $matrix";
     }
-    $command .= " --showcigar ";
     $command .= " > $o_file";
 
     my $w = new Widget::exonerate::protein2genome();

--
http://sjackman.ca/

On 2015-July-16 at 13:10:13, Carson Holt (carsonhh at gmail.com) wrote:

I can add it to the development version.

--Carson

On Jul 16, 2015, at 1:11 PM, Shaun Jackman wrote:

Hi, Carson.

One of the ten questionable introns has a canonical GT-AG splice site and is 33 bp.

The splice sites are GA-AG, GG-GG, GC-AG, GT-CG, GA-AT, GG-AA, GG-AG, AT-TT, GG-AT and GT-AG.
The intron sizes are 33, 111, 84, 30, 219, 186, 51, 30, 45 and 33.

I was wrong about there being stop codons in the questionable introns. All ten are completely free of stop codons. Sorry for the confusion. I had extracted just the intron sequence and translated the first frame, but the intron was not aligned to a 3-nucleotide boundary.

I am convinced that these short introns are in fact genomic insertions and not introns. The root cause may be incorrectly annotated introns in the protein evidence, as you suggest. I'll try tweaking the $min_intron parameter. Could this parameter be added to the maker_opts.ctl configuration file?

Thanks for your help, Carson.

Cheers,
Shaun
--
http://sjackman.ca/

On 2015-July-16 at 10:55:32, Carson Holt (carsonhh at gmail.com) wrote:

Look at the region. If it's being suggested by the polished alignment, I somewhat doubt it's just an insertion because the polished alignment will have valid splice sites.
It could be an insertion, but one that perfectly maps around canonical splice sites would be quite the coincidence (because exonerate shouldn't make big gaps to force the alignment). However, if it looks more like a forced mapping around non-canonical splice sites (which shouldn't actually produce protein2genome results), then I might support the idea that it's an insertion.

A 250 bp intron, or even a 100 bp one, doesn't really seem that short to me. The lower range seen in fungi, for example (which have very short introns), can get close to about 20 bp. I guess it's possible that the protein evidence you are using contains an intron that isn't really there, that results in an intron in your job because of protein conservation (i.e. conserved codons contain the falsely used splice site).

The minimum intron given to exonerate for polishing is 20. It's hard coded, and you would have to manually edit it.

Line 1534 in maker/lib/GI.pm --> my $min_intron = 20;

--Carson

On Jul 15, 2015, at 6:36 PM, Shaun Jackman wrote:

Hi, Carson.

I'm using protein evidence and protein2genome alone without ab initio gene finders to annotate an organellar genome. MAKER annotates 16 introns. 6 introns look real (according to RNAweasel) and are all larger than 900 bp. The other 10 introns are all shorter than 250 bp and multiples of 3 bp. These short introns look like genomic insertions rather than introns to me. Is there a way to specify a minimum intron size to MAKER?

6 of these short introns do not contain a stop codon, and 4 do contain a stop codon. I suppose these 4 are pseudogenes.

Thanks,
Shaun
--
http://sjackman.ca/
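
The checks discussed in this thread (intron shorter than a cutoff, length a multiple of 3 bp, boundaries other than the canonical GT-AG) can be sketched in Python as below. This is only an illustration, not anything MAKER or exonerate does: the `Intron` type, the `looks_like_insertion` helper and its 250 bp default are hypothetical names chosen for the example, and pairing the reported splice sites with the reported sizes by position is an assumption. The ten values themselves are the ones quoted in the thread.

```python
from typing import NamedTuple


class Intron(NamedTuple):
    splice_site: str  # donor-acceptor dinucleotides, e.g. "GT-AG"
    length: int       # intron length in bp


def looks_like_insertion(intron: Intron, min_intron: int = 250) -> bool:
    """Hypothetical heuristic: flag short, in-frame, non-GT-AG introns."""
    is_short = intron.length < min_intron
    in_frame = intron.length % 3 == 0  # length is a multiple of 3 bp
    canonical = intron.splice_site == "GT-AG"
    return is_short and in_frame and not canonical


# Values reported in the thread; pairing them by position is an assumption.
SPLICE_SITES = ["GA-AG", "GG-GG", "GC-AG", "GT-CG", "GA-AT",
                "GG-AA", "GG-AG", "AT-TT", "GG-AT", "GT-AG"]
SIZES = [33, 111, 84, 30, 219, 186, 51, 30, 45, 33]
INTRONS = [Intron(site, size) for site, size in zip(SPLICE_SITES, SIZES)]

if __name__ == "__main__":
    for intron in INTRONS:
        verdict = ("likely insertion" if looks_like_insertion(intron)
                   else "check by hand")
        print(f"{intron.splice_site}\t{intron.length}\t{verdict}")
```

Only GT-AG is treated as canonical in this sketch, so a GC-AG boundary is flagged as well; an intron that passes the heuristic is left for manual review rather than accepted automatically.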