From sjackman at gmail.com Fri May 1 14:34:10 2015
From: sjackman at gmail.com (Shaun Jackman)
Date: Fri, 01 May 2015 19:34:10 +0000
Subject: [maker-devel] Other GFF not passed through

Hi, Carson.

I'm using other_gff to pass the following four-record GFF file of rRNA annotations through to the final GFF file. The rRNA records appear in the output of gff3_merge -s -n, but not when I add the -g option. Any thoughts why they've gone missing? These rRNA features almost certainly overlap repeat rRNA features discovered by RepeatMasker. Would that cause it? I'm happy to send along whatever data files you'd find useful to reproduce the issue.

Cheers,
Shaun

##gff-version 3
17 barrnap:0.5 rRNA 130358 131272 . - . Name=16S_rRNA;product=16S ribosomal RNA (partial);note=aligned only 57 percent of the 16S ribosomal RNA
18 barrnap:0.5 rRNA 6238 7489 . - . Name=12S_rRNA;product=12S ribosomal RNA
5 barrnap:0.5 rRNA 7 1246 . - . Name=12S_rRNA;product=12S ribosomal RNA
6 barrnap:0.5 rRNA 449193 450443 . + . Name=12S_rRNA;product=12S ribosomal RNA

From carsonhh at gmail.com Fri May 1 15:22:57 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 1 May 2015 14:22:57 -0600
Subject: [maker-devel] Other GFF not passed through
Message-ID: <9BC05C91-8960-404F-9C9C-C17BCD7C844F@gmail.com>

gff3_merge is expecting to work with maker output, and the -g option specifically looks for maker produced genes (maker source tag). Since you added these lines using the other_gff option, they are in the file, but it doesn't necessarily mean downstream maker tools will know what to do with them because maker added them blindly without attempting any interpretation/validation, etc. I purposely don't try and make these tools support any GFF3 input possible, it just gets too hairy.
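Records passed through other_gff keep their own source tag in column 2 of the GFF3 (here barrnap:0.5), while MAKER's gene models carry the maker source tag. A minimal Python sketch, not part of MAKER, for tallying features by source tag to see which records carry which tag; the function name count_sources and the sample maker line are illustrative assumptions:

```python
from collections import Counter


def count_sources(gff_lines):
    """Tally GFF3 feature lines by their source tag (column 2)."""
    counts = Counter()
    for line in gff_lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # embedded sequence follows; no more feature lines
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.split("\t")
        if len(cols) >= 9:
            counts[cols[1]] += 1
    return counts


sample = [
    "##gff-version 3",
    "18\tbarrnap:0.5\trRNA\t6238\t7489\t.\t-\t.\tName=12S_rRNA;product=12S ribosomal RNA",
    "5\tbarrnap:0.5\trRNA\t7\t1246\t.\t-\t.\tName=12S_rRNA;product=12S ribosomal RNA",
    # Hypothetical MAKER gene line, made up for illustration only.
    "5\tmaker\tgene\t2000\t3000\t.\t+\t.\tID=example-gene-0.1;Name=example-gene-0.1",
]

if __name__ == "__main__":
    for source, n in count_sources(sample).items():
        print(source, n)
```

On a real file, an open file handle can be passed in place of the sample list.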
What you can do, though, is grep the features you want out separately into another GFF3 file and then use gff3_merge to combine those two files.

    grep -P '\tbarrnap:0.5\t' infile.gff > barnap.gff
    gff3_merge -s maker.gff barnap.gff > new.gff

Thanks,
Carson

From carsonhh at gmail.com Fri May 1 15:54:40 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 1 May 2015 14:54:40 -0600
Subject: [maker-devel] Why would MAKER generate a large gff3?
Message-ID: <1111BFE1-EE7D-4754-BD03-24F5EE66CB1F@gmail.com>

> On May 1, 2015, at 2:53 PM, Carson Holt wrote:
>
> There is probably something weird about the way you are running it. For example, did you give it the raw RNA-seq reads instead of the assembled reads?
>
> The total size of the final GFF3 will be all genes + all evidence alignments + the assembly fasta (concatenated at the end per GFF3 format specifications). You can remove the fasta or the evidence alignments from the file using the options found in gff3_merge.
>
> --Carson
>
>> From: John Cornelius
>> Subject: Why would MAKER generate a large gff3?
>> Date: May 1, 2015 at 2:46:45 PM MDT
>> To: maker-devel at yandell-lab.org
>>
>> Hello, I'm using MAKER to generate a new annotation for an organism without an officially published genome (but it does exist and I'm using it). The current annotation is primarily predictive and I'm adding RNA-Seq evidence to improve it. However, after the initial run and the following two runs with SNAP, the gff3 file generated is 24 GB in size while the old annotation file is only 82 MB. Should it be that large? Also, what is the best way to analyze a new annotation to figure out if it is actually in decent shape? Thanks.
>>
>> --
>> John Cornelius
>> MCB PhD Candidate
>> Arizona State University
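The size breakdown described in this thread (gene models + evidence alignments + appended assembly FASTA) can be checked directly on a merged GFF3. A minimal Python sketch, assuming the standard GFF3 layout in which the sequence section starts at a ##FASTA directive; the function name and the sample lines are illustrative and are not real MAKER output:

```python
from collections import Counter


def gff3_size_breakdown(lines):
    """Return bytes used per source tag, plus the embedded FASTA section."""
    sizes = Counter()
    in_fasta = False
    for line in lines:
        nbytes = len(line.encode("utf-8"))
        if in_fasta:
            sizes["FASTA"] += nbytes
        elif line.startswith("##FASTA"):
            in_fasta = True
            sizes["FASTA"] += nbytes
        elif line.startswith("#") or not line.strip():
            sizes["directives/comments"] += nbytes
        else:
            cols = line.split("\t")
            source = cols[1] if len(cols) > 1 else "malformed"
            sizes[source] += nbytes
    return sizes


# Made-up lines for illustration only.
sample = [
    "##gff-version 3\n",
    "chr1\tmaker\tgene\t1\t900\t.\t+\t.\tID=g1\n",
    "chr1\test2genome\texpressed_sequence_match\t1\t400\t50\t+\t.\tID=e1\n",
    "chr1\test2genome\texpressed_sequence_match\t500\t900\t50\t+\t.\tID=e2\n",
    "##FASTA\n",
    ">chr1\n",
    "ACGTACGTACGT\n",
]

if __name__ == "__main__":
    for key, nbytes in gff3_size_breakdown(sample).most_common():
        print(f"{key}\t{nbytes}")
```

On a real file, an open file handle can be passed instead of the sample list. A dominant share from evidence-alignment sources would point toward the evidence input, while a dominant FASTA share is simply the assembly itself.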
From muriel.grosb at gmail.com Mon May 4 07:37:26 2015
From: muriel.grosb at gmail.com (Muriel Gros-Balthazard)
Date: Mon, 4 May 2015 14:37:26 +0200
Subject: [maker-devel] Missing files for some contains
Message-ID: <33606C49-1CC8-4909-953F-8B72D54843B9@yahoo.fr>

Hello,

After running Maker, I have many directories since I have many contigs.

Each directory contains these files:
.gff
.maker.augustus.transcripts.fasta
.maker.augustus_masked.proteins.fasta
.maker.augustus.proteins.fasta
.maker.augustus_masked.transcripts.fasta
.maker.transcripts.fasta
.maker.trnascan.transcripts.fasta
.maker.proteins.fasta
.maker.non_overlapping_ab_initio.transcripts.fasta
.maker.non_overlapping_ab_initio.proteins.fasta
run.log
and the directory theVoid.

However, for some contigs, one or several files are missing.
More than 50% of the contig directories are missing the trnascan file.
Some are also missing the two "non_overlapping" files.
Some are missing even more.

However, for all the contigs with missing files, I have the gff file and the log file says FINISHED.
So, my questions are:
- Why are those files missing?
- Is it problematic? Does it mean something didn't work well?
- Should I rerun Maker on these contigs?

Thank you!

Muriel

From carsonhh at gmail.com Mon May 4 09:14:36 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 4 May 2015 08:14:36 -0600
Subject: [maker-devel] Missing files for some contains
In-Reply-To: <33606C49-1CC8-4909-953F-8B72D54843B9@yahoo.fr>
References: <33606C49-1CC8-4909-953F-8B72D54843B9@yahoo.fr>

If there are no tRNAscan results for the contig then there will be no trnascan fasta. The same is true for each of the other feature types.

--Carson
From solschech at gmail.com Tue May 5 02:41:18 2015
From: solschech at gmail.com (Sunny Sun)
Date: Tue, 5 May 2015 09:41:18 +0200
Subject: [maker-devel] Fwd: all scaffolds failed

Hi,

I am trying to annotate with Maker a set of 7k scaffolds with a genome size of 160 Mb. The first run returned 10% of scaffolds FAILED and the remaining FINISHED, but I didn't get the protein or transcript fasta files, so I modified the configuration files accordingly and I am rerunning the analysis.
So far, all the scaffolds are failing. In the error.log the only error I see, for all scaffolds, is of this type:

error in file format, /tmp/I16VH9JkXG/Q3Pi3DDTOZ_mod
ERROR: GeneMark Failed
ERROR: Genemark failed
--> rank=NA, hostname=sol
ERROR: Failed while preparing ab-inits
ERROR: Chunk failed at level:0, tier_type:2
FAILED CONTIG:scaffold_6

ERROR: Chunk failed at level:4, tier_type:0
FAILED CONTIG:scaffold_6

examining contents of the fasta file and run log

which doesn't tell me much. I attached the config files. Can someone see what is wrong?

Thanks

[Attachments scrubbed by the archive: maker_bopts.ctl, maker_exe.ctl, maker_opts.ctl]

From michael.s.campbell1 at gmail.com Tue May 5 10:47:39 2015
From: michael.s.campbell1 at gmail.com (Michael Campbell)
Date: Tue, 5 May 2015 09:47:39 -0600
Subject: [maker-devel] Fwd: all scaffolds failed

It looks like you are giving the training file from SNAP to GeneMark, and that would cause a failure. Try running it with just SNAP and Augustus and see if that fixes the problem.

Thanks,
Mike
--
Michael Campbell MS, RD.
Doctoral Candidate
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330
ph:585-3543

From craig_coleman at byu.edu Fri May 8 16:54:48 2015
From: craig_coleman at byu.edu (Craig Coleman)
Date: Fri, 8 May 2015 21:54:48 +0000
Subject: [maker-devel] creating fasta ids
Message-ID: <1fa61ad0ddbd4401a580d9158921f946@MB10.byu.local>

Hi,

I have a fungal genome that I successfully ran through Maker. Rather than running GeneMark directly in Maker, I ran it separately and generated a gff file. I provided this gff file to Maker and ran SNAP and Augustus as well. The GeneMark gff3 file contained unique IDs for genes that are carried over into the protein and transcript fasta files. These IDs are included in the Maker gff file as the name given for the mRNA feature, but Maker creates a different ID for the GeneMark-generated features in the gff file. When I run maker_map_ids to create a mapping file of gene and transcript IDs, the program uses the Maker-generated ID instead of the GeneMark-generated name.
Then when I run map_fasta_ids, the Maker IDs in the map file do not match the names of the GeneMark proteins and transcripts in the fasta file. Protein and transcript models generated by SNAP and Augustus map just fine. I am hoping someone has a suggestion on how to solve this problem.

Craig Coleman

From carsonhh at gmail.com Mon May 11 12:41:10 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 11 May 2015 11:41:10 -0600
Subject: [maker-devel] creating fasta ids
In-Reply-To: <1fa61ad0ddbd4401a580d9158921f946@MB10.byu.local>
References: <1fa61ad0ddbd4401a580d9158921f946@MB10.byu.local>
Message-ID: <37C15F61-1571-4963-B429-CEA424D0E320@gmail.com>

Pass in the GeneMark results as pred_gff if you were using model_gff. By using model_gff you are essentially telling MAKER to protect certain information in the Name tags, as well as to keep all models from that file regardless of evidence support. Because of the way you ran things, when you move to the maker_map_ids step it builds new names off of the ID= portion, because Name= has a specific protected meaning in GFF3 format. So if your Name= and ID= tags are not identical, the script knows they are user-supplied values and cannot assume they are alterable.

--Carson
From craig_coleman at byu.edu Tue May 12 14:15:59 2015
From: craig_coleman at byu.edu (Craig Coleman)
Date: Tue, 12 May 2015 19:15:59 +0000
Subject: [maker-devel] creating fasta ids
In-Reply-To: <37C15F61-1571-4963-B429-CEA424D0E320@gmail.com>
References: <1fa61ad0ddbd4401a580d9158921f946@MB10.byu.local> <37C15F61-1571-4963-B429-CEA424D0E320@gmail.com>
Message-ID: <71683e53ac2a481895c7a50925197223@MB10.byu.local>

Thank you. Worked perfectly.

Craig
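The Name=/ID= distinction described in this thread can be inspected before running maker_map_ids. A minimal Python sketch, not part of MAKER, that lists mRNA features whose Name= attribute differs from their ID=; the function names and the sample records are illustrative assumptions:

```python
def parse_attributes(field):
    """Split a GFF3 column-9 string into a dict of tag -> value."""
    attrs = {}
    for pair in field.strip().split(";"):
        if "=" in pair:
            tag, value = pair.split("=", 1)
            attrs[tag] = value
    return attrs


def name_id_mismatches(gff_lines, feature_type="mRNA"):
    """Yield (ID, Name) for features whose Name differs from their ID."""
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # sequence section: no more features
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != feature_type:
            continue
        attrs = parse_attributes(cols[8])
        if "ID" in attrs and "Name" in attrs and attrs["ID"] != attrs["Name"]:
            yield attrs["ID"], attrs["Name"]


# Made-up records for illustration; real MAKER IDs will look different.
sample = [
    "scaf1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=model-A-mRNA-1;Parent=model-A;Name=gm_0001\n",
    "scaf1\tmaker\tmRNA\t2000\t2900\t.\t+\t.\tID=model-B-mRNA-1;Parent=model-B;Name=model-B-mRNA-1\n",
]

if __name__ == "__main__":
    for feature_id, name in name_id_mismatches(sample):
        print(feature_id, name)
```

Features reported by this check are the ones where the Name was supplied from outside rather than mirrored from the ID.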
From chenwenbo1020 at gmail.com Tue May 12 16:56:46 2015
From: chenwenbo1020 at gmail.com (陈文博)
Date: Tue, 12 May 2015 17:56:46 -0400
Subject: [maker-devel] why no prediction

Hi guys,

I have come across a weird case where one region in the genome has evidence, but no gene prediction was generated. Here are the details.

[image]

Color key:
pink: Augustus
light green: SNAP
dark pink: pred_gff
light yellow: cufflinks
darkest pink: EST alignment
dark yellow: protein alignment

In the region marked by the red frame, there are predictions from Augustus and pred_gff, and also evidence from cufflinks. Why is no gene model generated?
I could find the gene model in the "XXXX.all.maker.non_overlapping_ab_initio.transcripts.fasta" file. It is weird because it did have evidence support.

Does anyone know the reason?

Thanks very much!

Best,
Wenbo

[Attachment scrubbed by the archive: image.png]

From carsonhh at gmail.com Tue May 12 17:16:33 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 12 May 2015 16:16:33 -0600
Subject: [maker-devel] why no prediction
Message-ID: <39947BAA-6172-4318-85C2-4E9288B5C700@gmail.com>

The structure of the evidence appears to suggest a spurious prediction randomly overlapped by a spurious EST alignment. You would need at least protein evidence overlap to make it believable. There is heavy discordance among the gene predictors. Also, the fact that the gene would be 90% plus UTR if the EST does in fact represent true expression is a big factor. More likely it's a pseudogene or semi-repetitive region. Not making this a gene was the right call.

--Carson
From carsonhh at gmail.com Tue May 12 17:18:53 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 12 May 2015 16:18:53 -0600
Subject: [maker-devel] why no prediction
In-Reply-To: <39947BAA-6172-4318-85C2-4E9288B5C700@gmail.com>
References: <39947BAA-6172-4318-85C2-4E9288B5C700@gmail.com>
Message-ID: <97DE16F1-4CD0-4E71-BC89-49CFFBDE8CA7@gmail.com>

Also, protein evidence will only be considered as support if it is in the same reading frame as the ab initio prediction. Complete mismatch of reading frames usually suggests a repeat-like region.

--Carson
From myandell at genetics.utah.edu Tue May 12 19:31:33 2015
From: myandell at genetics.utah.edu (Mark Yandell)
Date: Wed, 13 May 2015 00:31:33 +0000
Subject: [maker-devel] why no prediction
In-Reply-To: <97DE16F1-4CD0-4E71-BC89-49CFFBDE8CA7@gmail.com>
References: <39947BAA-6172-4318-85C2-4E9288B5C700@gmail.com> <97DE16F1-4CD0-4E71-BC89-49CFFBDE8CA7@gmail.com>

And finally, check the splice sites for the EST splice: are they valid GT/AG or AT/AC?
From chenwenbo1020 at gmail.com Tue May 12 20:06:42 2015
From: chenwenbo1020 at gmail.com (陈文博)
Date: Tue, 12 May 2015 21:06:42 -0400
Subject: [maker-devel] why no prediction
References: <39947BAA-6172-4318-85C2-4E9288B5C700@gmail.com> <97DE16F1-4CD0-4E71-BC89-49CFFBDE8CA7@gmail.com>

Thank you for the help.

I double-checked the "EST alignment", and sorry, the darkest pink is the transcript assembled with Trinity, not an EST. The splice sites are GT/AG. The cufflinks and Trinity results suggest that this region could be transcribed. There is an intact ORF in the prediction given by Augustus. Maybe this region is a real gene, but it was not predicted by Maker.

Thanks,
Wenbo
From carsonhh at gmail.com Tue May 12 21:09:56 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 12 May 2015 20:09:56 -0600
Subject: [maker-devel] why no prediction
References: <39947BAA-6172-4318-85C2-4E9288B5C700@gmail.com> <97DE16F1-4CD0-4E71-BC89-49CFFBDE8CA7@gmail.com>
Message-ID: <0E75FA8B-61F3-4FC2-9B77-7860EA2B14E2@gmail.com>

Hi Wenbo,

You will actually get more gene calls from gene predictors than there are genes (often orders of magnitude more) because workable ORFs are common in a genome. So having a single-exon ORF predicted is not really that noteworthy. You can expect those kinds of predictions to outnumber the true gene count by as much as 10 to 1 in some genomes.

The problem with the region you are showing is that it doesn't look like a gene. Even without a more detailed look at the coordinates and evidence overlap, the image lacks the concordance between evidence and predictions that would be expected in a genic region. Without some form of additional evidence like a good protein match, it is just too much like the many spurious overlap regions that you would expect to find randomly throughout a genome. Given this, there is just not enough support to promote the region to a gene. The predictions are still there in the output for reference purposes, but will not be promoted to genes because the evidence support is insufficient.
Looking at this region, there are no good gene predictions from SNAP, Augustus, or your pred_gff either (poor concordance). The heavy discordance among the different gene predictors suggests they have not been sufficiently trained. One thing that can affect evidence alignment and gene predictor performance is insufficient masking of repeat elements. You may need to spend some time building a species-specific repeat database using tools like RepeatModeler. Another issue that will have an effect is stretches of N's in the sequence. You will get poor evidence alignments and predictions in what appears to be a large contig if there isn't enough continuous usable sequence.

I mention all these factors because the region in question looks spurious and unordered. Lack of concordance in clustering patterns generally means there are other structural issues with the dataset being used.

I've attached an image below to give an example. Notice how in regions with genes the different evidence types build on each other and have remarkable concordance (SNAP and Augustus choose very similar exon patterns, for example). Regions without genes still have aligned evidence from Trinity-assembled mRNA-seq and ab initio gene predictors, but they are not concordant, are more spurious in nature, and can be found on both strands. Simple overlap is insufficient to generate a gene call. You have to consider the totality of evidence.

Thanks,
Carson
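The splice-site check raised in this thread (GT/AG or AT/AC intron boundaries) can be sketched in Python. This is an illustrative helper, not a MAKER function; it assumes 1-based inclusive intron coordinates given on the forward strand of the scaffold, and the function names and the toy sequence are made up:

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")
VALID_PAIRS = {("GT", "AG"), ("AT", "AC")}  # the two pairs named in the thread


def reverse_complement(seq):
    """Reverse-complement a DNA string (A/C/G/T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def splice_pair(scaffold_seq, intron_start, intron_end, strand="+"):
    """Return the (donor, acceptor) dinucleotides of an intron.

    Coordinates are 1-based and inclusive, as in GFF3.
    """
    intron = scaffold_seq[intron_start - 1:intron_end]
    if strand == "-":
        intron = reverse_complement(intron)
    intron = intron.upper()
    return intron[:2], intron[-2:]


def has_valid_splice_sites(scaffold_seq, intron_start, intron_end, strand="+"):
    """True if the intron boundaries are one of the VALID_PAIRS."""
    pair = splice_pair(scaffold_seq, intron_start, intron_end, strand)
    return pair in VALID_PAIRS


# Toy scaffold: exon "ATGGCC", intron "GTAAGTTTTCAG", exon "GGATAA".
toy = "ATGGCC" + "GTAAGTTTTCAG" + "GGATAA"

if __name__ == "__main__":
    print(splice_pair(toy, 7, 18))
    print(has_valid_splice_sites(toy, 7, 18))
```

Intron coordinates can be derived from the gaps between consecutive aligned blocks of a transcript alignment in the GFF3. Other accepted boundary pairs could be added to VALID_PAIRS if they are considered valid for the organism.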
> >>> > >>> Best, > >>> Wenbo > >>> _______________________________________________ > >>> maker-devel mailing list > >>> maker-devel at box290.bluehost.com > >>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > >> > > > > > > _______________________________________________ > > maker-devel mailing list > > maker-devel at box290.bluehost.com > > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: PastedGraphic-1.png Type: image/png Size: 51295 bytes Desc: not available URL: From julian.egger at omahazoo.com Thu May 14 13:17:50 2015 From: julian.egger at omahazoo.com (Julian Egger) Date: Thu, 14 May 2015 18:17:50 +0000 Subject: [maker-devel] Non-redundant Reference Human EST Data Message-ID: <5DA2ECDF8921564C9F789B4DB8E4FE35A687@post.omahazoo.org> We have assembled scaffolds from genomic reads of a primate sample and would like to annotate as many genes as possible with MAKER. Where is the best place to find an EST file to use with MAKER containing all of the non-redundant reference humans ESTs? Was trying to look around NCBI, Ensembl, and UCSC, but not sure what the ftp site, subdirectory, and file name would be for something like that. Thanks -------------- next part -------------- An HTML attachment was scrubbed... URL: From bertbrutzel at googlemail.com Fri May 15 02:08:42 2015 From: bertbrutzel at googlemail.com (Bert Brutzel) Date: Fri, 15 May 2015 09:08:42 +0200 Subject: [maker-devel] Genbank submission Message-ID: <55559B7A.2080906@gmail.com> Dear All, how to I reformat MAKER output to valid .tbl files for Genbank submission? I already tried quite some routes of formating the .gff to a .tbl (e.g. sed+awk+GAG....) but I simply run into to many problems? I as well tried to load the data into a chado, but this took over two weeks and exited with errors. 
Maybe someone who has already submitted their MAKER-annotated genome to GenBank can help me...PLEASE.

Thank you very much,
Bert

From robert.king at rothamsted.ac.uk Fri May 15 10:12:34 2015
From: robert.king at rothamsted.ac.uk (Robert King)
Date: Fri, 15 May 2015 15:12:34 +0000
Subject: [maker-devel] Genbank submission
In-Reply-To: <429bc931-e703-411d-ba07-7994c62bc6e9@ROTHEX1.rothamsted.ac.uk>
References: <429bc931-e703-411d-ba07-7994c62bc6e9@ROTHEX1.rothamsted.ac.uk>
Message-ID: <136AB40E0C34CF4FB9AE0DD8C22A8D7B7F484D@rothex1.rothamsted.ac.uk>

Get the GFF file from MAKER and the FASTA file ready. I then edit in Geneious and export as EMBL format, but we pay for Geneious, so you may not have it. If you have your final GFF file by whatever means, you can use seqret to convert it too.

https://www.biostars.org/p/72220/

I have not submitted to NCBI because I submit to ENA, and they have a special header for EMBL format, which means I have to edit the file before submitting to them.

Rob

From carsonhh at gmail.com Fri May 15 10:30:47 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 15 May 2015 09:30:47 -0600
Subject: [maker-devel] Non-redundant Reference Human EST Data
In-Reply-To: <5DA2ECDF8921564C9F789B4DB8E4FE35A687@post.omahazoo.org>
References: <5DA2ECDF8921564C9F789B4DB8E4FE35A687@post.omahazoo.org>
Message-ID: <39827E89-0C94-4CC9-B04A-A6FB94E26B36@gmail.com>

Hi Julian,

Using human ESTs on primate contigs would be very inefficient. The human genome is already annotated, so you should instead use either the human proteome or human transcripts as input. Using ESTs from a species other than the one being annotated should only be done if there is not a curated annotation set to use instead.

You may be able to just give the human transcripts to the est= option if the two organisms have not diverged too much in nucleotide sequence. Don't use the alt_est option, since you have human protein annotations. The alt_est option uses tblastx to seed the alignments, which will not be as accurate as the protein= option that seeds via blastx, and it is about 10 times more expensive computationally. So it will take a lot longer and you won't get anything that you couldn't have found using the protein data instead.

Also, scaffolds shorter than about 10 kb will likely be too short to annotate, so you can test out your parameters on just a few of the largest contigs. In addition, don't use SNAP because it performs poorly on primate genomes (use Augustus instead).
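
A rough sketch of the "test on a few of the largest contigs" suggestion above, in Python. It is only an illustration and not part of MAKER: the function names, the default of five contigs, and the example file name are assumptions, and the 10 kb cutoff is just the approximate figure mentioned above.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def largest_contigs(lines: Iterable[str], n: int = 5,
                    min_len: int = 10_000) -> List[Record]:
    """Return up to n records of at least min_len bases, longest first."""
    kept = [rec for rec in read_fasta(lines) if len(rec[1]) >= min_len]
    kept.sort(key=lambda rec: len(rec[1]), reverse=True)
    return kept[:n]


def format_fasta(records: Iterable[Record], width: int = 60) -> str:
    """Render records as FASTA text with sequence lines of the given width."""
    out: List[str] = []
    for header, seq in records:
        out.append(">" + header)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + ("\n" if out else "")
```

For example, `format_fasta(largest_contigs(open("scaffolds.fa")))` would produce FASTA text for a small test assembly (scaffolds.fa is a placeholder name) that could be written to a file and used as the genome input for a trial run before annotating everything.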

Thanks,
Carson

From carsonhh at gmail.com Fri May 15 10:48:20 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 15 May 2015 09:48:20 -0600
Subject: [maker-devel] Genbank submission
In-Reply-To: <55559B7A.2080906@gmail.com>
References: <55559B7A.2080906@gmail.com>
Message-ID: <1248BA7E-FDC5-4CE3-815C-8D6B9B78A33E@gmail.com>

Here is an archived thread on this that might be useful as well -->

https://groups.google.com/forum/#!searchin/maker-devel/genbank/maker-devel/qypkypBXVjs/aJACj38DpxMJ

--Carson

From cjfields at illinois.edu Sat May 16 10:51:49 2015
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Sat, 16 May 2015 15:51:49 +0000
Subject: [maker-devel] Genbank submission
In-Reply-To: <1248BA7E-FDC5-4CE3-815C-8D6B9B78A33E@gmail.com>
References: <55559B7A.2080906@gmail.com> <1248BA7E-FDC5-4CE3-815C-8D6B9B78A33E@gmail.com>
Message-ID: <3A43AEA3-A63A-4AC4-81A3-DA165F048722@illinois.edu>

We've been using GAG (mentioned in this thread), though with some fiddling. I have heard that ENA has a much easier submission process.

chris

From smg283 at gmail.com Sat May 16 23:11:21 2015
From: smg283 at gmail.com (Scott Geib)
Date: Sat, 16 May 2015 18:11:21 -1000
Subject: [maker-devel] maker-devel Digest, Vol 84, Issue 8
In-Reply-To:
References:
Message-ID:

If anyone has bugs or suggestions for GAG, let us know and we can modify it. Right now we are fixing some bugs and applying it to a new dataset, so it is a good time to add anything people might find useful. Email me or Brian (bhall7 at hawaii.edu).

Thanks,
Scott

From julian.egger at omahazoo.com Mon May 18 08:53:40 2015
From: julian.egger at omahazoo.com (Julian Egger)
Date: Mon, 18 May 2015 13:53:40 +0000
Subject: [maker-devel] Non-redundant Reference Human EST Data
In-Reply-To: <39827E89-0C94-4CC9-B04A-A6FB94E26B36@gmail.com>
References: <5DA2ECDF8921564C9F789B4DB8E4FE35A687@post.omahazoo.org> <39827E89-0C94-4CC9-B04A-A6FB94E26B36@gmail.com>
Message-ID: <5DA2ECDF8921564C9F789B4DB8E4FE35A730@post.omahazoo.org>

Hi Carson,

Thank you for the response. I had assumed using human transcripts instead of ESTs might be the way to go, but I had just read so much about using ESTs for annotation. You said to use either the human proteome or human transcripts; would using both in the MAKER setup file be too inefficient? As far as using either data type, since we are trying to annotate as many genes as possible from our scaffolds, we are looking for a single file to use. For either mRNA transcripts or protein sequences, would using data from ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/ be good reference data for our scaffolds? That directory has both protein files and RNA files. I am not sure if a good option would be to concatenate either the RNA files or the protein files together and use that as the est= or protein= file. Otherwise, is there a better reference source people use for MAKER?

Thanks,

Julian

From dence at genetics.utah.edu Mon May 18 09:38:15 2015
From: dence at genetics.utah.edu (Daniel Ence)
Date: Mon, 18 May 2015 14:38:15 +0000
Subject: [maker-devel] Non-redundant Reference Human EST Data
In-Reply-To: <5DA2ECDF8921564C9F789B4DB8E4FE35A730@post.omahazoo.org>
References: <5DA2ECDF8921564C9F789B4DB8E4FE35A687@post.omahazoo.org> <39827E89-0C94-4CC9-B04A-A6FB94E26B36@gmail.com> <5DA2ECDF8921564C9F789B4DB8E4FE35A730@post.omahazoo.org>
Message-ID: <5B996D8E-8D55-46DF-86BE-1E8FB352CE5F@genetics.utah.edu>

Hi Julian,

The RefSeq NM models would be a good place to start for evidence, since those are curated manually. Don't concatenate the protein and EST files together; putting amino acid sequence in as EST will only give you errors in BLAST, and vice versa.

The number of files you use in your annotation doesn't matter as much as the quantity and breadth of the evidence that you use.

I don't think that putting the RefSeq models in as both EST and protein evidence will make a big difference, but you'd have to put the human ESTs in as alt_ests, which takes longer to BLAST.

I think a good rule of thumb for the protein evidence is to have protein evidence from two genomes that are about the same distance from your target genome and a third set from a genome that's an outgroup to all three genomes. Another good source for protein evidence is the UniRef database.

Let me know if that helps,
Daniel

From carsonhh at gmail.com Mon May 18 10:08:45 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 18 May 2015 09:08:45 -0600
Subject: [maker-devel] Non-redundant Reference Human EST Data
In-Reply-To: <5B996D8E-8D55-46DF-86BE-1E8FB352CE5F@genetics.utah.edu>
References: <5DA2ECDF8921564C9F789B4DB8E4FE35A687@post.omahazoo.org> <39827E89-0C94-4CC9-B04A-A6FB94E26B36@gmail.com> <5DA2ECDF8921564C9F789B4DB8E4FE35A730@post.omahazoo.org> <5B996D8E-8D55-46DF-86BE-1E8FB352CE5F@genetics.utah.edu>
Message-ID: <175D2039-B0C2-43D8-9C12-49F4124E80F4@gmail.com>

If you decide to use human transcripts because it is a closely related primate, put them into the est= option. Like I said in the previous e-mail, they may not align (because of nucleotide divergence), but you don't want to use the alt_est option because you have proteins, which will align better than alt_est and will align much faster. You can use both transcripts and proteins if you want. Don't use human ESTs. There will be no benefit. They contain the same information as the annotated transcripts and proteins but will be noisier. You can download human transcripts and proteins from the RefSeq FTP server. Then add any additional proteomes you choose to use as additional evidence.
Novel genes will only be discoverable if you have ESTs from the species being annotated. But without those, you can still identify orthologs and paralogs from other species.

--Carson

From carsonhh at gmail.com Mon May 18 10:16:59 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 18 May 2015 09:16:59 -0600
Subject: [maker-devel] Non-redundant Reference Human EST Data
In-Reply-To: <175D2039-B0C2-43D8-9C12-49F4124E80F4@gmail.com>
References: <5DA2ECDF8921564C9F789B4DB8E4FE35A687@post.omahazoo.org> <39827E89-0C94-4CC9-B04A-A6FB94E26B36@gmail.com> <5DA2ECDF8921564C9F789B4DB8E4FE35A730@post.omahazoo.org> <5B996D8E-8D55-46DF-86BE-1E8FB352CE5F@genetics.utah.edu> <175D2039-B0C2-43D8-9C12-49F4124E80F4@gmail.com>
Message-ID: <322294B9-1E4F-4D3F-A6D8-C735FFE2A2B5@gmail.com>

Best sources -->
ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/protein/
ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/RNA/

--Carson

From julian.egger at omahazoo.com Mon May 18 10:17:46 2015
From: julian.egger at omahazoo.com (Julian Egger)
Date: Mon, 18 May 2015 15:17:46 +0000
Subject: [maker-devel] Non-redundant Reference Human EST Data
In-Reply-To: <322294B9-1E4F-4D3F-A6D8-C735FFE2A2B5@gmail.com>
References: <5DA2ECDF8921564C9F789B4DB8E4FE35A687@post.omahazoo.org> <39827E89-0C94-4CC9-B04A-A6FB94E26B36@gmail.com> <5DA2ECDF8921564C9F789B4DB8E4FE35A730@post.omahazoo.org> <5B996D8E-8D55-46DF-86BE-1E8FB352CE5F@genetics.utah.edu> <175D2039-B0C2-43D8-9C12-49F4124E80F4@gmail.com>, <322294B9-1E4F-4D3F-A6D8-C735FFE2A2B5@gmail.com>
Message-ID: <5DA2ECDF8921564C9F789B4DB8E4FE35A763@post.omahazoo.org>

Ok great. I had looked at those files as well. So I could set both est=rna.fa and protein=protein.fa with files from

ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/protein/
ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/RNA/

? Would that be worthwhile, or would it just slow things down to a point where it wouldn't be worth it?
Thanks again,

Julian
________________________________
From: Carson Holt [carsonhh at gmail.com]
Sent: Monday, May 18, 2015 10:16 AM
To: Julian Egger
Cc: maker-devel at yandell-lab.org; Daniel Ence
Subject: Re: [maker-devel] Non-redundant Reference Human EST Data

Best sources -->
ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/protein/
ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/RNA/

--Carson

On May 18, 2015, at 9:08 AM, Carson Holt wrote:

If you decide to use human transcripts because it is a closely related primate, put them into the est= option. Like I said in the previous e-mail, they may not align (because of nucleotide divergence), but you don't want to use the alt_est option because you have proteins which will align better than alt_est and will align much faster. You can use both transcripts and proteins if you want. Don't use human ESTs. There will be no benefit. They contain the same information as the annotated transcripts and proteins but will be noisier. You can download human transcripts and proteins from the RefSeq FTP server. Then add any additional proteomes you choose to use as additional evidence. Novel genes will only be discoverable if you have ESTs from the species being annotated. But without those, you can still identify orthologs and paralogs from other species.

--Carson

On May 18, 2015, at 8:38 AM, Daniel Ence wrote:

Hi Julian,

The RefSeq NM models would be a good place to start for evidence, since those are curated manually. Don't concatenate the protein and EST files together; putting amino acid seq in as EST will only give you errors in blast and vice versa.

The number of files you use in your annotation doesn't matter as much as the quantity and breadth of the evidence that you use.

I don't think that putting the RefSeq models as EST and protein evidence will make a big difference, but you'd have to put the human ESTs in as alt_ests, which takes longer to blast.
I think a good rule of thumb for the protein evidence is to have protein evidence from two genome that are about the same distance from your target genome and a third set from a genome that?s an outgroup to all three genomes. Another good source for protein evidence is the UniRef database. Let me know if that helps, Daniel On May 18, 2015, at 7:53 AM, Julian Egger > wrote: Hi Carson, Thank you for the response. I had assumed using human transcripts instead of ESTs might be the way to go, but I had just read so much about using ESTs for annotation. You said to use either the human proteome or human transcripts, would using both in the MAKER setup file be too inefficient? As far as using either data type, since we are trying to annotate as many genes as possible from our scaffolds, we are looking for a single file to use. For either mRNA transcripts or protein sequences, would using data from ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/be good reference data for our scaffolds? That directory has both protein files and rna files. I am not sure if a good option would be too concatenate either the rna files or protein files together and use that as the est= or protein= file. Otherwise, is there a better reference source people use for MAKER? Thanks, Julian From: Carson Holt [mailto:carsonhh at gmail.com] Sent: Friday, May 15, 2015 10:31 AM To: Julian Egger Cc: maker-devel at yandell-lab.org Subject: Re: [maker-devel] Non-redundant Reference Human EST Data Hi Julian, Using Human EST?s on primate contigs would be very inefficient. The human genome is already annotated, so you should instead use either the human proteome or human transcripts as input. Using EST?s from another species other than the one being annotated should only be done if there is not a curated annotation set to use instead. You may be able to just give the human transcripts to the est= option if the two organisms have not diverged too much in nucleotide sequence. 
Don?t use the alt_est option since you have human protein annotations. The alt_est option uses tblastx to seed the alignments which will not be as accurate as the protein= option that seeds via blastx, and it is about 10 time more expensive computationally. So it will take a lot longer and you won?t get anything that you couldn?t have found using the protein data instead. Also scaffolds shorter than about 10kb will likely be too short to annotate, so you can test out your parameters on a few of only the largest contigs. In addition, don?t use SNAP because it performs poorly on primate genomes (use Augustus instead). Thanks, Carson On May 14, 2015, at 12:17 PM, Julian Egger > wrote: We have assembled scaffolds from genomic reads of a primate sample and would like to annotate as many genes as possible with MAKER. Where is the best place to find an EST file to use with MAKER containing all of the non-redundant reference humans ESTs? Was trying to look around NCBI, Ensembl, and UCSC, but not sure what the ftp site, subdirectory, and file name would be for something like that. Thanks _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.com http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.com http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... 
URL:

From carsonhh at gmail.com Mon May 18 10:31:36 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 18 May 2015 09:31:36 -0600
Subject: [maker-devel] Non-redundant Reference Human EST Data
In-Reply-To: <5DA2ECDF8921564C9F789B4DB8E4FE35A763@post.omahazoo.org>
References: <5DA2ECDF8921564C9F789B4DB8E4FE35A687@post.omahazoo.org> <39827E89-0C94-4CC9-B04A-A6FB94E26B36@gmail.com> <5DA2ECDF8921564C9F789B4DB8E4FE35A730@post.omahazoo.org> <5B996D8E-8D55-46DF-86BE-1E8FB352CE5F@genetics.utah.edu> <175D2039-B0C2-43D8-9C12-49F4124E80F4@gmail.com> <322294B9-1E4F-4D3F-A6D8-C735FFE2A2B5@gmail.com> <5DA2ECDF8921564C9F789B4DB8E4FE35A763@post.omahazoo.org>
Message-ID: <36E62C76-8F05-4BA5-8CEE-91E68A08FB79@gmail.com>

You have to have protein evidence from some source. Preferably at least two somewhat related organisms. Proteins take a while to align (amino acid alignment is computationally intensive), ESTs not so much.

--Carson

> On May 18, 2015, at 9:17 AM, Julian Egger wrote:
>
> Ok great. I had looked at those files as well. So I could set both est=rna.fa and protein=protein.fa with files from
> ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/protein/
> ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/RNA/ ?
>
> Would that be worthwhile or would it just slow things up to a point where it wouldn't be worth it?
>
> Thanks again,
>
> Julian
> From: Carson Holt [carsonhh at gmail.com]
> Sent: Monday, May 18, 2015 10:16 AM
> To: Julian Egger
> Cc: maker-devel at yandell-lab.org; Daniel Ence
> Subject: Re: [maker-devel] Non-redundant Reference Human EST Data
>
> Best sources -->
> ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/protein/
> ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/RNA/
>
> --Carson
>
>
>
>> On May 18, 2015, at 9:08 AM, Carson Holt wrote:
>>
>> If you decide to use human transcripts because it is a closely related primate, put them into the est= option.
Like I said in the previous e-mail, they may not align (because of nucleotide divergence), but you don?t want to use the alt_est option because you have proteins which will align better than alt_est and will align much faster. You can use both transcripts and proteins if you want. Don?t use human EST?s. There will be no benefit. They contain the same information as the annotated transcripts and proteins but will be noisier. You can download human transcripts and proteins from the RefSeq FTP server. Then add any additional proteomes you choose to use as additional evidence. Novel genes will only be discoverable if you have EST?s from the species being annotated. But without those, you can still identify orthologs and paralogs from other species. >> >> ?Carson >> >> >>> On May 18, 2015, at 8:38 AM, Daniel Ence > wrote: >>> >>> Hi Julian, >>> >>> The RefSeq NM models would be a good place to start for evidence, since those are curated manually. Don?t concatenate the protein and EST files together; putting amino acid seq in as EST will only give you errors in blast and vice versa. >>> >>> The number of files you use in your annotation doesn?t matter as much as the quantity and breadth of the evidence that you use. >>> >>> I don?t think that putting the RefSeq models as EST and protein evidence will make a big difference, but you?d have to put the human ESTs in as alt_ests, which takes longer to blast. >>> >>> I think a good rule of thumb for the protein evidence is to have protein evidence from two genome that are about the same distance from your target genome and a third set from a genome that?s an outgroup to all three genomes. Another good source for protein evidence is the UniRef database. >>> >>> Let me know if that helps, >>> Daniel >>> >>> >>> >>> >>>> On May 18, 2015, at 7:53 AM, Julian Egger > wrote: >>>> >>>> Hi Carson, >>>> >>>> Thank you for the response. 
I had assumed using human transcripts instead of ESTs might be the way to go, but I had just read so much about using ESTs for annotation. You said to use either the human proteome or human transcripts, would using both in the MAKER setup file be too inefficient? As far as using either data type, since we are trying to annotate as many genes as possible from our scaffolds, we are looking for a single file to use. For either mRNA transcripts or protein sequences, would using data from ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/ be good reference data for our scaffolds? That directory has both protein files and rna files. I am not sure if a good option would be too concatenate either the rna files or protein files together and use that as the est= or protein= file. Otherwise, is there a better reference source people use for MAKER? >>>> >>>> Thanks, >>>> >>>> Julian >>>> >>>> From: Carson Holt [mailto:carsonhh at gmail.com ] >>>> Sent: Friday, May 15, 2015 10:31 AM >>>> To: Julian Egger >>>> Cc: maker-devel at yandell-lab.org >>>> Subject: Re: [maker-devel] Non-redundant Reference Human EST Data >>>> >>>> Hi Julian, >>>> >>>> Using Human EST?s on primate contigs would be very inefficient. The human genome is already annotated, so you should instead use either the human proteome or human transcripts as input. Using EST?s from another species other than the one being annotated should only be done if there is not a curated annotation set to use instead. >>>> >>>> You may be able to just give the human transcripts to the est= option if the two organisms have not diverged too much in nucleotide sequence. Don?t use the alt_est option since you have human protein annotations. The alt_est option uses tblastx to seed the alignments which will not be as accurate as the protein= option that seeds via blastx, and it is about 10 time more expensive computationally. 
So it will take a lot longer and you won?t get anything that you couldn?t have found using the protein data instead. >>>> >>>> Also scaffolds shorter than about 10kb will likely be too short to annotate, so you can test out your parameters on a few of only the largest contigs. In addition, don?t use SNAP because it performs poorly on primate genomes (use Augustus instead). >>>> >>>> Thanks, >>>> Carson >>>> >>>> >>>> >>>> On May 14, 2015, at 12:17 PM, Julian Egger > wrote: >>>> >>>> We have assembled scaffolds from genomic reads of a primate sample and would like to annotate as many genes as possible with MAKER. Where is the best place to find an EST file to use with MAKER containing all of the non-redundant reference humans ESTs? Was trying to look around NCBI, Ensembl, and UCSC, but not sure what the ftp site, subdirectory, and file name would be for something like that. >>>> >>>> Thanks >>>> >>>> _______________________________________________ >>>> maker-devel mailing list >>>> maker-devel at box290.bluehost.com >>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org >>>> >>>> _______________________________________________ >>>> maker-devel mailing list >>>> maker-devel at box290.bluehost.com >>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From julian.egger at omahazoo.com Tue May 19 14:51:54 2015 From: julian.egger at omahazoo.com (Julian Egger) Date: Tue, 19 May 2015 19:51:54 +0000 Subject: [maker-devel] Using Augustus with MAKER Message-ID: <5DA2ECDF8921564C9F789B4DB8E4FE35A807@post.omahazoo.org> I am trying to use Augustus in MAKER to help with annotating as many genes as possible from genomic reads of a primate sample. I am new to using gene prediction tools such as SNAP and Augustus, but was told Augustus would be better for primates. 
I tried using reference mRNAs and protein sequences from NCBI on the sample contig file included with the MAKER software and it ran ok. My question is how do I now use the output to train Augustus iteratively and thus create a file set of annotations from my original input?

After creating the control files with maker -CTL, the only configurations I made to maker_opts.ctl were:
genome=data/hsap_contig.fasta # contig file from example data
est=data/mRNAs.fa # RNAs filtered to just mRNAs
protein=data/protein.fa
est2genome=1
protein2genome=1

I will eventually replace the contig file with our scaffolds file from the assembly. I know the output created a gff file along with protein and mRNA files. Do I then need to change the maker_opts file to account for the new files and if so how and what should the maker_opts file look like now? Was Augustus supposed to be set up on the initial maker run or do I wait until the second run after est2genome and protein2genome were used to initialize training for Augustus and how do the configurations change between multiple iterations because I have a solid annotation set?

Sorry for all the questions, newbie here with a lot of data to work with.

Thanks
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From michael.s.campbell1 at gmail.com Tue May 19 16:18:43 2015
From: michael.s.campbell1 at gmail.com (Michael Campbell)
Date: Tue, 19 May 2015 15:18:43 -0600
Subject: [maker-devel] Using Augustus with MAKER
In-Reply-To: <5DA2ECDF8921564C9F789B4DB8E4FE35A807@post.omahazoo.org>
References: <5DA2ECDF8921564C9F789B4DB8E4FE35A807@post.omahazoo.org>
Message-ID:

Hi Julian,

Since you are annotating a primate I would use the pre-trained human parameter for augustus. Here is what I would try first

genome=data/hsap_contig.fasta # contig file from example data
est=data/mRNAs.fa # RNAs filtered to just mRNAs
protein=data/protein.fa
est2genome=0
protein2genome=0
augustus_species=human

You could also use one of the mammal HMMs packaged with SNAP as well, or use the output from the above to train SNAP. There are tutorials that walk through these steps here:

http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Main_Page

There is also a Current Protocols in Bioinformatics article on using MAKER that may help you get started as well.

http://onlinelibrary.wiley.com/doi/10.1002/0471250953.bi0411s48/abstract

Good luck,
Mike

On Tue, May 19, 2015 at 1:51 PM, Julian Egger wrote:
>
> I am trying to use Augustus in MAKER to help with annotating as many genes
> as possible from genomic reads of a primate sample. I am new to using gene
> prediction tools such as SNAP and Augustus, but was told Augustus would be
> better for primates. I tried using reference mRNAs and protein sequences
> from NCBI on the sample contig file included with the MAKER software and it
> ran ok. My question is how do I now use the output to train Augustus
> iteratively and thus create a file set of annotations from my original
> input?
>
> After creating the control files with maker -CTL, the only configurations
> I made to maker_opts.ctl were:
> genome=data/hsap_contig.fasta # contig file from example data
> est=data/mRNAs.fa # RNAs filtered to just mRNAs
> protein=data/protein.fa
> est2genome=1
> protein2genome=1
>
> I will eventually replace the contig file with our scaffolds file from the
> assembly. I know the output created a gff file along with protein and mRNA
> files. Do I then need to change the maker_opts file to account for the new
> files and if so how and what should the maker_opts file look like now?
> Was Augustus supposed to be set up on the initial maker run or do I wait
> until the second run after est2genome and protein2genome were used to
> initialize training for Augustus and how do the configurations change
> between multiple iterations because I have a solid annotation set?
>
> Sorry for all the questions, newbie here with a lot of data to work with.
>
> Thanks
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>
>

--
Michael Campbell MS, RD.
Doctoral Candidate
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330
ph:585-3543
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From carsonhh at gmail.com Tue May 19 16:48:29 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 19 May 2015 15:48:29 -0600
Subject: [maker-devel] Using Augustus with MAKER
In-Reply-To:
References: <5DA2ECDF8921564C9F789B4DB8E4FE35A807@post.omahazoo.org>
Message-ID: <6B295FB1-46C8-44B7-A816-66DF6F45D3E0@gmail.com>

A couple of corrections from the reply below.

SNAP doesn't work well on primates, so you probably don't want to use it (the mammal hmm is not a good replacement). This suggestion comes directly from the author of SNAP. There are ways to make it work by splitting the genome into isotigs but it's a little messy and technical, so just don't use it on primates.

Here's a good website on training Augustus (http://www.molecularevolution.org/molevolfiles/exercises/augustus/training.html). You need some sort of results to train with. You can either use results from a protein2genome run of MAKER or a run where you use human as your species together with other evidence in MAKER (models won't be perfect but will be enough to get training going).
Unless it's really really close evolutionarily to human, you probably don't just want to stick to the human species file (this is because you're not going to want to use SNAP, so you will need to optimize the one gene predictor you will get to use as much as possible).

You need models to be in GenBank format for training. There is a roundabout way to do this with GFF3 models. First use the script that comes with MAKER for training SNAP (maker2zff). Then follow SNAP's instructions on training SNAP (in SNAP's README). Basically the following commands (where the first two files came from maker2zff) -->

fathom genome.ann genome.dna -categorize 1000
fathom uni.ann uni.dna -export 1000 -plus

Then using this script from Jason Stajich, you can convert the export.ann and export.dna files to a GenBank format file -->
https://github.com/hyphaltip/genome-scripts/blob/master/gene_prediction/zff2augustus_gbk.pl

Go ahead and run with human as your species first, so you can review models and see how models and evidence correlate in a viewer like Apollo or IGV. But I still would recommend training Augustus to your species.

--Carson

> On May 19, 2015, at 3:18 PM, Michael Campbell wrote:
>
> Hi Julian,
>
> Since you are annotating a primate I would use the pre-trained human parameter for augustus. Here is what I would try first
>
> genome=data/hsap_contig.fasta # contig file from example data
> est=data/mRNAs.fa # RNAs filtered to just mRNAs
> protein=data/protein.fa
> est2genome=0
> protein2genome=0
> augustus_species=human
>
> You could also use one of the mammal HMMs packaged with SNAP as well, or use the output from the above to train SNAP. There are tutorials that walk through these steps here:
>
> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Main_Page
>
> There is also a Current Protocols in Bioinformatics article on using MAKER that may help you get started as well.
> > http://onlinelibrary.wiley.com/doi/10.1002/0471250953.bi0411s48/abstract > > Good luck, > Mike > > On Tue, May 19, 2015 at 1:51 PM, Julian Egger > wrote: > > I am trying to use Augustus in MAKER to help with annotating as many genes as possible from genomic reads of a primate sample. I am new to using gene prediction tools such as SNAP and Augustus, but was told Augustus would be better for primates. I tried using reference mRNAs and protein sequences from NCBI on the sample contig file included with the MAKER software and it ran ok. My question is how do I now use the output to train Augustus iteratively and thus create a file set of annotations from my original input? > > After creating the control files with maker -CTL, the only configurations I made to maker_opts.ctl were: > genome=data/hsap_contig.fasta # contig file from example data > est=data/mRNAs.fa # RNAs filtered to just mRNAs > protein=data/protein.fa > est2genome=1 > protein2genome=1 > > I will eventually replace the contig file with our scaffolds file from the assembly. I know the output created a gff file along with protein and mRNA files. Do I then need to change the maker_opts file to account for the new files and if so how and what should the maker__opts file look like now? Was Augustus supposed to be set up on the initial maker run or do I wait until the second run after est2genome and protein2genome were used to initialize training for Augustus and how do the configurations change between multiple iterations because I have a solid annotation set? > > Sorry for all the questions, newbie here with a lot of data to work with. > > Thanks > > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > > > > -- > Michael Campbell MS, RD. 
> Doctoral Candidate
> Eccles Institute of Human Genetics
> University of Utah
> 15 North 2030 East, Room 2100
> Salt Lake City, UT 84112-5330
> ph:585-3543
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From jcornel3 at asu.edu Wed May 27 17:57:35 2015
From: jcornel3 at asu.edu (John Cornelius)
Date: Wed, 27 May 2015 15:57:35 -0700
Subject: [maker-devel] Training Augustus
Message-ID:

Hi all,

I'm trying to train augustus with a non-model organism, I've run Maker, then trained and run SNAP twice and would now like to run Augustus on the results as well. I've seen the Augustus page on training the program and it mentioned needing a list of 200+ quality gene structures for training, is there a way that I could filter the SNAP results for the highest quality genes to feed into augustus?

Thanks.

--
John Cornelius
MCB PhD Candidate
Arizona State University
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From sjackman at gmail.com Fri May 1 13:34:10 2015
From: sjackman at gmail.com (Shaun Jackman)
Date: Fri, 01 May 2015 19:34:10 +0000
Subject: [maker-devel] Other GFF not passed through
Message-ID:

Hi, Carson.

I'm using other_gff to pass the following four-record GFF file of rRNA annotations through to the final GFF file. The rRNA records appear in the output of gff3_merge -s -n, but not when I add the -g option. Any thoughts why they've gone missing? These rRNA features almost certainly overlap repeat rRNA features discovered by RepeatMasker. Would that cause it? I'm happy to send along whatever data files you'd find useful to reproduce the issue.

Cheers,
Shaun

##gff-version 3
17 barrnap:0.5 rRNA 130358 131272 . - . Name=16S_rRNA;product=16S ribosomal RNA (partial);note=aligned only 57 percent of the 16S ribosomal RNA
18 barrnap:0.5 rRNA 6238 7489 . - . Name=12S_rRNA;product=12S ribosomal RNA
5 barrnap:0.5 rRNA 7 1246 . - . Name=12S_rRNA;product=12S ribosomal RNA
6 barrnap:0.5 rRNA 449193 450443 . + . Name=12S_rRNA;product=12S ribosomal RNA
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From carsonhh at gmail.com Fri May 1 14:22:57 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 1 May 2015 14:22:57 -0600
Subject: [maker-devel] Other GFF not passed through
In-Reply-To:
References:
Message-ID: <9BC05C91-8960-404F-9C9C-C17BCD7C844F@gmail.com>

gff3_merge is expecting to work with maker output, and the -g option specifically looks for maker produced genes (maker source tag). Since you added these lines using the other_gff option, they are in the file, but it doesn't necessarily mean downstream maker tools will know what to do with them because maker added them blindly without attempting any interpretation/validation, etc. I purposely don't try and make these tools support any GFF3 input possible, it just gets too hairy.

What you can do though is grep the features you want out separately into another GFF3 file and then you can use gff3_merge to combine those two files.

grep -P '\tbarrnap:0.5\t' infile.gff > barnap.gff
gff3_merge -s maker.gff barnap.gff > new.gff

Thanks,
Carson

> On May 1, 2015, at 1:34 PM, Shaun Jackman wrote:
>
> Hi, Carson.
>
> I'm using other_gff to pass the following four-record GFF file of rRNA annotations through to the final GFF file. The rRNA records appear in the output of gff3_merge -s -n, but not when I add the -g option. Any thoughts why they've gone missing? These rRNA features almost certainly overlap repeat rRNA features discovered by RepeatMasker. Would that cause it? I'm happy to send along whatever data files you'd find useful to reproduce the issue.
> > Cheers, > Shaun > > ##gff-version 3 > 17 barrnap:0.5 rRNA 130358 131272 . - . Name=16S_rRNA;product=16S ribosomal RNA (partial);note=aligned only 57 percent of the 16S ribosomal RNA > 18 barrnap:0.5 rRNA 6238 7489 . - . Name=12S_rRNA;product=12S ribosomal RNA > 5 barrnap:0.5 rRNA 7 1246 . - . Name=12S_rRNA;product=12S ribosomal RNA > 6 barrnap:0.5 rRNA 449193 450443 . + . Name=12S_rRNA;product=12S ribosomal RNA > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Fri May 1 14:54:40 2015 From: carsonhh at gmail.com (Carson Holt) Date: Fri, 1 May 2015 14:54:40 -0600 Subject: [maker-devel] Why would MAKER generate a large gff3? In-Reply-To: References: Message-ID: <1111BFE1-EE7D-4754-BD03-24F5EE66CB1F@gmail.com> > On May 1, 2015, at 2:53 PM, Carson Holt wrote: > > There is probably something weird about the way you are running it. For example did you give it the raw RNA-seq reads instead of the assembled reads? > > The total size of the final GFF3 will be all genes + all evidence alignments + the assembly fasta (concatenated at the end per GFF3 format specifications). You can remove the fasta or the evidence alignments from the file using the options found in gff3_merge. > > ?Carson > >> >> From: John Cornelius > >> Subject: Why would MAKER generate a large gff3? >> Date: May 1, 2015 at 2:46:45 PM MDT >> To: maker-devel at yandell-lab.org >> >> >> Hello, I'm using MAKER to generate a new annotation for an organism without an officially published genome (but it does exist and I'm using it). The current annotation is primarily predictive and I'm adding RNA-Seq evidence to improve it. 
However, after the initial run and the following two runs with SNAP, the gff3 file generated is 24 GB in size while the old annotation file is only 82 Mb. Should is be that large? Also, what is the best way to analyze a new annotation to figure out if it is actually in decent shape? Thanks. >> >> -- >> John Cornelius >> MCB PhD Candidate >> Arizona State University >> >> >> >> From: maker-devel-request at yandell-lab.org >> Subject: confirm d8a6466a8a63fe2312cc2ce7f79739414020644e >> Date: May 1, 2015 at 2:47:11 PM MDT >> >> >> If you reply to this message, keeping the Subject: header intact, >> Mailman will discard the held message. Do this if the message is >> spam. If you reply to this message and include an Approved: header >> with the list password in it, the message will be approved for posting >> to the list. The Approved: header can also appear in the first line >> of the body of the reply. >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From muriel.grosb at gmail.com Mon May 4 06:37:26 2015 From: muriel.grosb at gmail.com (Muriel Gros-Balthazard) Date: Mon, 4 May 2015 14:37:26 +0200 Subject: [maker-devel] Missing files for some contains Message-ID: <33606C49-1CC8-4909-953F-8B72D54843B9@yahoo.fr> Hello, After running Maker, I have many directories since I have many contigs. Each directory contains these files : .gff .maker.augustus.transcripts.fasta .maker.augustus_masked.proteins.fasta .maker.augustus.proteins.fasta .maker.augustus_masked.transcripts.fasta .maker.transcripts.fasta .maker.trnascan.transcripts.fasta .maker.proteins.fasta .maker.non_overlapping_ab_initio.transcripts.fasta .maker.non_overlapping_ab_initio.proteins.fasta run.log and the directory theVoid. However, for some contigs, one or several files are missing. I have more than 50% of contig directory missing the trnascan file. Some are missing also the two files ? ..non overlapping.. ? Some miss even more. 
However, for all these contigs for which files are missing, I have the gff file and it?s written FINISHED in the log file. So, my questions are : - why are those files missing ? - is it problematic ? Does it mean something didn?t work well ? - should I rerun Maker on these contigs ? Thank you ! Muriel -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Mon May 4 08:14:36 2015 From: carsonhh at gmail.com (Carson Holt) Date: Mon, 4 May 2015 08:14:36 -0600 Subject: [maker-devel] Missing files for some contains In-Reply-To: <33606C49-1CC8-4909-953F-8B72D54843B9@yahoo.fr> References: <33606C49-1CC8-4909-953F-8B72D54843B9@yahoo.fr> Message-ID: If there are no trnasscan results for the contig then there will be no trnascan fasta. The same is true for each of the other feature types. ?Carson > On May 4, 2015, at 6:37 AM, Muriel Gros-Balthazard wrote: > > Hello, > > After running Maker, I have many directories since I have many contigs. > > Each directory contains these files : > .gff > .maker.augustus.transcripts.fasta > .maker.augustus_masked.proteins.fasta > .maker.augustus.proteins.fasta > .maker.augustus_masked.transcripts.fasta > .maker.transcripts.fasta > .maker.trnascan.transcripts.fasta > .maker.proteins.fasta > .maker.non_overlapping_ab_initio.transcripts.fasta > .maker.non_overlapping_ab_initio.proteins.fasta > run.log > and the directory theVoid. > > However, for some contigs, one or several files are missing. > I have more than 50% of contig directory missing the trnascan file. > Some are missing also the two files ? ..non overlapping.. ? > Some miss even more. > > However, for all these contigs for which files are missing, I have the gff file and it?s written FINISHED in the log file. > So, my questions are : > - why are those files missing ? > - is it problematic ? Does it mean something didn?t work well ? > - should I rerun Maker on these contigs ? > > Thank you ! 
> > Muriel > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From solschech at gmail.com Tue May 5 01:41:18 2015 From: solschech at gmail.com (Sunny Sun) Date: Tue, 5 May 2015 09:41:18 +0200 Subject: [maker-devel] Fwd: all scaffolds failed In-Reply-To: References: Message-ID: Hi, I am trying to annotate with Maker a set of 7k scaffolds with a genome size of 160Mb. The first run returned 10% of scaffolds FAILED and the remaining FINISHED but I didn't get the protein or transcripts fasta files so I modified the configuration files accordingly and I am rerunning the analysis. So far, all the scaffolds are failing, in the error.log the only error I see is of this type for all scaffolds: error in file format, /tmp/I16VH9JkXG/Q3Pi3DDTOZ_mod ERROR: GeneMark Failed ERROR: Genemark failed --> rank=NA, hostname=sol ERROR: Failed while preparing ab-inits ERROR: Chunk failed at level:0, tier_type:2 FAILED CONTIG:scaffold_6 ERROR: Chunk failed at level:4, tier_type:0 FAILED CONTIG:scaffold_6 examining contents of the fasta file and run log which doesn't tell me much. I attached the config files. Can someone see what is wrong? Thanks -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: maker_bopts.ctl Type: application/octet-stream Size: 1412 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: maker_exe.ctl Type: application/octet-stream Size: 1535 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: maker_opts.ctl
Type: application/octet-stream
Size: 4866 bytes
Desc: not available
URL:

From michael.s.campbell1 at gmail.com Tue May 5 09:47:39 2015
From: michael.s.campbell1 at gmail.com (Michael Campbell)
Date: Tue, 5 May 2015 09:47:39 -0600
Subject: [maker-devel] Fwd: all scaffolds failed
In-Reply-To:
References:
Message-ID:

It looks like you are giving the SNAP training file to GeneMark, and that would cause a failure. Try running with just SNAP and Augustus and see if that fixes the problem.

Thanks,
Mike

--
Michael Campbell MS, RD.
Doctoral Candidate
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330
ph:585-3543
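A mix-up like the one Mike describes could be caught before a run with a small check of maker_opts.ctl. The following is only a minimal sketch in Python, not part of MAKER. It assumes the control file uses `key=value #comment` lines and that the SNAP and GeneMark models are set with `snaphmm` and `gmhmm`. The suffix rule (SNAP models end in `.hmm`, GeneMark models in `.mod`) is a heuristic assumption, and the function names and file names are hypothetical.

```python
def parse_ctl(text):
    """Parse MAKER-style control text ('key=value #comment') into a dict."""
    options = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if "=" in line:
            key, value = line.split("=", 1)
            options[key.strip()] = value.strip()
    return options


def check_predictor_models(options):
    """Return warnings for model files that look swapped (suffix heuristic only)."""
    warnings = []
    gmhmm = options.get("gmhmm", "")
    snaphmm = options.get("snaphmm", "")
    if gmhmm and not gmhmm.endswith(".mod"):
        warnings.append(f"gmhmm={gmhmm} does not look like a GeneMark .mod file")
    if snaphmm and not snaphmm.endswith(".hmm"):
        warnings.append(f"snaphmm={snaphmm} does not look like a SNAP .hmm file")
    if gmhmm and gmhmm == snaphmm:
        warnings.append("gmhmm and snaphmm point to the same file")
    return warnings


# Hypothetical control-file fragment in which the SNAP model was also given to GeneMark.
example = """
snaphmm=my_species.hmm #SNAP HMM file
gmhmm=my_species.hmm #GeneMark HMM file
augustus_species= #Augustus gene prediction species model
"""
for warning in check_predictor_models(parse_ctl(example)):
    print(warning)
```

The check only looks at single file names; a comma-separated list of models would need to be split first.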
From craig_coleman at byu.edu Fri May 8 15:54:48 2015
From: craig_coleman at byu.edu (Craig Coleman)
Date: Fri, 8 May 2015 21:54:48 +0000
Subject: [maker-devel] creating fasta ids
Message-ID: <1fa61ad0ddbd4401a580d9158921f946@MB10.byu.local>

Hi,

I have a fungal genome that I successfully ran through Maker. Rather than running GeneMark directly in Maker, I ran it separately and generated a gff file. I provided this gff file to Maker and ran SNAP and Augustus as well. The GeneMark gff3 file contained unique IDs for genes, and these are carried over into the protein and transcript fasta files. These IDs appear in the Maker gff file as the Name of the mRNA feature, but Maker creates a different ID for the GeneMark-generated features in the gff file. When I run maker_map_ids to create a mapping file of gene and transcript IDs, the program uses the Maker-generated ID instead of the GeneMark-generated name. Then, when I run map_fasta_ids, the Maker IDs in the map file do not match the names of the GeneMark proteins and transcripts in the fasta file. Protein and transcript models generated by SNAP and Augustus map just fine. I am hoping someone has a suggestion on how to solve this problem.

Craig Coleman

From carsonhh at gmail.com Mon May 11 11:41:10 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 11 May 2015 11:41:10 -0600
Subject: [maker-devel] creating fasta ids
In-Reply-To: <1fa61ad0ddbd4401a580d9158921f946@MB10.byu.local>
References: <1fa61ad0ddbd4401a580d9158921f946@MB10.byu.local>
Message-ID: <37C15F61-1571-4963-B429-CEA424D0E320@gmail.com>

If you passed the GeneMark results in as model_gff, pass them in as pred_gff instead. By using model_gff you are essentially telling MAKER to protect certain information in the Name tags, as well as to keep all models from that file regardless of evidence support.
Because of the way you ran things, the maker_map_ids step currently builds new names from the ID= portion. Name= has a specific protected meaning in GFF3 format, so if your Name= and ID= tags are not identical, the script knows the names are user-supplied values and cannot assume they are alterable.

--Carson
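To see which features are affected by the Name=/ID= distinction Carson describes, one can list the mRNA records whose Name= differs from their ID=. The sketch below is an illustration in Python only and is not how maker_map_ids itself works. The function names are hypothetical, and the example records use made-up identifiers rather than real MAKER or GeneMark names.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column ('ID=x;Name=y') into a dict."""
    attributes = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key] = value
    return attributes


def name_id_mismatches(gff_lines, feature_type="mRNA"):
    """Yield (ID, Name) for features whose Name= differs from their ID=."""
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments and blank lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9 or columns[2] != feature_type:
            continue
        attributes = parse_attributes(columns[8])
        if "Name" in attributes and attributes["Name"] != attributes.get("ID"):
            yield attributes.get("ID"), attributes["Name"]


# Hypothetical records, for illustration only.
records = [
    "##gff-version 3",
    "contig1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=mk-1-mRNA-1;Parent=mk-1;Name=gm_t0001",
    "contig1\tmaker\tmRNA\t2000\t2900\t.\t+\t.\tID=mk-2-mRNA-1;Parent=mk-2;Name=mk-2-mRNA-1",
]
for feature_id, name in name_id_mismatches(records):
    print(feature_id, name)
```

Only the first record is reported, because its Name= was supplied from outside and no longer matches its ID=.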
From craig_coleman at byu.edu Tue May 12 13:15:59 2015
From: craig_coleman at byu.edu (Craig Coleman)
Date: Tue, 12 May 2015 19:15:59 +0000
Subject: [maker-devel] creating fasta ids
In-Reply-To: <37C15F61-1571-4963-B429-CEA424D0E320@gmail.com>
References: <1fa61ad0ddbd4401a580d9158921f946@MB10.byu.local> <37C15F61-1571-4963-B429-CEA424D0E320@gmail.com>
Message-ID: <71683e53ac2a481895c7a50925197223@MB10.byu.local>

Thank you. Worked perfectly.

Craig

From chenwenbo1020 at gmail.com Tue May 12 15:56:46 2015
From: chenwenbo1020 at gmail.com (陈文博)
Date: Tue, 12 May 2015 17:56:46 -0400
Subject: [maker-devel] why no prediction
Message-ID:

Hi guys,

I have come across a weird case: one region of the genome has evidence, but no gene model was generated. Here are the details.

[image]

The colors mean:
pink: Augustus
light green: SNAP
dark pink: pred_gff
light yellow: cufflinks
darkest pink: EST alignment
dark yellow: protein alignment

In the region marked by the red frame, there are predictions from Augustus and pred_gff, and also evidence from cufflinks. Why is no gene model generated? I could find the gene model in the "XXXX.all.maker.non_overlapping_ab_initio.transcripts.fasta" file. It is weird, because it does have supporting evidence.

Does anyone know the reason?

Thanks very much!

Best,
Wenbo

-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 15364 bytes
Desc: not available
URL:

From carsonhh at gmail.com Tue May 12 16:16:33 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 12 May 2015 16:16:33 -0600
Subject: [maker-devel] why no prediction
In-Reply-To:
References:
Message-ID: <39947BAA-6172-4318-85C2-4E9288B5C700@gmail.com>

The structure of the evidence suggests a spurious prediction that happens to be overlapped by a spurious EST alignment. You would need at least overlapping protein evidence to make it believable. There is heavy discordance among the gene predictors. Another big factor is that the gene would be more than 90% UTR if the EST does in fact represent true expression. More likely it's a pseudogene or a semi-repetitive region. Not making this a gene was the right call.

--Carson
From carsonhh at gmail.com Tue May 12 16:18:53 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 12 May 2015 16:18:53 -0600
Subject: [maker-devel] why no prediction
In-Reply-To: <39947BAA-6172-4318-85C2-4E9288B5C700@gmail.com>
References: <39947BAA-6172-4318-85C2-4E9288B5C700@gmail.com>
Message-ID: <97DE16F1-4CD0-4E71-BC89-49CFFBDE8CA7@gmail.com>

Also, protein evidence will only be considered as support if it is in the same reading frame as the ab initio prediction. A complete mismatch of reading frames usually suggests a repeat-like region.

--Carson
From myandell at genetics.utah.edu Tue May 12 18:31:33 2015
From: myandell at genetics.utah.edu (Mark Yandell)
Date: Wed, 13 May 2015 00:31:33 +0000
Subject: [maker-devel] why no prediction
In-Reply-To: <97DE16F1-4CD0-4E71-BC89-49CFFBDE8CA7@gmail.com>
References: <39947BAA-6172-4318-85C2-4E9288B5C700@gmail.com> <97DE16F1-4CD0-4E71-BC89-49CFFBDE8CA7@gmail.com>
Message-ID:

And finally, check the splice sites of the spliced EST alignment: are they valid GT/AG or AT/AC?
From chenwenbo1020 at gmail.com Tue May 12 19:06:42 2015
From: chenwenbo1020 at gmail.com (陈文博)
Date: Tue, 12 May 2015 21:06:42 -0400
Subject: [maker-devel] why no prediction
In-Reply-To:
References: <39947BAA-6172-4318-85C2-4E9288B5C700@gmail.com> <97DE16F1-4CD0-4E71-BC89-49CFFBDE8CA7@gmail.com>
Message-ID:

Thank you for the help.

I double-checked the "EST alignment", and I am sorry: the darkest pink track is the transcripts assembled with Trinity, not ESTs. The splice sites are GT/AG. The cufflinks and Trinity results suggest that this region could be transcribed. There is an intact ORF in the prediction given by Augustus. Maybe this region is a real gene, but it was not predicted by Maker.

Thanks,
Wenbo
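The splice-site check raised in this thread (GT/AG or AT/AC at the intron boundaries) can be sketched as follows. This is a minimal illustration in Python, not a MAKER tool. It assumes 1-based inclusive intron coordinates as in GFF3, accepts only the two dinucleotide pairs named in the thread, and uses hypothetical function names and a made-up toy sequence.

```python
ACCEPTED_PAIRS = {("GT", "AG"), ("AT", "AC")}  # donor/acceptor pairs named in the thread


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def intron_splice_sites(genome_seq, start, end, strand="+"):
    """Return the (donor, acceptor) dinucleotides of an intron.

    start and end are 1-based inclusive intron coordinates on the forward strand.
    """
    intron = genome_seq[start - 1:end].upper()
    if strand == "-":
        intron = reverse_complement(intron)
    return intron[:2], intron[-2:]


def is_accepted_intron(genome_seq, start, end, strand="+"):
    """True if the intron boundaries match one of the accepted pairs."""
    return intron_splice_sites(genome_seq, start, end, strand) in ACCEPTED_PAIRS


# Toy sequence with an intron at positions 4-11 on the plus strand.
genome = "AAAGTCCCCAGTTT"
print(intron_splice_sites(genome, 4, 11))
print(is_accepted_intron(genome, 4, 11))
```

The intron coordinates themselves would come from the gaps between consecutive aligned blocks of the transcript alignment.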
From carsonhh at gmail.com Tue May 12 20:09:56 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 12 May 2015 20:09:56 -0600
Subject: [maker-devel] why no prediction
In-Reply-To:
References: <39947BAA-6172-4318-85C2-4E9288B5C700@gmail.com> <97DE16F1-4CD0-4E71-BC89-49CFFBDE8CA7@gmail.com>
Message-ID: <0E75FA8B-61F3-4FC2-9B77-7860EA2B14E2@gmail.com>

Hi Wenbo,

You will actually get more gene calls from gene predictors than there are genes (often orders of magnitude more), because workable ORFs are common in a genome. So having a single-exon ORF predicted is not really that noteworthy. You can expect those kinds of predictions to outnumber the true gene count by as much as 10 to 1 in some genomes.

The problem with the region you are showing is that it doesn't look like a gene. Even without a more detailed look at the coordinates and evidence overlap, the image lacks the concordance between evidence and predictions that would be expected in a genic region. Without some form of additional evidence, like a good protein match, it looks too much like the spurious overlap regions that you would expect to find randomly throughout a genome. Given this, there is just not enough support to promote the region to a gene. The predictions are still in the output for reference purposes, but they will not be promoted to a gene because the evidence support is insufficient.
Looking at this region, there are no good gene predictions from SNAP, Augustus, or your pred_gff either (poor concordance). The heavy discordance among the different gene predictors suggests they have not been sufficiently trained. One thing that can affect evidence alignment and gene predictor performance is insufficient masking of repeat elements. You may need to spend some time building a species-specific repeat database using tools like RepeatModeler. Stretches of N's in the sequence will also have an effect: you will get poor evidence alignments and predictions in what appears to be a large contig if there isn't enough continuous usable sequence.

I mention all these factors because the region in question looks spurious and unordered. Lack of concordance in clustering patterns generally means there are other structural issues with the dataset being used.

I've attached an image below to give an example. Notice how, in regions with genes, the different evidence types build on each other and have remarkable concordance (SNAP and Augustus choose very similar exon patterns, for example). Regions without genes still have aligned evidence from Trinity-assembled mRNA-seq and ab initio gene predictors, but the alignments and predictions are not concordant, are more spurious in nature, and can be found on both strands. Simple overlap is insufficient to generate a gene call. You have to consider the totality of evidence.

Thanks,
Carson
-------------- next part --------------
A non-text attachment was scrubbed...
Name: PastedGraphic-1.png
Type: image/png
Size: 51295 bytes
Desc: not available
URL:

From julian.egger at omahazoo.com Thu May 14 12:17:50 2015
From: julian.egger at omahazoo.com (Julian Egger)
Date: Thu, 14 May 2015 18:17:50 +0000
Subject: [maker-devel] Non-redundant Reference Human EST Data
Message-ID: <5DA2ECDF8921564C9F789B4DB8E4FE35A687@post.omahazoo.org>

We have assembled scaffolds from genomic reads of a primate sample and would like to annotate as many genes as possible with MAKER. Where is the best place to find an EST file to use with MAKER that contains all of the non-redundant reference human ESTs? I have been looking around NCBI, Ensembl, and UCSC, but I am not sure what the ftp site, subdirectory, and file name would be for something like that.

Thanks

From bertbrutzel at googlemail.com Fri May 15 01:08:42 2015
From: bertbrutzel at googlemail.com (Bert Brutzel)
Date: Fri, 15 May 2015 09:08:42 +0200
Subject: [maker-devel] Genbank submission
Message-ID: <55559B7A.2080906@gmail.com>

Dear All,

how do I reformat MAKER output into valid .tbl files for Genbank submission? I have already tried quite a few routes for formatting the .gff into a .tbl (e.g. sed+awk+GAG...), but I simply run into too many problems. I also tried to load the data into Chado, but this took over two weeks and exited with errors.
Maybe someone who has already submitted a MAKER-annotated genome to Genbank can help me... PLEASE.

Thank you very much,
Bert

From robert.king at rothamsted.ac.uk Fri May 15 09:12:34 2015
From: robert.king at rothamsted.ac.uk (Robert King)
Date: Fri, 15 May 2015 15:12:34 +0000
Subject: [maker-devel] Genbank submission
In-Reply-To: <429bc931-e703-411d-ba07-7994c62bc6e9@ROTHEX1.rothamsted.ac.uk>
References: <429bc931-e703-411d-ba07-7994c62bc6e9@ROTHEX1.rothamsted.ac.uk>
Message-ID: <136AB40E0C34CF4FB9AE0DD8C22A8D7B7F484D@rothex1.rothamsted.ac.uk>

Get the GFF file from MAKER and the fasta file ready. I then edit them in Geneious and export in EMBL format. We pay for Geneious, so you may not have it; but however you arrive at your final GFF file, you can use seqret to do the conversion too:

https://www.biostars.org/p/72220/

I have not submitted to NCBI, because I submit to ENA. They require a special header for the EMBL format, which means I have to edit the file before submitting to them.

Rob
From carsonhh at gmail.com Fri May 15 09:30:47 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 15 May 2015 09:30:47 -0600
Subject: [maker-devel] Non-redundant Reference Human EST Data
In-Reply-To: <5DA2ECDF8921564C9F789B4DB8E4FE35A687@post.omahazoo.org>
References: <5DA2ECDF8921564C9F789B4DB8E4FE35A687@post.omahazoo.org>
Message-ID: <39827E89-0C94-4CC9-B04A-A6FB94E26B36@gmail.com>

Hi Julian,

Using human ESTs on primate contigs would be very inefficient. The human genome is already annotated, so you should instead use either the human proteome or human transcripts as input. ESTs from a species other than the one being annotated should only be used if there is no curated annotation set to use instead. You may be able to just give the human transcripts to the est= option if the two organisms have not diverged too much in nucleotide sequence. Don't use the altest option, since you have human protein annotations. The altest option uses tblastx to seed the alignments, which is not as accurate as the protein= option that seeds via blastx, and it is about 10 times more expensive computationally. So it will take a lot longer and you won't get anything that you couldn't have found using the protein data instead.

Also, scaffolds shorter than about 10 kb will likely be too short to annotate, so you can test out your parameters on just a few of the largest contigs. In addition, don't use SNAP because it performs poorly on primate genomes (use Augustus instead).
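Picking a few of the largest scaffolds for a test run, as suggested above, can be done with a short script. This is a minimal sketch in Python under stated assumptions: plain multi-line FASTA input, a 10 kb cut-off taken from the rule of thumb in this message, and hypothetical function names. It is not part of MAKER.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def largest_scaffolds(records, count=5, min_length=10_000):
    """Return the `count` longest records that are at least `min_length` bp long."""
    long_enough = [record for record in records if len(record[1]) >= min_length]
    return sorted(long_enough, key=lambda record: len(record[1]), reverse=True)[:count]


# Toy input; in practice the handle would be an open genome FASTA file.
demo = io.StringIO(">s1\nACGT\n>s2\n" + "A" * 12000 + "\n>s3\n" + "C" * 15000 + "\n")
for header, sequence in largest_scaffolds(read_fasta(demo), count=2):
    print(header, len(sequence))
```

The selected records can then be written to a small FASTA file and used as the genome input while tuning parameters.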
Thanks,
Carson

From carsonhh at gmail.com Fri May 15 09:48:20 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 15 May 2015 09:48:20 -0600
Subject: [maker-devel] Genbank submission
In-Reply-To: <55559B7A.2080906@gmail.com>
References: <55559B7A.2080906@gmail.com>
Message-ID: <1248BA7E-FDC5-4CE3-815C-8D6B9B78A33E@gmail.com>

Here is an archived thread on this that might be useful as well:

https://groups.google.com/forum/#!searchin/maker-devel/genbank/maker-devel/qypkypBXVjs/aJACj38DpxMJ

--Carson
From cjfields at illinois.edu Sat May 16 09:51:49 2015
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Sat, 16 May 2015 15:51:49 +0000
Subject: [maker-devel] Genbank submission
In-Reply-To: <1248BA7E-FDC5-4CE3-815C-8D6B9B78A33E@gmail.com>
References: <55559B7A.2080906@gmail.com> <1248BA7E-FDC5-4CE3-815C-8D6B9B78A33E@gmail.com>
Message-ID: <3A43AEA3-A63A-4AC4-81A3-DA165F048722@illinois.edu>

We've been using GAG (mentioned in this thread), though with some fiddling. I have heard that ENA has a much easier submission process.

chris
From smg283 at gmail.com Sat May 16 22:11:21 2015
From: smg283 at gmail.com (Scott Geib)
Date: Sat, 16 May 2015 18:11:21 -1000
Subject: [maker-devel] maker-devel Digest, Vol 84, Issue 8
In-Reply-To:
References:
Message-ID:

If anyone has bugs or suggestions for GAG, let us know and we can modify it. Right now we are fixing some bugs and applying it to a new dataset, so this is a good time to add anything people might find useful. Email me or Brian (bhall7 at hawaii.edu).

Thanks,
Scott
From julian.egger at omahazoo.com Mon May 18 07:53:40 2015
From: julian.egger at omahazoo.com (Julian Egger)
Date: Mon, 18 May 2015 13:53:40 +0000
Subject: [maker-devel] Non-redundant Reference Human EST Data
In-Reply-To: <39827E89-0C94-4CC9-B04A-A6FB94E26B36@gmail.com>
References: <5DA2ECDF8921564C9F789B4DB8E4FE35A687@post.omahazoo.org> <39827E89-0C94-4CC9-B04A-A6FB94E26B36@gmail.com>
Message-ID: <5DA2ECDF8921564C9F789B4DB8E4FE35A730@post.omahazoo.org>

Hi Carson,

Thank you for the response. I had assumed using human transcripts instead of ESTs might be the way to go, but I had read so much about using ESTs for annotation. You said to use either the human proteome or human transcripts; would using both in the MAKER setup file be too inefficient? As far as using either data type, since we are trying to annotate as many genes as possible from our scaffolds, we are looking for a single file to use. For either mRNA transcripts or protein sequences, would data from ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/ be good reference data for our scaffolds? That directory has both protein files and RNA files. I am not sure whether a good option would be to concatenate the RNA files (or the protein files) together and use the result as the est= or protein= file. Otherwise, is there a better reference source people use for MAKER?

Thanks,

Julian

From: Carson Holt [mailto:carsonhh at gmail.com]
Sent: Friday, May 15, 2015 10:31 AM
To: Julian Egger
Cc: maker-devel at yandell-lab.org
Subject: Re: [maker-devel] Non-redundant Reference Human EST Data

Hi Julian,

Using human ESTs on primate contigs would be very inefficient. The human genome is already annotated, so you should instead use either the human proteome or human transcripts as input. Using ESTs from a species other than the one being annotated should only be done if there is not a curated annotation set to use instead.

You may be able to just give the human transcripts to the est= option if the two organisms have not diverged too much in nucleotide sequence. Don't use the alt_est option since you have human protein annotations. The alt_est option uses tblastx to seed the alignments, which will not be as accurate as the protein= option that seeds via blastx, and it is about 10 times more expensive computationally. So it will take a lot longer and you won't get anything that you couldn't have found using the protein data instead.

Also, scaffolds shorter than about 10kb will likely be too short to annotate, so you can test out your parameters on just a few of the largest contigs. In addition, don't use SNAP because it performs poorly on primate genomes (use Augustus instead).

Thanks,
Carson

On May 14, 2015, at 12:17 PM, Julian Egger wrote:

We have assembled scaffolds from genomic reads of a primate sample and would like to annotate as many genes as possible with MAKER. Where is the best place to find an EST file to use with MAKER containing all of the non-redundant reference human ESTs? I was trying to look around NCBI, Ensembl, and UCSC, but I am not sure what the ftp site, subdirectory, and file name would be for something like that.

Thanks
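Carson's suggestion to test parameters on only a few of the largest contigs can be sketched in shell roughly as follows. This is a minimal illustration rather than anything shipped with MAKER: `largest_seqs` is a hypothetical helper name, the input is assumed to be an ordinary FASTA file whose header lines contain no tab characters, and each sequence is held in memory as one string, so it may be slow on very large scaffolds.

```shell
#!/bin/sh
# largest_seqs N FASTA
# Print the N longest records of a FASTA file, longest first.
# Sequences are written back unwrapped (one line per record).
# Hypothetical helper for illustration; not part of MAKER.
largest_seqs() {
    n=$1
    fasta=$2
    # Step 1: flatten each record to "length<TAB>header<TAB>sequence".
    awk '
        /^>/ {
            if (name != "") print length(seq) "\t" name "\t" seq
            name = $0
            seq = ""
            next
        }
        { seq = seq $0 }
        END { if (name != "") print length(seq) "\t" name "\t" seq }
    ' "$fasta" |
    # Step 2: sort by length (numeric, descending) and keep the top N.
    sort -k1,1nr |
    head -n "$n" |
    # Step 3: turn each flattened row back into a two-line FASTA record.
    awk -F '\t' '{ print $2; print $3 }'
}

# Example (file names are placeholders):
#   largest_seqs 5 scaffolds.fa > test_scaffolds.fa
```

The resulting small FASTA file could then be used as the genome input for a trial run before annotating the full assembly.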
From dence at genetics.utah.edu Mon May 18 08:38:15 2015
From: dence at genetics.utah.edu (Daniel Ence)
Date: Mon, 18 May 2015 14:38:15 +0000
Subject: [maker-devel] Non-redundant Reference Human EST Data
In-Reply-To: <5DA2ECDF8921564C9F789B4DB8E4FE35A730@post.omahazoo.org>
References: <5DA2ECDF8921564C9F789B4DB8E4FE35A687@post.omahazoo.org> <39827E89-0C94-4CC9-B04A-A6FB94E26B36@gmail.com> <5DA2ECDF8921564C9F789B4DB8E4FE35A730@post.omahazoo.org>
Message-ID: <5B996D8E-8D55-46DF-86BE-1E8FB352CE5F@genetics.utah.edu>

Hi Julian,

The RefSeq NM models would be a good place to start for evidence, since those are curated manually. Don't concatenate the protein and EST files together; putting amino acid sequence in as EST will only give you errors in blast, and vice versa.

The number of files you use in your annotation doesn't matter as much as the quantity and breadth of the evidence that you use.

I don't think that putting the RefSeq models in as EST and protein evidence will make a big difference, but you'd have to put the human ESTs in as alt_ests, which takes longer to blast.

I think a good rule of thumb for the protein evidence is to have protein evidence from two genomes that are about the same distance from your target genome and a third set from a genome that's an outgroup to all three genomes. Another good source for protein evidence is the UniRef database.

Let me know if that helps,
Daniel

From carsonhh at gmail.com Mon May 18 09:08:45 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 18 May 2015 09:08:45 -0600
Subject: [maker-devel] Non-redundant Reference Human EST Data
In-Reply-To: <5B996D8E-8D55-46DF-86BE-1E8FB352CE5F@genetics.utah.edu>
References: <5DA2ECDF8921564C9F789B4DB8E4FE35A687@post.omahazoo.org> <39827E89-0C94-4CC9-B04A-A6FB94E26B36@gmail.com> <5DA2ECDF8921564C9F789B4DB8E4FE35A730@post.omahazoo.org> <5B996D8E-8D55-46DF-86BE-1E8FB352CE5F@genetics.utah.edu>
Message-ID: <175D2039-B0C2-43D8-9C12-49F4124E80F4@gmail.com>

If you decide to use human transcripts because it is a closely related primate, put them into the est= option. Like I said in the previous e-mail, they may not align (because of nucleotide divergence), but you don't want to use the alt_est option because you have proteins, which will align better than alt_est and will align much faster. You can use both transcripts and proteins if you want. Don't use human ESTs. There will be no benefit. They contain the same information as the annotated transcripts and proteins but will be noisier. You can download human transcripts and proteins from the RefSeq FTP server. Then add any additional proteomes you choose to use as additional evidence.
Novel genes will only be discoverable if you have ESTs from the species being annotated. But without those, you can still identify orthologs and paralogs from other species.

--Carson

From carsonhh at gmail.com Mon May 18 09:16:59 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 18 May 2015 09:16:59 -0600
Subject: [maker-devel] Non-redundant Reference Human EST Data
In-Reply-To: <175D2039-B0C2-43D8-9C12-49F4124E80F4@gmail.com>
References: <5DA2ECDF8921564C9F789B4DB8E4FE35A687@post.omahazoo.org> <39827E89-0C94-4CC9-B04A-A6FB94E26B36@gmail.com> <5DA2ECDF8921564C9F789B4DB8E4FE35A730@post.omahazoo.org> <5B996D8E-8D55-46DF-86BE-1E8FB352CE5F@genetics.utah.edu> <175D2039-B0C2-43D8-9C12-49F4124E80F4@gmail.com>
Message-ID: <322294B9-1E4F-4D3F-A6D8-C735FFE2A2B5@gmail.com>

Best sources ->
ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/protein/
ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/RNA/

--Carson

From julian.egger at omahazoo.com Mon May 18 09:17:46 2015
From: julian.egger at omahazoo.com (Julian Egger)
Date: Mon, 18 May 2015 15:17:46 +0000
Subject: [maker-devel] Non-redundant Reference Human EST Data
In-Reply-To: <322294B9-1E4F-4D3F-A6D8-C735FFE2A2B5@gmail.com>
References: <5DA2ECDF8921564C9F789B4DB8E4FE35A687@post.omahazoo.org> <39827E89-0C94-4CC9-B04A-A6FB94E26B36@gmail.com> <5DA2ECDF8921564C9F789B4DB8E4FE35A730@post.omahazoo.org> <5B996D8E-8D55-46DF-86BE-1E8FB352CE5F@genetics.utah.edu> <175D2039-B0C2-43D8-9C12-49F4124E80F4@gmail.com>, <322294B9-1E4F-4D3F-A6D8-C735FFE2A2B5@gmail.com>
Message-ID: <5DA2ECDF8921564C9F789B4DB8E4FE35A763@post.omahazoo.org>

Ok great. I had looked at those files as well. So I could set both est=rna.fa and protein=protein.fa with files from
ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/protein/
ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/RNA/ ?

Would that be worthwhile, or would it just slow things down to a point where it wouldn't be worth it?
Thanks again,

Julian
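The est= and protein= settings discussed in this thread live in MAKER's control file. A minimal shell sketch for setting them without hand-editing follows. It assumes the control file is named maker_opts.ctl and uses one `key=value #comment` entry per line; `set_opt` is a hypothetical helper, and rna.fa / protein.fa are the placeholder file names from Julian's message.

```shell
#!/bin/sh
# set_opt KEY VALUE FILE
# Rewrite the line that starts with "KEY=" in a MAKER control file,
# keeping any trailing "#comment" on that line. Other lines pass
# through untouched; matching is anchored at the start of the line,
# so setting "est" does not touch "alt_est".
# Hypothetical helper for illustration; not part of MAKER.
set_opt() {
    key=$1
    value=$2
    file=$3
    awk -v k="$key" -v v="$value" '
        index($0, k "=") == 1 {
            comment = ""
            pos = index($0, "#")
            if (pos > 0) comment = " " substr($0, pos)
            print k "=" v comment
            next
        }
        { print }
    ' "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}

# Example, following the advice in this thread (alt_est left empty):
#   set_opt est rna.fa maker_opts.ctl
#   set_opt protein protein.fa maker_opts.ctl
# Assuming the local Augustus install provides a "human" species model:
#   set_opt augustus_species human maker_opts.ctl
```

Writing to a temporary file and renaming it avoids relying on in-place editing flags, which differ between sed implementations.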
URL: From carsonhh at gmail.com Mon May 18 09:31:36 2015 From: carsonhh at gmail.com (Carson Holt) Date: Mon, 18 May 2015 09:31:36 -0600 Subject: [maker-devel] Non-redundant Reference Human EST Data In-Reply-To: <5DA2ECDF8921564C9F789B4DB8E4FE35A763@post.omahazoo.org> References: <5DA2ECDF8921564C9F789B4DB8E4FE35A687@post.omahazoo.org> <39827E89-0C94-4CC9-B04A-A6FB94E26B36@gmail.com> <5DA2ECDF8921564C9F789B4DB8E4FE35A730@post.omahazoo.org> <5B996D8E-8D55-46DF-86BE-1E8FB352CE5F@genetics.utah.edu> <175D2039-B0C2-43D8-9C12-49F4124E80F4@gmail.com> <322294B9-1E4F-4D3F-A6D8-C735FFE2A2B5@gmail.com> <5DA2ECDF8921564C9F789B4DB8E4FE35A763@post.omahazoo.org> Message-ID: <36E62C76-8F05-4BA5-8CEE-91E68A08FB79@gmail.com> You have to have protein evidence from some source. Preferably at least two somewhat related organisms. Proteins take a while to align (amino acid alignment is computationally intensive), EST?s not so much. ?Carson > On May 18, 2015, at 9:17 AM, Julian Egger wrote: > > Ok great. I had looked at those files as well. So I could set both est=rna.fa and protein=protein.fa with files from > ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/protein/ > ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/RNA/ ? > > Would that be worthwhile or would it just slow things up to a point where it wouldn't be worth it? > > Thanks again, > > Julian > From: Carson Holt [carsonhh at gmail.com ] > Sent: Monday, May 18, 2015 10:16 AM > To: Julian Egger > Cc: maker-devel at yandell-lab.org ; Daniel Ence > Subject: Re: [maker-devel] Non-redundant Reference Human EST Data > > Best sources ?> > ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/protein/ > ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/RNA/ > > ?Carson > > > >> On May 18, 2015, at 9:08 AM, Carson Holt > wrote: >> >> If you decide to use human transcripts because it is a closely related primate, put them into the est= option. 
Like I said in the previous e-mail, they may not align (because of nucleotide divergence), but you don?t want to use the alt_est option because you have proteins which will align better than alt_est and will align much faster. You can use both transcripts and proteins if you want. Don?t use human EST?s. There will be no benefit. They contain the same information as the annotated transcripts and proteins but will be noisier. You can download human transcripts and proteins from the RefSeq FTP server. Then add any additional proteomes you choose to use as additional evidence. Novel genes will only be discoverable if you have EST?s from the species being annotated. But without those, you can still identify orthologs and paralogs from other species. >> >> ?Carson >> >> >>> On May 18, 2015, at 8:38 AM, Daniel Ence > wrote: >>> >>> Hi Julian, >>> >>> The RefSeq NM models would be a good place to start for evidence, since those are curated manually. Don?t concatenate the protein and EST files together; putting amino acid seq in as EST will only give you errors in blast and vice versa. >>> >>> The number of files you use in your annotation doesn?t matter as much as the quantity and breadth of the evidence that you use. >>> >>> I don?t think that putting the RefSeq models as EST and protein evidence will make a big difference, but you?d have to put the human ESTs in as alt_ests, which takes longer to blast. >>> >>> I think a good rule of thumb for the protein evidence is to have protein evidence from two genome that are about the same distance from your target genome and a third set from a genome that?s an outgroup to all three genomes. Another good source for protein evidence is the UniRef database. >>> >>> Let me know if that helps, >>> Daniel >>> >>> >>> >>> >>>> On May 18, 2015, at 7:53 AM, Julian Egger > wrote: >>>> >>>> Hi Carson, >>>> >>>> Thank you for the response. 
I had assumed using human transcripts instead of ESTs might be the way to go, but I had just read so much about using ESTs for annotation. You said to use either the human proteome or human transcripts; would using both in the MAKER setup file be too inefficient? As far as using either data type, since we are trying to annotate as many genes as possible from our scaffolds, we are looking for a single file to use. For either mRNA transcripts or protein sequences, would using data from ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/ be good reference data for our scaffolds? That directory has both protein files and rna files. I am not sure if a good option would be to concatenate either the rna files or protein files together and use that as the est= or protein= file. Otherwise, is there a better reference source people use for MAKER?
>>>>
>>>> Thanks,
>>>>
>>>> Julian
>>>>
>>>> From: Carson Holt [mailto:carsonhh at gmail.com]
>>>> Sent: Friday, May 15, 2015 10:31 AM
>>>> To: Julian Egger
>>>> Cc: maker-devel at yandell-lab.org
>>>> Subject: Re: [maker-devel] Non-redundant Reference Human EST Data
>>>>
>>>> Hi Julian,
>>>>
>>>> Using human ESTs on primate contigs would be very inefficient. The human genome is already annotated, so you should instead use either the human proteome or human transcripts as input. Using ESTs from a species other than the one being annotated should only be done if there is not a curated annotation set to use instead.
>>>>
>>>> You may be able to just give the human transcripts to the est= option if the two organisms have not diverged too much in nucleotide sequence. Don't use the alt_est option since you have human protein annotations. The alt_est option uses tblastx to seed the alignments, which will not be as accurate as the protein= option that seeds via blastx, and it is about 10 times more expensive computationally.
So it will take a lot longer and you won't get anything that you couldn't have found using the protein data instead.
>>>>
>>>> Also scaffolds shorter than about 10kb will likely be too short to annotate, so you can test out your parameters on just a few of the largest contigs. In addition, don't use SNAP because it performs poorly on primate genomes (use Augustus instead).
>>>>
>>>> Thanks,
>>>> Carson
>>>>
>>>> On May 14, 2015, at 12:17 PM, Julian Egger wrote:
>>>>
>>>> We have assembled scaffolds from genomic reads of a primate sample and would like to annotate as many genes as possible with MAKER. Where is the best place to find an EST file to use with MAKER containing all of the non-redundant reference human ESTs? I was trying to look around NCBI, Ensembl, and UCSC, but am not sure what the ftp site, subdirectory, and file name would be for something like that.
>>>>
>>>> Thanks

From julian.egger at omahazoo.com Tue May 19 13:51:54 2015
From: julian.egger at omahazoo.com (Julian Egger)
Date: Tue, 19 May 2015 19:51:54 +0000
Subject: [maker-devel] Using Augustus with MAKER
Message-ID: <5DA2ECDF8921564C9F789B4DB8E4FE35A807@post.omahazoo.org>

I am trying to use Augustus in MAKER to help with annotating as many genes as possible from genomic reads of a primate sample. I am new to using gene prediction tools such as SNAP and Augustus, but was told Augustus would be better for primates.
I tried using reference mRNAs and protein sequences from NCBI on the sample contig file included with the MAKER software and it ran ok. My question is how do I now use the output to train Augustus iteratively and thus create a file set of annotations from my original input?

After creating the control files with maker -CTL, the only configurations I made to maker_opts.ctl were:

genome=data/hsap_contig.fasta # contig file from example data
est=data/mRNAs.fa # RNAs filtered to just mRNAs
protein=data/protein.fa
est2genome=1
protein2genome=1

I will eventually replace the contig file with our scaffolds file from the assembly. I know the output created a gff file along with protein and mRNA files. Do I then need to change the maker_opts file to account for the new files, and if so, how, and what should the maker_opts file look like now? Was Augustus supposed to be set up on the initial maker run, or do I wait until the second run, after est2genome and protein2genome were used to initialize training for Augustus? And how do the configurations change between multiple iterations because I have a solid annotation set?

Sorry for all the questions, newbie here with a lot of data to work with.

Thanks

From michael.s.campbell1 at gmail.com Tue May 19 15:18:43 2015
From: michael.s.campbell1 at gmail.com (Michael Campbell)
Date: Tue, 19 May 2015 15:18:43 -0600
Subject: [maker-devel] Using Augustus with MAKER
In-Reply-To: <5DA2ECDF8921564C9F789B4DB8E4FE35A807@post.omahazoo.org>
References: <5DA2ECDF8921564C9F789B4DB8E4FE35A807@post.omahazoo.org>
Message-ID:

Hi Julian,

Since you are annotating a primate I would use the pre-trained human parameter for augustus.
Here is what I would try first:

genome=data/hsap_contig.fasta # contig file from example data
est=data/mRNAs.fa # RNAs filtered to just mRNAs
protein=data/protein.fa
est2genome=0
protein2genome=0
augustus_species=human

You could also use one of the mammal HMMs packaged with SNAP, or use the output from the above to train SNAP. There are tutorials that walk through these steps here:

http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Main_Page

There is also a Current Protocols in Bioinformatics article on using MAKER that may help you get started as well.

http://onlinelibrary.wiley.com/doi/10.1002/0471250953.bi0411s48/abstract

Good luck,
Mike

On Tue, May 19, 2015 at 1:51 PM, Julian Egger wrote:
>
> I am trying to use Augustus in MAKER to help with annotating as many genes
> as possible from genomic reads of a primate sample. I am new to using gene
> prediction tools such as SNAP and Augustus, but was told Augustus would be
> better for primates. I tried using reference mRNAs and protein sequences
> from NCBI on the sample contig file included with the MAKER software and it
> ran ok. My question is how do I now use the output to train Augustus
> iteratively and thus create a file set of annotations from my original
> input?
>
> After creating the control files with maker -CTL, the only configurations
> I made to maker_opts.ctl were:
> genome=data/hsap_contig.fasta # contig file from example data
> est=data/mRNAs.fa # RNAs filtered to just mRNAs
> protein=data/protein.fa
> est2genome=1
> protein2genome=1
>
> I will eventually replace the contig file with our scaffolds file from the
> assembly. I know the output created a gff file along with protein and mRNA
> files. Do I then need to change the maker_opts file to account for the new
> files and if so how and what should the maker_opts file look like now?
> Was Augustus supposed to be set up on the initial maker run or do I wait
> until the second run after est2genome and protein2genome were used to
> initialize training for Augustus and how do the configurations change
> between multiple iterations because I have a solid annotation set?
>
> Sorry for all the questions, newbie here with a lot of data to work with.
>
> Thanks

--
Michael Campbell MS, RD.
Doctoral Candidate
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330
ph:585-3543

From carsonhh at gmail.com Tue May 19 15:48:29 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 19 May 2015 15:48:29 -0600
Subject: [maker-devel] Using Augustus with MAKER
In-Reply-To:
References: <5DA2ECDF8921564C9F789B4DB8E4FE35A807@post.omahazoo.org>
Message-ID: <6B295FB1-46C8-44B7-A816-66DF6F45D3E0@gmail.com>

A couple of corrections to the reply below. SNAP doesn't work well on primates, so you probably don't want to use it (the mammal HMM is not a good replacement). This suggestion comes directly from the author of SNAP. There are ways to make it work by splitting the genome into isotigs, but it's a little messy and technical, so just don't use it on primates.

Here's a good website on training Augustus (http://www.molecularevolution.org/molevolfiles/exercises/augustus/training.html). You need some sort of results to train with. You can either use results from a protein2genome run of MAKER or a run where you use human as your species together with other evidence in MAKER (models won't be perfect but will be enough to get training going).
Unless it's really, really close evolutionarily to human, you probably don't want to just stick to the human species file (this is because you're not going to want to use SNAP, so you will need to optimize the one gene predictor you will get to use as much as possible). You need models to be in GenBank format for training. There is a roundabout way to do this with GFF3 models. First use the script that comes with MAKER for training SNAP (maker2zff). Then follow SNAP's instructions on training SNAP (in SNAP's README). Basically the following commands (where the first two files came from maker2zff) -->

fathom genome.ann genome.dna -categorize 1000
fathom uni.ann uni.dna -export 1000 -plus

Then, using this script from Jason Stajich, you can convert the export.ann and export.dna files to a GenBank format file -->
https://github.com/hyphaltip/genome-scripts/blob/master/gene_prediction/zff2augustus_gbk.pl

Go ahead and run with human as your species first, so you can review models and see how models and evidence correlate in a viewer like Apollo or IGV. But I would still recommend training Augustus for your species.

--Carson

> On May 19, 2015, at 3:18 PM, Michael Campbell wrote:
>
> Hi Julian,
>
> Since you are annotating a primate I would use the pre-trained human parameter for augustus. Here is what I would try first
>
> genome=data/hsap_contig.fasta # contig file from example data
> est=data/mRNAs.fa # RNAs filtered to just mRNAs
> protein=data/protein.fa
> est2genome=0
> protein2genome=0
> augustus_species=human
>
> You could also use one of the mammal HMMs packaged with SNAP, or use the output from the above to train SNAP. There are tutorials that walk through these steps here:
>
> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Main_Page
>
> There is also a Current Protocols in Bioinformatics article on using MAKER that may help you get started as well.
>
> http://onlinelibrary.wiley.com/doi/10.1002/0471250953.bi0411s48/abstract
>
> Good luck,
> Mike
>
> On Tue, May 19, 2015 at 1:51 PM, Julian Egger wrote:
>
> I am trying to use Augustus in MAKER to help with annotating as many genes as possible from genomic reads of a primate sample. I am new to using gene prediction tools such as SNAP and Augustus, but was told Augustus would be better for primates. I tried using reference mRNAs and protein sequences from NCBI on the sample contig file included with the MAKER software and it ran ok. My question is how do I now use the output to train Augustus iteratively and thus create a file set of annotations from my original input?
>
> After creating the control files with maker -CTL, the only configurations I made to maker_opts.ctl were:
> genome=data/hsap_contig.fasta # contig file from example data
> est=data/mRNAs.fa # RNAs filtered to just mRNAs
> protein=data/protein.fa
> est2genome=1
> protein2genome=1
>
> I will eventually replace the contig file with our scaffolds file from the assembly. I know the output created a gff file along with protein and mRNA files. Do I then need to change the maker_opts file to account for the new files and if so how and what should the maker_opts file look like now? Was Augustus supposed to be set up on the initial maker run or do I wait until the second run after est2genome and protein2genome were used to initialize training for Augustus and how do the configurations change between multiple iterations because I have a solid annotation set?
>
> Sorry for all the questions, newbie here with a lot of data to work with.
>
> Thanks
>
> --
> Michael Campbell MS, RD.
> Doctoral Candidate
> Eccles Institute of Human Genetics
> University of Utah
> 15 North 2030 East, Room 2100
> Salt Lake City, UT 84112-5330
> ph:585-3543

From jcornel3 at asu.edu Wed May 27 16:57:35 2015
From: jcornel3 at asu.edu (John Cornelius)
Date: Wed, 27 May 2015 15:57:35 -0700
Subject: [maker-devel] Training Augustus
Message-ID:

Hi all,

I'm trying to train Augustus with a non-model organism. I've run MAKER, then trained and run SNAP twice, and would now like to run Augustus on the results as well. I've seen the Augustus page on training the program, and it mentioned needing a list of 200+ quality gene structures for training. Is there a way that I could filter the SNAP results for the highest quality genes to feed into Augustus?

Thanks.

--
John Cornelius
MCB PhD Candidate
Arizona State University

From sjackman at gmail.com Fri May 1 13:34:10 2015
From: sjackman at gmail.com (Shaun Jackman)
Date: Fri, 01 May 2015 19:34:10 +0000
Subject: [maker-devel] Other GFF not passed through
Message-ID:

Hi, Carson.

I'm using other_gff to pass the following four-record GFF file of rRNA annotations through to the final GFF file. The rRNA records appear in the output of gff3_merge -s -n, but not when I add the -g option. Any thoughts why they've gone missing? These rRNA features almost certainly overlap repeat rRNA features discovered by RepeatMasker. Would that cause it? I'm happy to send along whatever data files you'd find useful to reproduce the issue.

Cheers,
Shaun

##gff-version 3
17 barrnap:0.5 rRNA 130358 131272 . - .
Name=16S_rRNA;product=16S ribosomal RNA (partial);note=aligned only 57 percent of the 16S ribosomal RNA
18 barrnap:0.5 rRNA 6238 7489 . - . Name=12S_rRNA;product=12S ribosomal RNA
5 barrnap:0.5 rRNA 7 1246 . - . Name=12S_rRNA;product=12S ribosomal RNA
6 barrnap:0.5 rRNA 449193 450443 . + . Name=12S_rRNA;product=12S ribosomal RNA

From carsonhh at gmail.com Fri May 1 14:22:57 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 1 May 2015 14:22:57 -0600
Subject: [maker-devel] Other GFF not passed through
In-Reply-To:
References:
Message-ID: <9BC05C91-8960-404F-9C9C-C17BCD7C844F@gmail.com>

gff3_merge is expecting to work with maker output, and the -g option specifically looks for maker produced genes (maker source tag). Since you added these lines using the other_gff option, they are in the file, but it doesn't necessarily mean downstream maker tools will know what to do with them, because maker added them blindly without attempting any interpretation/validation, etc. I purposely don't try to make these tools support any GFF3 input possible; it just gets too hairy.

What you can do though is grep the features you want out separately into another GFF3 file, and then you can use gff3_merge to combine those two files.

grep -P '\tbarrnap:0.5\t' infile.gff > barnap.gff
gff3_merge -s maker.gff barnap.gff > new.gff

Thanks,
Carson

> On May 1, 2015, at 1:34 PM, Shaun Jackman wrote:
>
> Hi, Carson.
>
> I'm using other_gff to pass the following four-record GFF file of rRNA annotations through to the final GFF file. The rRNA records appear in the output of gff3_merge -s -n, but not when I add the -g option. Any thoughts why they've gone missing? These rRNA features almost certainly overlap repeat rRNA features discovered by RepeatMasker. Would that cause it? I'm happy to send along whatever data files you'd find useful to reproduce the issue.
>
> Cheers,
> Shaun
>
> ##gff-version 3
> 17 barrnap:0.5 rRNA 130358 131272 . - . Name=16S_rRNA;product=16S ribosomal RNA (partial);note=aligned only 57 percent of the 16S ribosomal RNA
> 18 barrnap:0.5 rRNA 6238 7489 . - . Name=12S_rRNA;product=12S ribosomal RNA
> 5 barrnap:0.5 rRNA 7 1246 . - . Name=12S_rRNA;product=12S ribosomal RNA
> 6 barrnap:0.5 rRNA 449193 450443 . + . Name=12S_rRNA;product=12S ribosomal RNA

From carsonhh at gmail.com Fri May 1 14:54:40 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 1 May 2015 14:54:40 -0600
Subject: [maker-devel] Why would MAKER generate a large gff3?
In-Reply-To:
References:
Message-ID: <1111BFE1-EE7D-4754-BD03-24F5EE66CB1F@gmail.com>

> On May 1, 2015, at 2:53 PM, Carson Holt wrote:
>
> There is probably something weird about the way you are running it. For example did you give it the raw RNA-seq reads instead of the assembled reads?
>
> The total size of the final GFF3 will be all genes + all evidence alignments + the assembly fasta (concatenated at the end per GFF3 format specifications). You can remove the fasta or the evidence alignments from the file using the options found in gff3_merge.
>
> --Carson
>
>> From: John Cornelius
>> Subject: Why would MAKER generate a large gff3?
>> Date: May 1, 2015 at 2:46:45 PM MDT
>> To: maker-devel at yandell-lab.org
>>
>> Hello, I'm using MAKER to generate a new annotation for an organism without an officially published genome (but it does exist and I'm using it). The current annotation is primarily predictive and I'm adding RNA-Seq evidence to improve it.
However, after the initial run and the following two runs with SNAP, the gff3 file generated is 24 GB in size while the old annotation file is only 82 MB. Should it be that large? Also, what is the best way to analyze a new annotation to figure out if it is actually in decent shape? Thanks.
>>
>> --
>> John Cornelius
>> MCB PhD Candidate
>> Arizona State University
>>
>> From: maker-devel-request at yandell-lab.org
>> Subject: confirm d8a6466a8a63fe2312cc2ce7f79739414020644e
>> Date: May 1, 2015 at 2:47:11 PM MDT
>>
>> If you reply to this message, keeping the Subject: header intact,
>> Mailman will discard the held message. Do this if the message is
>> spam. If you reply to this message and include an Approved: header
>> with the list password in it, the message will be approved for posting
>> to the list. The Approved: header can also appear in the first line
>> of the body of the reply.

From muriel.grosb at gmail.com Mon May 4 06:37:26 2015
From: muriel.grosb at gmail.com (Muriel Gros-Balthazard)
Date: Mon, 4 May 2015 14:37:26 +0200
Subject: [maker-devel] Missing files for some contains
Message-ID: <33606C49-1CC8-4909-953F-8B72D54843B9@yahoo.fr>

Hello,

After running Maker, I have many directories since I have many contigs.

Each directory contains these files:
.gff
.maker.augustus.transcripts.fasta
.maker.augustus_masked.proteins.fasta
.maker.augustus.proteins.fasta
.maker.augustus_masked.transcripts.fasta
.maker.transcripts.fasta
.maker.trnascan.transcripts.fasta
.maker.proteins.fasta
.maker.non_overlapping_ab_initio.transcripts.fasta
.maker.non_overlapping_ab_initio.proteins.fasta
run.log
and the directory theVoid.

However, for some contigs, one or several files are missing.
More than 50% of the contig directories are missing the trnascan file.
Some are also missing the two "..non overlapping.." files.
Some miss even more.
However, for all these contigs for which files are missing, I have the gff file and it's written FINISHED in the log file.
So, my questions are:
- why are those files missing?
- is it problematic? Does it mean something didn't work well?
- should I rerun Maker on these contigs?

Thank you!

Muriel

From carsonhh at gmail.com Mon May 4 08:14:36 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 4 May 2015 08:14:36 -0600
Subject: [maker-devel] Missing files for some contains
In-Reply-To: <33606C49-1CC8-4909-953F-8B72D54843B9@yahoo.fr>
References: <33606C49-1CC8-4909-953F-8B72D54843B9@yahoo.fr>
Message-ID:

If there are no tRNAscan results for the contig, then there will be no trnascan fasta. The same is true for each of the other feature types.

--Carson

> On May 4, 2015, at 6:37 AM, Muriel Gros-Balthazard wrote:
>
> Hello,
>
> After running Maker, I have many directories since I have many contigs.
>
> Each directory contains these files:
> .gff
> .maker.augustus.transcripts.fasta
> .maker.augustus_masked.proteins.fasta
> .maker.augustus.proteins.fasta
> .maker.augustus_masked.transcripts.fasta
> .maker.transcripts.fasta
> .maker.trnascan.transcripts.fasta
> .maker.proteins.fasta
> .maker.non_overlapping_ab_initio.transcripts.fasta
> .maker.non_overlapping_ab_initio.proteins.fasta
> run.log
> and the directory theVoid.
>
> However, for some contigs, one or several files are missing.
> More than 50% of the contig directories are missing the trnascan file.
> Some are also missing the two "..non overlapping.." files.
> Some miss even more.
>
> However, for all these contigs for which files are missing, I have the gff file and it's written FINISHED in the log file.
> So, my questions are:
> - why are those files missing?
> - is it problematic? Does it mean something didn't work well?
> - should I rerun Maker on these contigs?
>
> Thank you!
>
> Muriel

From solschech at gmail.com Tue May 5 01:41:18 2015
From: solschech at gmail.com (Sunny Sun)
Date: Tue, 5 May 2015 09:41:18 +0200
Subject: [maker-devel] Fwd: all scaffolds failed
In-Reply-To:
References:
Message-ID:

Hi,

I am trying to annotate with Maker a set of 7k scaffolds with a genome size of 160Mb. The first run returned 10% of scaffolds FAILED and the remaining FINISHED, but I didn't get the protein or transcripts fasta files, so I modified the configuration files accordingly and I am rerunning the analysis. So far, all the scaffolds are failing; in the error.log the only error I see is of this type for all scaffolds:

error in file format, /tmp/I16VH9JkXG/Q3Pi3DDTOZ_mod
ERROR: GeneMark Failed
ERROR: Genemark failed
--> rank=NA, hostname=sol
ERROR: Failed while preparing ab-inits
ERROR: Chunk failed at level:0, tier_type:2
FAILED CONTIG:scaffold_6

ERROR: Chunk failed at level:4, tier_type:0
FAILED CONTIG:scaffold_6

examining contents of the fasta file and run log

which doesn't tell me much. I attached the config files. Can someone see what is wrong?

Thanks

-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_bopts.ctl
Type: application/octet-stream
Size: 1413 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_exe.ctl
Type: application/octet-stream
Size: 1536 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.ctl
Type: application/octet-stream
Size: 4867 bytes
Desc: not available
URL:

From michael.s.campbell1 at gmail.com Tue May 5 09:47:39 2015
From: michael.s.campbell1 at gmail.com (Michael Campbell)
Date: Tue, 5 May 2015 09:47:39 -0600
Subject: [maker-devel] Fwd: all scaffolds failed
In-Reply-To:
References:
Message-ID:

It looks like you are giving the training file from SNAP to GeneMark, and that would cause a failure. Try running it with just SNAP and Augustus and see if that fixes the problem.

Thanks,
Mike

On Tue, May 5, 2015 at 1:41 AM, Sunny Sun wrote:
>
> Hi,
> I am trying to annotate with Maker a set of 7k scaffolds with a genome
> size of 160Mb. The first run returned 10% of scaffolds FAILED and the
> remaining FINISHED but I didn't get the protein or transcripts fasta files
> so I modified the configuration files accordingly and I am rerunning the
> analysis. So far, all the scaffolds are failing, in the error.log the only
> error I see is of this type for all scaffolds:
>
> error in file format, /tmp/I16VH9JkXG/Q3Pi3DDTOZ_mod
> ERROR: GeneMark Failed
> ERROR: Genemark failed
> --> rank=NA, hostname=sol
> ERROR: Failed while preparing ab-inits
> ERROR: Chunk failed at level:0, tier_type:2
> FAILED CONTIG:scaffold_6
>
> ERROR: Chunk failed at level:4, tier_type:0
> FAILED CONTIG:scaffold_6
>
> examining contents of the fasta file and run log
>
> which doesn't tell me much. I attached the config files. Can someone see
> what is wrong?
>
> Thanks

--
Michael Campbell MS, RD.
Doctoral Candidate
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330
ph:585-3543
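[Editor's illustration] The mix-up Mike describes (a SNAP training file handed to GeneMark) could be caught before a long run with a small pre-flight check of maker_opts.ctl. The sketch below is not part of MAKER. It assumes the control file uses `key=value #comment` lines, as in the fragments quoted elsewhere in this archive, and that the SNAP and GeneMark model files are set with keys named `snaphmm` and `gmhmm`; those key names are an assumption here, since the attached control files were scrubbed. The function names and sample values are hypothetical.

```python
def read_ctl(lines):
    """Parse 'key=value #comment' lines into a dict; blank values become ''."""
    options = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if "=" in line:
            key, value = line.split("=", 1)
            options[key.strip()] = value.strip()
    return options


def shared_model_files(options, key_a="snaphmm", key_b="gmhmm"):
    """Return files listed under both keys (values may be comma-separated)."""
    files_a = {f for f in options.get(key_a, "").split(",") if f}
    files_b = {f for f in options.get(key_b, "").split(",") if f}
    return sorted(files_a & files_b)


# Hypothetical control-file fragment, for illustration only.
ctl = [
    "genome=scaffolds.fasta #genome sequence",
    "snaphmm=my_species.hmm #SNAP HMM file (assumed key name)",
    "gmhmm=my_species.hmm #GeneMark HMM file (assumed key name)",
]
clashes = shared_model_files(read_ctl(ctl))
if clashes:
    print("Same file given to SNAP and GeneMark:", ", ".join(clashes))
```

A check like this only flags the identical-path case; a SNAP file copied under another name would still need to be spotted by hand.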
From craig_coleman at byu.edu Fri May 8 15:54:48 2015
From: craig_coleman at byu.edu (Craig Coleman)
Date: Fri, 8 May 2015 21:54:48 +0000
Subject: [maker-devel] creating fasta ids
Message-ID: <1fa61ad0ddbd4401a580d9158921f946@MB10.byu.local>

Hi,

I have a fungal genome that I successfully ran through Maker. Rather than running GeneMark directly in Maker, I ran it separately and generated a gff file. I provided this gff file to Maker and ran SNAP and Augustus as well. The GeneMark gff3 file contained unique IDs for genes that are carried over into the protein and transcript fasta files. These IDs are included in the Maker gff file as the name given for the mRNA feature, but Maker creates a different ID for the GeneMark generated features in the gff file. When I run maker_map_ids to create a mapping file of gene and transcript IDs, the program uses the Maker generated ID instead of the GeneMark generated name. Then when I run map_fasta_ids, the Maker IDs in the map file do not match the names of the GeneMark proteins and transcripts in the fasta file. Protein and transcript models generated by SNAP and Augustus map just fine. I am hoping someone has a suggestion on how to solve this problem.

Craig Coleman

From carsonhh at gmail.com Mon May 11 11:41:10 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 11 May 2015 11:41:10 -0600
Subject: [maker-devel] creating fasta ids
In-Reply-To: <1fa61ad0ddbd4401a580d9158921f946@MB10.byu.local>
References: <1fa61ad0ddbd4401a580d9158921f946@MB10.byu.local>
Message-ID: <37C15F61-1571-4963-B429-CEA424D0E320@gmail.com>

Pass in GeneMark results as pred_gff if you were using model_gff. By using model_gff you are essentially telling MAKER to protect certain information in the Name tags as well as to keep all models from that file regardless of evidence support.
Because of the way you ran things, the maker_map_ids step currently builds new names off of the ID= portion. Name= has a specific protected meaning in GFF3 format, so if your Name= and ID= tags are not identical, the script knows they are user-supplied values and cannot assume they are alterable.

--Carson

> On May 8, 2015, at 3:54 PM, Craig Coleman wrote:
>
> Hi,
> I have a fungal genome that I successfully ran through Maker. Rather than running GeneMark directly in Maker I ran it separately and generated a gff file. I provided this gff file to Maker and ran SNAP and Augustus as well. The GeneMark gff3 file contained unique IDs for genes that are carried over into the protein and transcript fasta files. These IDs are included in the Maker gff file as the name given for the mRNA feature but Maker creates a different ID for the GeneMark generated features in the gff file. When I run maker_map_ids to create a mapping file of gene and transcript IDs, the program uses the Maker generated ID instead of the GeneMark generated name. Then when I run map_fasta_ids the Maker IDs in the map file do not match the names of the GeneMark proteins and transcripts in the fasta file. Protein and transcript models generated by SNAP and Augustus map just fine. I am hoping someone has a suggestion on how to solve this problem.
>
> Craig Coleman
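[Editor's illustration] To see which features fall under the rule Carson describes (a Name= value that differs from ID= is treated as user-supplied), one could scan the GFF3 before running the renaming scripts. This is a minimal sketch, not part of MAKER and not how maker_map_ids is implemented. It assumes only standard nine-column, tab-separated GFF3 with semicolon-separated key=value attributes; the function names and the sample records are hypothetical.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict (no URL-decoding)."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def features_with_custom_names(gff_lines, feature_types=("gene", "mRNA")):
    """Return (ID, Name) pairs where Name is present and differs from ID."""
    mismatches = []
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # sequence section follows; no more feature lines
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] not in feature_types:
            continue
        attrs = parse_attributes(cols[8])
        if "ID" in attrs and "Name" in attrs and attrs["ID"] != attrs["Name"]:
            mismatches.append((attrs["ID"], attrs["Name"]))
    return mismatches


# Hypothetical records, for illustration only (not taken from the thread).
example = [
    "##gff-version 3",
    "ctg1\tmaker\tgene\t100\t900\t.\t+\t.\tID=gene-0.1;Name=gene-0.1",
    "ctg1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=gene-0.1-mRNA-1;Parent=gene-0.1;Name=gm_g42.t1",
]
for feature_id, name in features_with_custom_names(example):
    print(f"{feature_id}\tName={name}")
```

Features listed by such a check are the ones whose Name= values would be left alone, which matches the mismatch Craig saw between the map file and the fasta headers.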
From craig_coleman at byu.edu Tue May 12 13:15:59 2015
From: craig_coleman at byu.edu (Craig Coleman)
Date: Tue, 12 May 2015 19:15:59 +0000
Subject: [maker-devel] creating fasta ids
In-Reply-To: <37C15F61-1571-4963-B429-CEA424D0E320@gmail.com>
References: <1fa61ad0ddbd4401a580d9158921f946@MB10.byu.local> <37C15F61-1571-4963-B429-CEA424D0E320@gmail.com>
Message-ID: <71683e53ac2a481895c7a50925197223@MB10.byu.local>

Thank you. Worked perfectly.

Craig

From: Carson Holt [mailto:carsonhh at gmail.com]
Sent: Monday, May 11, 2015 11:41 AM
To: Craig Coleman
Cc: maker-devel at yandell-lab.org
Subject: Re: [maker-devel] creating fasta ids

Pass in GeneMark results as pred_gff if you were using model_gff. By using model_gff you are essentially telling MAKER to protect certain information in the Name tags as well as to keep all models regardless of evidence support from that file. Because of the way you ran things, when you currently move to the maker_map_ids step it builds new names off of the ID= portion because Name= has a specific protected meaning in GFF3 format, so if your Name= and ID= tags are not identical then the script knows they are user supplied values and cannot assume they are alterable.

--Carson

On May 8, 2015, at 3:54 PM, Craig Coleman wrote:

Hi,
I have a fungal genome that I successfully ran through Maker. Rather than running GeneMark directly in Maker I ran it separately and generated a gff file. I provided this gff file to Maker and ran SNAP and Augustus as well. The GeneMark gff3 file contained unique IDs for genes that are carried over into the protein and transcript fasta files. These IDs are included in the Maker gff file as the name given for the mRNA feature but Maker creates a different ID for the GeneMark generated features in the gff file. When I run maker_map_ids to create a mapping file of gene and transcript IDs, the program uses the Maker generated ID instead of the GeneMark generated name.
Then when I run map_fasta_ids the Maker IDs in the map file do not match the names of the GeneMark proteins and transcripts in the fasta file. Protein and transcript models generated by SNAP and Augustus map just fine. I am hoping someone has a suggestion on how to solve this problem.

Craig Coleman

From chenwenbo1020 at gmail.com Tue May 12 15:56:46 2015
From: chenwenbo1020 at gmail.com (陈文博)
Date: Tue, 12 May 2015 17:56:46 -0400
Subject: [maker-devel] why no prediction
Message-ID:

Hi guys,

I have come across a weird case in which one region of the genome has evidence, but no gene model was generated. Here are the details.

[image]

Color key:
pink: Augustus
light green: SNAP
dark pink: pred_gff
light yellow: cufflinks
darkest pink: EST alignment
dark yellow: protein alignment

In the region marked by the red frame, there are predictions from Augustus and pred_gff, and also evidence from Cufflinks. Why is no gene model generated? I can find the gene model in the "XXXX.all.maker.non_overlapping_ab_initio.transcripts.fasta" file. It is weird because it does have supporting evidence.

Does anyone know the reason?

Thanks very much!

Best,
Wenbo

-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 15364 bytes
Desc: not available
URL:

From carsonhh at gmail.com Tue May 12 16:16:33 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 12 May 2015 16:16:33 -0600
Subject: [maker-devel] why no prediction
In-Reply-To:
References:
Message-ID: <39947BAA-6172-4318-85C2-4E9288B5C700@gmail.com>

The structure of the evidence appears to suggest a spurious prediction randomly overlapped by a spurious EST alignment. You would need at least protein evidence overlap to make it believable. There is heavy discordance among the gene predictors. Also, the fact that the gene would be more than 90% UTR if the EST does in fact represent true expression is a big factor. More likely it's a pseudogene or semi-repetitive region. Not making this a gene was the right call.

--Carson

From carsonhh at gmail.com Tue May 12 16:18:53 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 12 May 2015 16:18:53 -0600
Subject: [maker-devel] why no prediction
In-Reply-To: <39947BAA-6172-4318-85C2-4E9288B5C700@gmail.com>
References: <39947BAA-6172-4318-85C2-4E9288B5C700@gmail.com>
Message-ID: <97DE16F1-4CD0-4E71-BC89-49CFFBDE8CA7@gmail.com>

Also, protein evidence will only be considered as support if it is in the same reading frame as the ab initio prediction. Complete mismatch of reading frames usually suggests a repeat-like region.

--Carson

From myandell at genetics.utah.edu Tue May 12 18:31:33 2015
From: myandell at genetics.utah.edu (Mark Yandell)
Date: Wed, 13 May 2015 00:31:33 +0000
Subject: [maker-devel] why no prediction
In-Reply-To: <97DE16F1-4CD0-4E71-BC89-49CFFBDE8CA7@gmail.com>
References: <39947BAA-6172-4318-85C2-4E9288B5C700@gmail.com> <97DE16F1-4CD0-4E71-BC89-49CFFBDE8CA7@gmail.com>
Message-ID:

And finally, check the splice sites for the EST splice: are they valid GT/AG or AT/AC?

From chenwenbo1020 at gmail.com Tue May 12 19:06:42 2015
From: chenwenbo1020 at gmail.com (陈文博)
Date: Tue, 12 May 2015 21:06:42 -0400
Subject: [maker-devel] why no prediction
In-Reply-To:
References: <39947BAA-6172-4318-85C2-4E9288B5C700@gmail.com> <97DE16F1-4CD0-4E71-BC89-49CFFBDE8CA7@gmail.com>
Message-ID:

Thank you for the help.

I double checked the "EST alignment", and I am sorry: the darkest pink track is the transcript assembled with Trinity, not an EST. The splice sites are GT/AG. The cufflinks and Trinity results suggest that this region could be transcribed. There is an intact ORF in the prediction given by Augustus. Maybe this region is a real gene; however, it was not predicted by Maker.

Thanks,
Wenbo

From carsonhh at gmail.com Tue May 12 20:09:56 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 12 May 2015 20:09:56 -0600
Subject: [maker-devel] why no prediction
In-Reply-To:
References: <39947BAA-6172-4318-85C2-4E9288B5C700@gmail.com> <97DE16F1-4CD0-4E71-BC89-49CFFBDE8CA7@gmail.com>
Message-ID: <0E75FA8B-61F3-4FC2-9B77-7860EA2B14E2@gmail.com>

Hi Wenbo,

You will actually get more gene calls from gene predictors than there are genes (often orders of magnitude more) because workable ORFs are common in a genome. So having a single-exon ORF predicted is not really that noteworthy. You can expect those kinds of predictions to outnumber the true gene count by as much as 10 to 1 in some genomes.

The problem with the region you are showing is that it doesn't look like a gene. Even without a more detailed look at the coordinates and evidence overlap, the image lacks the structure of evidence and prediction concordance that would be expected in a genic region. Without some form of additional evidence, like a good protein match, it is just too much like a lot of spurious overlap regions that you would expect to find randomly throughout a genome. Given this, there is just not enough support to promote the region to being a gene. The predictions are still there in the output for reference purposes, but will not be promoted to a gene because the evidence support is insufficient.
Looking at this region, there are no good gene predictions from SNAP, Augustus, or your pred_gff either (poor concordance). The heavy discordance among the different gene predictors suggests they have not been sufficiently trained. One thing that can affect evidence alignment and gene predictor performance is insufficient masking of repeat elements. You may need to spend some time building a species-specific repeat database using tools like RepeatModeler. Other issues that will have an effect are stretches of N's in the sequence. You will get poor evidence alignments and predictions in what appears to be a large contig if there isn't enough continuous usable sequence. I mention all these factors because the region in question looks spurious and unordered. Lack of concordance in clustering patterns generally means there are other structural issues with the dataset being used.

I've attached an image below to give an example. Notice how, in regions with genes, the different evidence types build on each other and have remarkable concordance (SNAP and Augustus choose very similar exon patterns, for example). Regions without genes still have aligned evidence from Trinity-assembled mRNA-seq and ab initio gene predictors, but they are not concordant, are more spurious in nature, and can be found on both strands. Simple overlap is insufficient to generate a gene call. You have to consider the totality of evidence.

Thanks,
Carson

-------------- next part --------------
A non-text attachment was scrubbed...
Name: PastedGraphic-1.png
Type: image/png
Size: 51295 bytes
Desc: not available
URL:

From julian.egger at omahazoo.com Thu May 14 12:17:50 2015
From: julian.egger at omahazoo.com (Julian Egger)
Date: Thu, 14 May 2015 18:17:50 +0000
Subject: [maker-devel] Non-redundant Reference Human EST Data
Message-ID: <5DA2ECDF8921564C9F789B4DB8E4FE35A687@post.omahazoo.org>

We have assembled scaffolds from genomic reads of a primate sample and would like to annotate as many genes as possible with MAKER. Where is the best place to find an EST file to use with MAKER containing all of the non-redundant reference human ESTs? I was trying to look around NCBI, Ensembl, and UCSC, but I am not sure what the ftp site, subdirectory, and file name would be for something like that.

Thanks

From bertbrutzel at googlemail.com Fri May 15 01:08:42 2015
From: bertbrutzel at googlemail.com (Bert Brutzel)
Date: Fri, 15 May 2015 09:08:42 +0200
Subject: [maker-devel] Genbank submission
Message-ID: <55559B7A.2080906@gmail.com>

Dear All,

how do I reformat MAKER output into valid .tbl files for GenBank submission? I have already tried quite a few routes for formatting the .gff into a .tbl (e.g. sed+awk+GAG...), but I simply run into too many problems. I also tried to load the data into a Chado database, but this took over two weeks and exited with errors.
Maybe someone who has already submitted their MAKER-annotated genome to GenBank can help me... PLEASE.

Thank you very much,
Bert

From robert.king at rothamsted.ac.uk Fri May 15 09:12:34 2015
From: robert.king at rothamsted.ac.uk (Robert King)
Date: Fri, 15 May 2015 15:12:34 +0000
Subject: [maker-devel] Genbank submission
In-Reply-To: <429bc931-e703-411d-ba07-7994c62bc6e9@ROTHEX1.rothamsted.ac.uk>
References: <429bc931-e703-411d-ba07-7994c62bc6e9@ROTHEX1.rothamsted.ac.uk>
Message-ID: <136AB40E0C34CF4FB9AE0DD8C22A8D7B7F484D@rothex1.rothamsted.ac.uk>

Get the GFF file from MAKER and the fasta file ready. I then edit in Geneious and export in EMBL format, but Geneious is paid software, so you may not have it. If you have your final GFF file by whatever means, you can use seqret to convert it as well.

https://www.biostars.org/p/72220/

I have not submitted to NCBI. I submit to ENA, which requires a special header for EMBL format, so the file has to be edited before submitting to them.

Rob

From carsonhh at gmail.com Fri May 15 09:30:47 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 15 May 2015 09:30:47 -0600
Subject: [maker-devel] Non-redundant Reference Human EST Data
In-Reply-To: <5DA2ECDF8921564C9F789B4DB8E4FE35A687@post.omahazoo.org>
References: <5DA2ECDF8921564C9F789B4DB8E4FE35A687@post.omahazoo.org>
Message-ID: <39827E89-0C94-4CC9-B04A-A6FB94E26B36@gmail.com>

Hi Julian,

Using human ESTs on primate contigs would be very inefficient. The human genome is already annotated, so you should instead use either the human proteome or human transcripts as input. Using ESTs from a species other than the one being annotated should only be done if there is not a curated annotation set to use instead.

You may be able to just give the human transcripts to the est= option if the two organisms have not diverged too much in nucleotide sequence. Don't use the alt_est option, since you have human protein annotations. The alt_est option uses tblastx to seed the alignments, which will not be as accurate as the protein= option that seeds via blastx, and it is about 10 times more expensive computationally. So it will take a lot longer and you won't get anything that you couldn't have found using the protein data instead.

Also, scaffolds shorter than about 10kb will likely be too short to annotate, so you can test out your parameters on just a few of the largest contigs. In addition, don't use SNAP, because it performs poorly on primate genomes (use Augustus instead).
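
The suggestion above to test parameters on a few of the largest contigs can be sketched in Python as follows. This is only a minimal illustration, not part of MAKER: the function names, the default of five records, and the 10,000-base cutoff (taken from the "about 10kb" remark) are assumptions made for the example.

```python
def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def largest_contigs(handle, n=5, min_len=10_000):
    """Return the n longest records with at least min_len bases, longest first.

    n and min_len are illustrative defaults, not MAKER settings.
    """
    records = [(h, s) for h, s in read_fasta(handle) if len(s) >= min_len]
    records.sort(key=lambda rec: len(rec[1]), reverse=True)
    return records[:n]


def write_fasta(records, handle, width=60):
    """Write (header, sequence) pairs as FASTA, wrapping sequence lines at width."""
    for header, seq in records:
        handle.write(f">{header}\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")
```

One hypothetical way to use it: open the scaffold FASTA, pass the handle to largest_contigs(), write the result to a small test file with write_fasta(), and point genome= at that file for a trial run before annotating the full assembly.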

Thanks,
Carson

From carsonhh at gmail.com Fri May 15 09:48:20 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 15 May 2015 09:48:20 -0600
Subject: [maker-devel] Genbank submission
In-Reply-To: <55559B7A.2080906@gmail.com>
References: <55559B7A.2080906@gmail.com>
Message-ID: <1248BA7E-FDC5-4CE3-815C-8D6B9B78A33E@gmail.com>

Here is an archived thread on this that might be useful as well -->

https://groups.google.com/forum/#!searchin/maker-devel/genbank/maker-devel/qypkypBXVjs/aJACj38DpxMJ

--Carson

From cjfields at illinois.edu Sat May 16 09:51:49 2015
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Sat, 16 May 2015 15:51:49 +0000
Subject: [maker-devel] Genbank submission
In-Reply-To: <1248BA7E-FDC5-4CE3-815C-8D6B9B78A33E@gmail.com>
References: <55559B7A.2080906@gmail.com> <1248BA7E-FDC5-4CE3-815C-8D6B9B78A33E@gmail.com>
Message-ID: <3A43AEA3-A63A-4AC4-81A3-DA165F048722@illinois.edu>

We've been using GAG (mentioned in this thread), though with some fiddling. I have heard that ENA has a much easier submission process.

chris

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From smg283 at gmail.com Sat May 16 22:11:21 2015
From: smg283 at gmail.com (Scott Geib)
Date: Sat, 16 May 2015 18:11:21 -1000
Subject: [maker-devel] maker-devel Digest, Vol 84, Issue 8
In-Reply-To:
References:
Message-ID:

If anyone has bugs or suggestions for GAG, let us know and we can modify it. Right now we are fixing some bugs and applying it to a new dataset, so it is a good time to add anything people might find useful. Email me or Brian (bhall7 at hawaii.edu).

Thanks,
Scott

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From julian.egger at omahazoo.com Mon May 18 07:53:40 2015
From: julian.egger at omahazoo.com (Julian Egger)
Date: Mon, 18 May 2015 13:53:40 +0000
Subject: [maker-devel] Non-redundant Reference Human EST Data
In-Reply-To: <39827E89-0C94-4CC9-B04A-A6FB94E26B36@gmail.com>
References: <5DA2ECDF8921564C9F789B4DB8E4FE35A687@post.omahazoo.org> <39827E89-0C94-4CC9-B04A-A6FB94E26B36@gmail.com>
Message-ID: <5DA2ECDF8921564C9F789B4DB8E4FE35A730@post.omahazoo.org>

Hi Carson,

Thank you for the response. I had assumed using human transcripts instead of ESTs might be the way to go, but I had read so much about using ESTs for annotation. You said to use either the human proteome or human transcripts. Would using both in the MAKER setup file be too inefficient?

As far as using either data type goes, since we are trying to annotate as many genes as possible from our scaffolds, we are looking for a single file to use. For either mRNA transcripts or protein sequences, would data from ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/ be good reference data for our scaffolds? That directory has both protein files and rna files. I am not sure whether a good option would be to concatenate either the rna files or the protein files together and use that as the est= or protein= file. Otherwise, is there a better reference source people use for MAKER?

Thanks,
Julian

-------------- next part --------------
An HTML attachment was scrubbed...
From dence at genetics.utah.edu Mon May 18 08:38:15 2015
From: dence at genetics.utah.edu (Daniel Ence)
Date: Mon, 18 May 2015 14:38:15 +0000
Subject: [maker-devel] Non-redundant Reference Human EST Data
In-Reply-To: <5DA2ECDF8921564C9F789B4DB8E4FE35A730@post.omahazoo.org>
References: <5DA2ECDF8921564C9F789B4DB8E4FE35A687@post.omahazoo.org> <39827E89-0C94-4CC9-B04A-A6FB94E26B36@gmail.com> <5DA2ECDF8921564C9F789B4DB8E4FE35A730@post.omahazoo.org>
Message-ID: <5B996D8E-8D55-46DF-86BE-1E8FB352CE5F@genetics.utah.edu>

Hi Julian,

The RefSeq NM models would be a good place to start for evidence, since those are curated manually. Don't concatenate the protein and EST files together; putting amino acid sequences in as ESTs, or vice versa, will only give you errors in BLAST.

The number of files you use in your annotation doesn't matter as much as the quantity and breadth of the evidence that you use.

I don't think that putting the RefSeq models in as EST and protein evidence will make a big difference, but you'd have to put the human ESTs in as alt_ests, which takes longer to blast.

I think a good rule of thumb for the protein evidence is to have protein evidence from two genomes that are about the same distance from your target genome and a third set from a genome that's an outgroup to all three genomes. Another good source for protein evidence is the UniRef database.

Let me know if that helps,
Daniel

From carsonhh at gmail.com Mon May 18 09:08:45 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 18 May 2015 09:08:45 -0600
Subject: [maker-devel] Non-redundant Reference Human EST Data
In-Reply-To: <5B996D8E-8D55-46DF-86BE-1E8FB352CE5F@genetics.utah.edu>
References: <5DA2ECDF8921564C9F789B4DB8E4FE35A687@post.omahazoo.org> <39827E89-0C94-4CC9-B04A-A6FB94E26B36@gmail.com> <5DA2ECDF8921564C9F789B4DB8E4FE35A730@post.omahazoo.org> <5B996D8E-8D55-46DF-86BE-1E8FB352CE5F@genetics.utah.edu>
Message-ID: <175D2039-B0C2-43D8-9C12-49F4124E80F4@gmail.com>

If you decide to use human transcripts because it is a closely related primate, put them into the est= option. Like I said in the previous e-mail, they may not align (because of nucleotide divergence), but you don't want to use the alt_est option because you have proteins, which will align better than alt_est and will align much faster. You can use both transcripts and proteins if you want. Don't use human ESTs. There will be no benefit. They contain the same information as the annotated transcripts and proteins but will be noisier. You can download human transcripts and proteins from the RefSeq FTP server. Then add any additional proteomes you choose to use as additional evidence.
Novel genes will only be discoverable if you have ESTs from the species being annotated. But without those, you can still identify orthologs and paralogs from other species.

--Carson

From carsonhh at gmail.com Mon May 18 09:16:59 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 18 May 2015 09:16:59 -0600
Subject: [maker-devel] Non-redundant Reference Human EST Data
In-Reply-To: <175D2039-B0C2-43D8-9C12-49F4124E80F4@gmail.com>
References: <5DA2ECDF8921564C9F789B4DB8E4FE35A687@post.omahazoo.org> <39827E89-0C94-4CC9-B04A-A6FB94E26B36@gmail.com> <5DA2ECDF8921564C9F789B4DB8E4FE35A730@post.omahazoo.org> <5B996D8E-8D55-46DF-86BE-1E8FB352CE5F@genetics.utah.edu> <175D2039-B0C2-43D8-9C12-49F4124E80F4@gmail.com>
Message-ID: <322294B9-1E4F-4D3F-A6D8-C735FFE2A2B5@gmail.com>

Best sources -->
ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/protein/
ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/RNA/

--Carson

From julian.egger at omahazoo.com Mon May 18 09:17:46 2015
From: julian.egger at omahazoo.com (Julian Egger)
Date: Mon, 18 May 2015 15:17:46 +0000
Subject: [maker-devel] Non-redundant Reference Human EST Data
In-Reply-To: <322294B9-1E4F-4D3F-A6D8-C735FFE2A2B5@gmail.com>
References: <5DA2ECDF8921564C9F789B4DB8E4FE35A687@post.omahazoo.org> <39827E89-0C94-4CC9-B04A-A6FB94E26B36@gmail.com> <5DA2ECDF8921564C9F789B4DB8E4FE35A730@post.omahazoo.org> <5B996D8E-8D55-46DF-86BE-1E8FB352CE5F@genetics.utah.edu> <175D2039-B0C2-43D8-9C12-49F4124E80F4@gmail.com>, <322294B9-1E4F-4D3F-A6D8-C735FFE2A2B5@gmail.com>
Message-ID: <5DA2ECDF8921564C9F789B4DB8E4FE35A763@post.omahazoo.org>

Ok great. I had looked at those files as well. So I could set both est=rna.fa and protein=protein.fa with files from
ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/protein/
ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/RNA/ ?

Would that be worthwhile or would it just slow things up to a point where it wouldn't be worth it?
Thanks again,
Julian
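Since the thread warns that amino acid sequence given as EST evidence (or nucleotide sequence given as protein evidence) only produces BLAST errors, a quick sanity check before filling in est= and protein= may be worthwhile. The following is a minimal sketch under stated assumptions: the helper names and the 90% threshold are hypothetical and are not part of MAKER, and the heuristic simply measures how much of a sequence is made of nucleotide letters.

```python
# Letters expected in nucleotide FASTA records (assumption: A, C, G, T, U, N only).
NUCLEOTIDE_CHARS = set("ACGTUNacgtun")


def guess_alphabet(sequence: str, threshold: float = 0.9) -> str:
    """Return 'nucleotide' if at least `threshold` of the letters are
    nucleotide codes, otherwise 'protein'. The threshold is a heuristic."""
    letters = [c for c in sequence if c.isalpha()]
    if not letters:
        raise ValueError("sequence contains no letters")
    fraction = sum(c in NUCLEOTIDE_CHARS for c in letters) / len(letters)
    return "nucleotide" if fraction >= threshold else "protein"


def evidence_options(est_sample: str, protein_sample: str,
                     est_path: str, protein_path: str) -> list:
    """Build the est= and protein= lines for maker_opts.ctl, refusing to do so
    if the sampled sequences look like the wrong molecule type."""
    if guess_alphabet(est_sample) != "nucleotide":
        raise ValueError(f"{est_path} does not look like nucleotide sequence")
    if guess_alphabet(protein_sample) != "protein":
        raise ValueError(f"{protein_path} does not look like protein sequence")
    return [f"est={est_path}", f"protein={protein_path}"]
```

Here est_sample and protein_sample would be the first sequence (or first few hundred residues) read from each file; rna.fa and protein.fa are the file names used in the message above.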
From carsonhh at gmail.com Mon May 18 09:31:36 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 18 May 2015 09:31:36 -0600
Subject: [maker-devel] Non-redundant Reference Human EST Data
In-Reply-To: <5DA2ECDF8921564C9F789B4DB8E4FE35A763@post.omahazoo.org>
References: <5DA2ECDF8921564C9F789B4DB8E4FE35A687@post.omahazoo.org> <39827E89-0C94-4CC9-B04A-A6FB94E26B36@gmail.com> <5DA2ECDF8921564C9F789B4DB8E4FE35A730@post.omahazoo.org> <5B996D8E-8D55-46DF-86BE-1E8FB352CE5F@genetics.utah.edu> <175D2039-B0C2-43D8-9C12-49F4124E80F4@gmail.com> <322294B9-1E4F-4D3F-A6D8-C735FFE2A2B5@gmail.com> <5DA2ECDF8921564C9F789B4DB8E4FE35A763@post.omahazoo.org>
Message-ID: <36E62C76-8F05-4BA5-8CEE-91E68A08FB79@gmail.com>

You have to have protein evidence from some source, preferably from at least two somewhat related organisms. Proteins take a while to align (amino acid alignment is computationally intensive); ESTs not so much.

--Carson

From julian.egger at omahazoo.com Tue May 19 13:51:54 2015
From: julian.egger at omahazoo.com (Julian Egger)
Date: Tue, 19 May 2015 19:51:54 +0000
Subject: [maker-devel] Using Augustus with MAKER
Message-ID: <5DA2ECDF8921564C9F789B4DB8E4FE35A807@post.omahazoo.org>

I am trying to use Augustus in MAKER to help with annotating as many genes as possible from genomic reads of a primate sample. I am new to using gene prediction tools such as SNAP and Augustus, but was told Augustus would be better for primates.
I tried using reference mRNAs and protein sequences from NCBI on the sample contig file included with the MAKER software and it ran ok. My question is how do I now use the output to train Augustus iteratively and thus create a file set of annotations from my original input?

After creating the control files with maker -CTL, the only configurations I made to maker_opts.ctl were:

genome=data/hsap_contig.fasta # contig file from example data
est=data/mRNAs.fa # RNAs filtered to just mRNAs
protein=data/protein.fa
est2genome=1
protein2genome=1

I will eventually replace the contig file with our scaffolds file from the assembly. I know the output created a GFF file along with protein and mRNA files. Do I then need to change the maker_opts file to account for the new files, and if so, how, and what should the maker_opts file look like now? Was Augustus supposed to be set up on the initial MAKER run, or do I wait until the second run, after est2genome and protein2genome were used to initialize training for Augustus? And how do the configurations change between multiple iterations, given that I have a solid annotation set?

Sorry for all the questions, newbie here with a lot of data to work with.

Thanks

From michael.s.campbell1 at gmail.com Tue May 19 15:18:43 2015
From: michael.s.campbell1 at gmail.com (Michael Campbell)
Date: Tue, 19 May 2015 15:18:43 -0600
Subject: [maker-devel] Using Augustus with MAKER
In-Reply-To: <5DA2ECDF8921564C9F789B4DB8E4FE35A807@post.omahazoo.org>
References: <5DA2ECDF8921564C9F789B4DB8E4FE35A807@post.omahazoo.org>
Message-ID:

Hi Julian,

Since you are annotating a primate, I would use the pre-trained human parameters for Augustus.
Here is what I would try first genome=data/hsap_contig.fasta # contig file from example data est=data/mRNAs.fa # RNAs filtered to just mRNAs protein=data/protein.fa est2genome=0 protein2genome=0 augustus_species=human You could also use one of the mammal HMMs packaged with SNAP as well, or use the output from the above to train SNAP. There are tutorial that walk through these steps here: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Main_Page There is also a current protocols in bioinformatics article for using MAKER can may help you get started as well. http://onlinelibrary.wiley.com/doi/10.1002/0471250953.bi0411s48/abstract Good luck, Mike On Tue, May 19, 2015 at 1:51 PM, Julian Egger wrote: > > I am trying to use Augustus in MAKER to help with annotating as many genes > as possible from genomic reads of a primate sample. I am new to using gene > prediction tools such as SNAP and Augustus, but was told Augustus would be > better for primates. I tried using reference mRNAs and protein sequences > from NCBI on the sample contig file included with the MAKER software and it > ran ok. My question is how do I now use the output to train Augustus > iteratively and thus create a file set of annotations from my original > input? > > After creating the control files with maker -CTL, the only configurations > I made to maker_opts.ctl were: > genome=data/hsap_contig.fasta # contig file from example data > est=data/mRNAs.fa # RNAs filtered to just mRNAs > protein=data/protein.fa > est2genome=1 > protein2genome=1 > > I will eventually replace the contig file with our scaffolds file from the > assembly. I know the output created a gff file along with protein and mRNA > files. Do I then need to change the maker_opts file to account for the new > files and if so how and what should the maker__opts file look like now? 
> Was Augustus supposed to be set up on the initial maker run or do I wait
> until the second run after est2genome and protein2genome were used to
> initialize training for Augustus and how do the configurations change
> between multiple iterations because I have a solid annotation set?
>
> Sorry for all the questions, newbie here with a lot of data to work with.
>
> Thanks
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>
>

-- 
Michael Campbell MS, RD.
Doctoral Candidate
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330
ph:585-3543
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From carsonhh at gmail.com Tue May 19 15:48:29 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 19 May 2015 15:48:29 -0600
Subject: [maker-devel] Using Augustus with MAKER
In-Reply-To: 
References: <5DA2ECDF8921564C9F789B4DB8E4FE35A807@post.omahazoo.org>
Message-ID: <6B295FB1-46C8-44B7-A816-66DF6F45D3E0@gmail.com>

A couple of corrections from the reply below.

SNAP doesn't work well on primates, so you probably don't want to use it (the mammal hmm is not a good replacement). This suggestion comes directly from the author of SNAP. There are ways to make it work by splitting the genome into isotigs but it's a little messy and technical, so just don't use it on primates.

Here's a good website on training Augustus (http://www.molecularevolution.org/molevolfiles/exercises/augustus/training.html).

You need some sort of results to train with. You can either use results from a protein2genome run of MAKER or a run where you use human as your species together with other evidence in MAKER (models won't be perfect but will be enough to get training going).
Unless it's really, really close evolutionarily to human, you probably don't want to just stick to the human species file (this is because you're not going to want to use SNAP, so you will need to optimize the one gene predictor you will get to use as much as possible).

You need models to be in GenBank format for training. There is a roundabout way to do this with GFF3 models. First use the script that comes with MAKER for training SNAP (maker2zff). Then follow SNAP's training instructions (in SNAP's README). Basically the following commands (where the first two files came from maker2zff):

fathom genome.ann genome.dna -categorize 1000
fathom uni.ann uni.dna -export 1000 -plus

Then, using this script from Jason Stajich, you can convert the export.ann and export.dna files to a GenBank format file:

https://github.com/hyphaltip/genome-scripts/blob/master/gene_prediction/zff2augustus_gbk.pl

Go ahead and run with human as your species first, so you can review models and see how models and evidence correlate in a viewer like Apollo or IGV. But I still would recommend training Augustus to your species.

--Carson

> On May 19, 2015, at 3:18 PM, Michael Campbell wrote:
>
> Hi Julian,
>
> Since you are annotating a primate I would use the pre-trained human parameter for augustus. Here is what I would try first
>
> genome=data/hsap_contig.fasta # contig file from example data
> est=data/mRNAs.fa # RNAs filtered to just mRNAs
> protein=data/protein.fa
> est2genome=0
> protein2genome=0
> augustus_species=human
>
> You could also use one of the mammal HMMs packaged with SNAP as well, or use the output from the above to train SNAP. There are tutorial that walk through these steps here:
>
> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Main_Page
>
> There is also a current protocols in bioinformatics article for using MAKER can may help you get started as well.
> > http://onlinelibrary.wiley.com/doi/10.1002/0471250953.bi0411s48/abstract > > Good luck, > Mike > > On Tue, May 19, 2015 at 1:51 PM, Julian Egger > wrote: > > I am trying to use Augustus in MAKER to help with annotating as many genes as possible from genomic reads of a primate sample. I am new to using gene prediction tools such as SNAP and Augustus, but was told Augustus would be better for primates. I tried using reference mRNAs and protein sequences from NCBI on the sample contig file included with the MAKER software and it ran ok. My question is how do I now use the output to train Augustus iteratively and thus create a file set of annotations from my original input? > > After creating the control files with maker -CTL, the only configurations I made to maker_opts.ctl were: > genome=data/hsap_contig.fasta # contig file from example data > est=data/mRNAs.fa # RNAs filtered to just mRNAs > protein=data/protein.fa > est2genome=1 > protein2genome=1 > > I will eventually replace the contig file with our scaffolds file from the assembly. I know the output created a gff file along with protein and mRNA files. Do I then need to change the maker_opts file to account for the new files and if so how and what should the maker__opts file look like now? Was Augustus supposed to be set up on the initial maker run or do I wait until the second run after est2genome and protein2genome were used to initialize training for Augustus and how do the configurations change between multiple iterations because I have a solid annotation set? > > Sorry for all the questions, newbie here with a lot of data to work with. > > Thanks > > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > > > > -- > Michael Campbell MS, RD. 
> Doctoral Candidate > Eccles Institute of Human Genetics > University of Utah > 15 North 2030 East, Room 2100 > Salt Lake City, UT 84112-5330 > ph:585-3543 > > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From jcornel3 at asu.edu Wed May 27 16:57:35 2015 From: jcornel3 at asu.edu (John Cornelius) Date: Wed, 27 May 2015 15:57:35 -0700 Subject: [maker-devel] Training Augustus Message-ID: Hi all, I'm trying to train augustus with a non-model organism, I've run Maker, then trained and run SNAP twice and would now like to run Augustus on the results as well. I've seen the Augustus page on training the program and it mentioned needing a list of 200+ quality gene structures for training, is there a way that I could filter the SNAP results for the highest quality genes to feed into augustus? Thanks. -- John Cornelius MCB PhD Candidate Arizona State University -------------- next part -------------- An HTML attachment was scrubbed... URL:
From carsonhh at gmail.com Fri May 1 14:54:40 2015 From: carsonhh at gmail.com (Carson Holt) Date: Fri, 1 May 2015 14:54:40 -0600 Subject: [maker-devel] Why would MAKER generate a large gff3? In-Reply-To: References: Message-ID: <1111BFE1-EE7D-4754-BD03-24F5EE66CB1F@gmail.com> > On May 1, 2015, at 2:53 PM, Carson Holt wrote: > > There is probably something weird about the way you are running it. For example did you give it the raw RNA-seq reads instead of the assembled reads? > > The total size of the final GFF3 will be all genes + all evidence alignments + the assembly fasta (concatenated at the end per GFF3 format specifications). You can remove the fasta or the evidence alignments from the file using the options found in gff3_merge. > > ?Carson > >> >> From: John Cornelius > >> Subject: Why would MAKER generate a large gff3? >> Date: May 1, 2015 at 2:46:45 PM MDT >> To: maker-devel at yandell-lab.org >> >> >> Hello, I'm using MAKER to generate a new annotation for an organism without an officially published genome (but it does exist and I'm using it). The current annotation is primarily predictive and I'm adding RNA-Seq evidence to improve it.
However, after the initial run and the following two runs with SNAP, the gff3 file generated is 24 GB in size while the old annotation file is only 82 Mb. Should it be that large? Also, what is the best way to analyze a new annotation to figure out if it is actually in decent shape? Thanks.
>>
>> --
>> John Cornelius
>> MCB PhD Candidate
>> Arizona State University
>>
>>
>>
>> From: maker-devel-request at yandell-lab.org
>> Subject: confirm d8a6466a8a63fe2312cc2ce7f79739414020644e
>> Date: May 1, 2015 at 2:47:11 PM MDT
>>
>>
>> If you reply to this message, keeping the Subject: header intact,
>> Mailman will discard the held message. Do this if the message is
>> spam. If you reply to this message and include an Approved: header
>> with the list password in it, the message will be approved for posting
>> to the list. The Approved: header can also appear in the first line
>> of the body of the reply.
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From muriel.grosb at gmail.com Mon May 4 06:37:26 2015
From: muriel.grosb at gmail.com (Muriel Gros-Balthazard)
Date: Mon, 4 May 2015 14:37:26 +0200
Subject: [maker-devel] Missing files for some contains
Message-ID: <33606C49-1CC8-4909-953F-8B72D54843B9@yahoo.fr>

Hello,

After running Maker, I have many directories since I have many contigs.

Each directory contains these files:
.gff
.maker.augustus.transcripts.fasta
.maker.augustus_masked.proteins.fasta
.maker.augustus.proteins.fasta
.maker.augustus_masked.transcripts.fasta
.maker.transcripts.fasta
.maker.trnascan.transcripts.fasta
.maker.proteins.fasta
.maker.non_overlapping_ab_initio.transcripts.fasta
.maker.non_overlapping_ab_initio.proteins.fasta
run.log
and the directory theVoid.

However, for some contigs, one or several files are missing.
More than 50% of the contig directories are missing the trnascan file.
Some are also missing the two "..non overlapping.." files.
Some miss even more.
However, for all these contigs for which files are missing, I have the gff file and it's written FINISHED in the log file.
So, my questions are:
- why are those files missing?
- is it problematic? Does it mean something didn't work well?
- should I rerun Maker on these contigs?

Thank you!

Muriel
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From carsonhh at gmail.com Mon May 4 08:14:36 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 4 May 2015 08:14:36 -0600
Subject: [maker-devel] Missing files for some contains
In-Reply-To: <33606C49-1CC8-4909-953F-8B72D54843B9@yahoo.fr>
References: <33606C49-1CC8-4909-953F-8B72D54843B9@yahoo.fr>
Message-ID: 

If there are no tRNAscan results for the contig then there will be no trnascan fasta. The same is true for each of the other feature types.

--Carson

> On May 4, 2015, at 6:37 AM, Muriel Gros-Balthazard wrote:
>
> Hello,
>
> After running Maker, I have many directories since I have many contigs.
>
> Each directory contains these files :
> .gff
> .maker.augustus.transcripts.fasta
> .maker.augustus_masked.proteins.fasta
> .maker.augustus.proteins.fasta
> .maker.augustus_masked.transcripts.fasta
> .maker.transcripts.fasta
> .maker.trnascan.transcripts.fasta
> .maker.proteins.fasta
> .maker.non_overlapping_ab_initio.transcripts.fasta
> .maker.non_overlapping_ab_initio.proteins.fasta
> run.log
> and the directory theVoid.
>
> However, for some contigs, one or several files are missing.
> I have more than 50% of contig directory missing the trnascan file.
> Some are missing also the two files "..non overlapping.."
> Some miss even more.
>
> However, for all these contigs for which files are missing, I have the gff file and it's written FINISHED in the log file.
> So, my questions are :
> - why are those files missing ?
> - is it problematic ? Does it mean something didn't work well ?
> - should I rerun Maker on these contigs ?
>
> Thank you !
> > Muriel > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From solschech at gmail.com Tue May 5 01:41:18 2015 From: solschech at gmail.com (Sunny Sun) Date: Tue, 5 May 2015 09:41:18 +0200 Subject: [maker-devel] Fwd: all scaffolds failed In-Reply-To: References: Message-ID: Hi, I am trying to annotate with Maker a set of 7k scaffolds with a genome size of 160Mb. The first run returned 10% of scaffolds FAILED and the remaining FINISHED but I didn't get the protein or transcripts fasta files so I modified the configuration files accordingly and I am rerunning the analysis. So far, all the scaffolds are failing, in the error.log the only error I see is of this type for all scaffolds: error in file format, /tmp/I16VH9JkXG/Q3Pi3DDTOZ_mod ERROR: GeneMark Failed ERROR: Genemark failed --> rank=NA, hostname=sol ERROR: Failed while preparing ab-inits ERROR: Chunk failed at level:0, tier_type:2 FAILED CONTIG:scaffold_6 ERROR: Chunk failed at level:4, tier_type:0 FAILED CONTIG:scaffold_6 examining contents of the fasta file and run log which doesn't tell me much. I attached the config files. Can someone see what is wrong? Thanks -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: maker_bopts.ctl Type: application/octet-stream Size: 1413 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: maker_exe.ctl Type: application/octet-stream Size: 1536 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: maker_opts.ctl Type: application/octet-stream Size: 4867 bytes Desc: not available URL: From michael.s.campbell1 at gmail.com Tue May 5 09:47:39 2015 From: michael.s.campbell1 at gmail.com (Michael Campbell) Date: Tue, 5 May 2015 09:47:39 -0600 Subject: [maker-devel] Fwd: all scaffolds failed In-Reply-To: References: Message-ID: It looks like you are giving the training file from SNAP to genemark, and that would cause a failure. Try running it with just SNAP and augustus and see if that fixes the problem. Thanks, Mike On Tue, May 5, 2015 at 1:41 AM, Sunny Sun wrote: > > Hi, > I am trying to annotate with Maker a set of 7k scaffolds with a genome > size of 160Mb. The first run returned 10% of scaffolds FAILED and the > remaining FINISHED but I didn't get the protein or transcripts fasta files > so I modified the configuration files accordingly and I am rerunning the > analysis. So far, all the scaffolds are failing, in the error.log the only > error I see is of this type for all scaffolds: > > error in file format, /tmp/I16VH9JkXG/Q3Pi3DDTOZ_mod > ERROR: GeneMark Failed > ERROR: Genemark failed > --> rank=NA, hostname=sol > ERROR: Failed while preparing ab-inits > ERROR: Chunk failed at level:0, tier_type:2 > FAILED CONTIG:scaffold_6 > > ERROR: Chunk failed at level:4, tier_type:0 > FAILED CONTIG:scaffold_6 > > examining contents of the fasta file and run log > > which doesn't tell me much. I attached the config files. Can someone see > what is wrong? > > Thanks > > > > > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > -- Michael Campbell MS, RD. Doctoral Candidate Eccles Institute of Human Genetics University of Utah 15 North 2030 East, Room 2100 Salt Lake City, UT 84112-5330 ph:585-3543 -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From craig_coleman at byu.edu Fri May 8 15:54:48 2015 From: craig_coleman at byu.edu (Craig Coleman) Date: Fri, 8 May 2015 21:54:48 +0000 Subject: [maker-devel] creating fasta ids Message-ID: <1fa61ad0ddbd4401a580d9158921f946@MB10.byu.local> Hi, I have a fungal genome that I successfully ran through Maker. Rather than running GeneMark directly in Maker I ran it separately and generated a gff file. I provided this gff file to Maker and ran SNAP and Augustus as well. The GeneMark gff3 file contained unique IDs for genes that are carried over into the protein and transcript fasta files. These IDs are included in the Maker gff file as the name given for the mRNA feature but Maker creates a different ID for the GeneMark generated features in the gff file. When I run maker_map_ids to create a mapping file of gene and transcript IDs, the program uses the Maker generated ID instead of the GeneMark generated name. Then when I run map_fasta_ids the Maker IDs in the map file do not match the names of the GeneMark proteins and transcripts in the fasta file. Protein and transcript models generated by SNAP and Augustus map just fine. I am hoping someone has a suggestion on how to solve this problem. Craig Coleman -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Mon May 11 11:41:10 2015 From: carsonhh at gmail.com (Carson Holt) Date: Mon, 11 May 2015 11:41:10 -0600 Subject: [maker-devel] creating fasta ids In-Reply-To: <1fa61ad0ddbd4401a580d9158921f946@MB10.byu.local> References: <1fa61ad0ddbd4401a580d9158921f946@MB10.byu.local> Message-ID: <37C15F61-1571-4963-B429-CEA424D0E320@gmail.com> Pass in GeneMark results as pred_gff if you were using model_gff. By using model_gff you are essentially telling MAKER to protect certain information in the Name tags as well as to keep all models regardless of evidence support from that file. 
Because of the way you ran things, the maker_map_ids step currently builds new names off of the ID= portion. Name= has a specific protected meaning in GFF3 format, so if your Name= and ID= tags are not identical, the script treats them as user-supplied values and cannot assume they are alterable.

--Carson

> On May 8, 2015, at 3:54 PM, Craig Coleman wrote:
>
> Hi,
> I have a fungal genome that I successfully ran through Maker. Rather than running GeneMark directly in Maker I ran it separately and generated a gff file. I provided this gff file to Maker and ran SNAP and Augustus as well. The GeneMark gff3 file contained unique IDs for genes that are carried over into the protein and transcript fasta files. These IDs are included in the Maker gff file as the name given for the mRNA feature but Maker creates a different ID for the GeneMark generated features in the gff file. When I run maker_map_ids to create a mapping file of gene and transcript IDs, the program uses the Maker generated ID instead of the GeneMark generated name. Then when I run map_fasta_ids the Maker IDs in the map file do not match the names of the GeneMark proteins and transcripts in the fasta file. Protein and transcript models generated by SNAP and Augustus map just fine. I am hoping someone has a suggestion on how to solve this problem.
>
> Craig Coleman
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From craig_coleman at byu.edu Tue May 12 13:15:59 2015 From: craig_coleman at byu.edu (Craig Coleman) Date: Tue, 12 May 2015 19:15:59 +0000 Subject: [maker-devel] creating fasta ids In-Reply-To: <37C15F61-1571-4963-B429-CEA424D0E320@gmail.com> References: <1fa61ad0ddbd4401a580d9158921f946@MB10.byu.local> <37C15F61-1571-4963-B429-CEA424D0E320@gmail.com> Message-ID: <71683e53ac2a481895c7a50925197223@MB10.byu.local> Thank you. Worked perfectly. Craig From: Carson Holt [mailto:carsonhh at gmail.com] Sent: Monday, May 11, 2015 11:41 AM To: Craig Coleman Cc: maker-devel at yandell-lab.org Subject: Re: [maker-devel] creating fasta ids Pass in GeneMark results as pred_gff if you were using model_gff. By using model_gff you are essentially telling MAKER to protect certain information in the Name tags as well as to keep all models regardless of evidence support from that file. Because of the way you ran things, when you currently move to the maker_map_ids step it builds new names off of the ID= portion because Name= has a specific protected meaning in GFF3 format, so if your Name= and ID= tags are not identical then the script knows they are user supplied values and cannot assume they are alterable. ?Carson On May 8, 2015, at 3:54 PM, Craig Coleman > wrote: Hi, I have a fungal genome that I successfully ran through Maker. Rather than running GeneMark directly in Maker I ran it separately and generated a gff file. I provided this gff file to Maker and ran SNAP and Augustus as well. The GeneMark gff3 file contained unique IDs for genes that are carried over into the protein and transcript fasta files. These IDs are included in the Maker gff file as the name given for the mRNA feature but Maker creates a different ID for the GeneMark generated features in the gff file. When I run maker_map_ids to create a mapping file of gene and transcript IDs, the program uses the Maker generated ID instead of the GeneMark generated name. 
Then when I run map_fasta_ids the Maker IDs in the map file do not match the names of the GeneMark proteins and transcripts in the fasta file. Protein and transcript models generated by SNAP and Augustus map just fine. I am hoping someone has a suggestion on how to solve this problem.

Craig Coleman

_______________________________________________
maker-devel mailing list
maker-devel at box290.bluehost.com
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From chenwenbo1020 at gmail.com Tue May 12 15:56:46 2015
From: chenwenbo1020 at gmail.com (陈文博)
Date: Tue, 12 May 2015 17:56:46 -0400
Subject: [maker-devel] why no prediction
Message-ID: 

Hi guys,

I have a weird case: one region in the genome has evidence, but no gene prediction was generated. Here are the details.

[image 2]

The colors mean:
pink: Augustus
light green: SNAP
dark pink: pred_gff
light yellow: cufflinks
darkest pink: EST alignment
dark yellow: protein alignment

In the region marked by the red frame, there are predictions from Augustus and pred_gff, and also evidence from cufflinks, so why was no gene model generated? I could find the gene model in the "XXXX.all.maker.non_overlapping_ab_initio.transcripts.fasta" file. It is weird because it did have evidence support.

Does anyone know the reason?

Thanks very much!

Best,
Wenbo
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 15364 bytes
Desc: not available
URL: 

From carsonhh at gmail.com Tue May 12 16:16:33 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 12 May 2015 16:16:33 -0600
Subject: [maker-devel] why no prediction
In-Reply-To: 
References: 
Message-ID: <39947BAA-6172-4318-85C2-4E9288B5C700@gmail.com>

The structure of the evidence appears to suggest a spurious prediction randomly overlapped by a spurious EST alignment. You would need at least protein evidence overlap to make it believable. There is heavy discordance among the gene predictors. Also, the fact that the gene would be 90%-plus UTR if the EST does in fact represent true expression is a big factor. More likely it's a pseudogene or semi-repetitive region. Not making this a gene was the right call.

--Carson

> On May 12, 2015, at 3:56 PM, 陈文博 wrote:
>
> Hi guys,
>
> I come with a weird case that one region in genome has evidence, but no gene prediction generated. Here are the detail.
>
>
>
> color means:
> pink: Augustus
> light green: SNAP
> dark pink: pred_gff
> light yellow: cufflinks
> darkest pink: EST alignment
> dark yellow: protein alignment
>
> In the region marked by red frame, there are predictions from Augustus and pred_gff, also evidences from cufflinks, why there is no gene model generated? I could find the gene model in the "XXXX.all.maker.non_overlapping_ab_initio.transcripts.fasta" file. It is weird because it did have evidence supported.
>
> Could anyone know the reason?
>
> Thanks very much!
>
> Best,
> Wenbo
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From carsonhh at gmail.com Tue May 12 16:18:53 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 12 May 2015 16:18:53 -0600
Subject: [maker-devel] why no prediction
In-Reply-To: <39947BAA-6172-4318-85C2-4E9288B5C700@gmail.com>
References: <39947BAA-6172-4318-85C2-4E9288B5C700@gmail.com>
Message-ID: <97DE16F1-4CD0-4E71-BC89-49CFFBDE8CA7@gmail.com>

Also, protein evidence will only be considered as support if it is in the same reading frame as the ab initio prediction. Complete mismatch of reading frames usually suggests a repeat-like region.

--Carson

> On May 12, 2015, at 4:16 PM, Carson Holt wrote:
>
> The structure of the evidence appears to suggest a spurious prediction randomly overlapped by a spurious EST alignment. You would need at least protein evidence overlap to make it believable. There is heavy discordance among the gene predictors. Also the fact that the gene would be 90% plus UTR if the EST does in fact represent true expression is a big factor. More likely it's a pseudogene or semi repetitive region. Not making this a gene was the right call.
>
> --Carson
>
>
>
>> On May 12, 2015, at 3:56 PM, 陈文博 wrote:
>>
>> Hi guys,
>>
>> I come with a weird case that one region in genome has evidence, but no gene prediction generated. Here are the detail.
>>
>>
>>
>> color means:
>> pink: Augustus
>> light green: SNAP
>> dark pink: pred_gff
>> light yellow: cufflinks
>> darkest pink: EST alignment
>> dark yellow: protein alignment
>>
>> In the region marked by red frame, there are predictions from Augustus and pred_gff, also evidences from cufflinks, why there is no gene model generated? I could find the gene model in the "XXXX.all.maker.non_overlapping_ab_initio.transcripts.fasta" file. It is weird because it did have evidence supported.
>>
>> Could anyone know the reason?
>>
>> Thanks very much!
>>
>> Best,
>> Wenbo
>> _______________________________________________
>> maker-devel mailing list
>> maker-devel at box290.bluehost.com
>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>

From myandell at genetics.utah.edu Tue May 12 18:31:33 2015
From: myandell at genetics.utah.edu (Mark Yandell)
Date: Wed, 13 May 2015 00:31:33 +0000
Subject: [maker-devel] why no prediction
In-Reply-To: <97DE16F1-4CD0-4E71-BC89-49CFFBDE8CA7@gmail.com>
References: <39947BAA-6172-4318-85C2-4E9288B5C700@gmail.com> <97DE16F1-4CD0-4E71-BC89-49CFFBDE8CA7@gmail.com>
Message-ID: 

And finally, check the splice sites for the EST: are they valid GT/AG or AT/AC?

On May 12, 2015, at 4:18 PM, Carson Holt wrote:

> Also protein evidence will only be considered as support if it is in the same reading frame as the ab initio prediction. Complete mismatch of reading frames usually suggests a repeat like region.
>
> --Carson
>
>
>> On May 12, 2015, at 4:16 PM, Carson Holt wrote:
>>
>> The structure of the evidence appears to suggest a spurious prediction randomly overlapped by a spurious EST alignment. You would need at least protein evidence overlap to make it believable. There is heavy discordance among the gene predictors. Also the fact that the gene would be 90% plus UTR if the EST does in fact represent true expression is a big factor. More likely it's a pseudogene or semi repetitive region. Not making this a gene was the right call.
>>
>> --Carson
>>
>>
>>
>>> On May 12, 2015, at 3:56 PM, 陈文博 wrote:
>>>
>>> Hi guys,
>>>
>>> I come with a weird case that one region in genome has evidence, but no gene prediction generated. Here are the detail.
>>>
>>>
>>>
>>> color means:
>>> pink: Augustus
>>> light green: SNAP
>>> dark pink: pred_gff
>>> light yellow: cufflinks
>>> darkest pink: EST alignment
>>> dark yellow: protein alignment
>>>
>>> In the region marked by red frame, there are predictions from Augustus and pred_gff, also evidences from cufflinks, why there is no gene model generated? I could find the gene model in the "XXXX.all.maker.non_overlapping_ab_initio.transcripts.fasta" file. It is weird because it did have evidence supported.
>>>
>>> Could anyone know the reason?
>>>
>>> Thanks very much!
>>>
>>> Best,
>>> Wenbo
>>> _______________________________________________
>>> maker-devel mailing list
>>> maker-devel at box290.bluehost.com
>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>
>
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From chenwenbo1020 at gmail.com Tue May 12 19:06:42 2015
From: chenwenbo1020 at gmail.com (陈文博)
Date: Tue, 12 May 2015 21:06:42 -0400
Subject: [maker-devel] why no prediction
In-Reply-To: 
References: <39947BAA-6172-4318-85C2-4E9288B5C700@gmail.com> <97DE16F1-4CD0-4E71-BC89-49CFFBDE8CA7@gmail.com>
Message-ID: 

Thank you for the help.

I double-checked the "EST alignment", and sorry, the darkest pink is the transcript assembled with Trinity, not an EST. The splice sites are GT/AG. The cufflinks and Trinity results suggest that this region could be transcribed. There is an intact ORF in the prediction given by Augustus. Maybe this region is a real gene, but it was not predicted by Maker.

Thanks,
Wenbo

2015-05-12 20:31 GMT-04:00 Mark Yandell :

> and finally check the splice sites for the EST splice are they valid
> GT/AG or AT/AC?

From carsonhh at gmail.com Tue May 12 20:09:56 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 12 May 2015 20:09:56 -0600
Subject: [maker-devel] why no prediction
In-Reply-To:
References: <39947BAA-6172-4318-85C2-4E9288B5C700@gmail.com> <97DE16F1-4CD0-4E71-BC89-49CFFBDE8CA7@gmail.com>
Message-ID: <0E75FA8B-61F3-4FC2-9B77-7860EA2B14E2@gmail.com>

Hi Wenbo,

You will actually get more gene calls from gene predictors than there are genes (often orders of magnitude more) because workable ORFs are common in a genome. So having a single-exon ORF predicted is not really that noteworthy. You can expect those kinds of predictions to outnumber the true gene count by as much as 10 to 1 in some genomes.

The problem with the region you are showing is that it doesn't look like a gene. Even without a more detailed look at the coordinates and evidence overlap, the image lacks the concordance between evidence and predictions that would be expected in a genic region. Without some form of additional evidence like a good protein match, it is just too much like the many spurious overlap regions that you would expect to find randomly throughout a genome. Given this, there is just not enough support to promote the region to being a gene. The predictions are still there in the output for reference purposes, but will not be promoted to a gene because the evidence support is insufficient.

Looking at this region, there are no good gene predictions from SNAP, Augustus, or your pred_gff either (poor concordance). The heavy discordance among the different gene predictors suggests they have not been sufficiently trained. One thing that can affect evidence alignment and gene predictor performance is insufficient masking of repeat elements. You may need to spend some time building a species-specific repeat database using tools like RepeatModeler. Other issues that will have an effect are stretches of N's in the sequence. You will get poor evidence alignments and predictions in what appears to be a large contig if there isn't enough continuous usable sequence. I mention all these factors because the region in question looks spurious and unordered. Lack of concordance in clustering patterns generally means there are other structural issues with the dataset being used.

I've attached an image below to give an example. Notice how in regions with genes the different evidence types build on each other and have remarkable concordance (SNAP and Augustus choose very similar exon patterns, for example). Regions without genes still have aligned evidence from Trinity-assembled mRNA-seq and ab initio gene predictors, but they are not concordant, are more spurious in nature, and can be found on both strands. Simple overlap is insufficient to generate a gene call. You have to consider the totality of evidence.

Thanks,
Carson
-------------- next part --------------
A non-text attachment was scrubbed...
Name: PastedGraphic-1.png
Type: image/png
Size: 51295 bytes
Desc: not available

From julian.egger at omahazoo.com Thu May 14 12:17:50 2015
From: julian.egger at omahazoo.com (Julian Egger)
Date: Thu, 14 May 2015 18:17:50 +0000
Subject: [maker-devel] Non-redundant Reference Human EST Data
Message-ID: <5DA2ECDF8921564C9F789B4DB8E4FE35A687@post.omahazoo.org>

We have assembled scaffolds from genomic reads of a primate sample and would like to annotate as many genes as possible with MAKER. Where is the best place to find an EST file to use with MAKER containing all of the non-redundant reference human ESTs? I was looking around NCBI, Ensembl, and UCSC, but I am not sure what the FTP site, subdirectory, and file name would be for something like that.

Thanks

From bertbrutzel at googlemail.com Fri May 15 01:08:42 2015
From: bertbrutzel at googlemail.com (Bert Brutzel)
Date: Fri, 15 May 2015 09:08:42 +0200
Subject: [maker-devel] Genbank submission
Message-ID: <55559B7A.2080906@gmail.com>

Dear All,

How do I reformat MAKER output into valid .tbl files for GenBank submission?
I have already tried quite a few routes for converting the .gff to a .tbl (e.g. sed+awk+GAG....) but I simply run into too many problems. I also tried to load the data into a Chado database, but this took over two weeks and exited with errors.
Maybe someone who has already submitted a MAKER-annotated genome to GenBank can help me... PLEASE.

Thank you very much,
Bert

From robert.king at rothamsted.ac.uk Fri May 15 09:12:34 2015
From: robert.king at rothamsted.ac.uk (Robert King)
Date: Fri, 15 May 2015 15:12:34 +0000
Subject: [maker-devel] Genbank submission
In-Reply-To: <429bc931-e703-411d-ba07-7994c62bc6e9@ROTHEX1.rothamsted.ac.uk>
References: <429bc931-e703-411d-ba07-7994c62bc6e9@ROTHEX1.rothamsted.ac.uk>
Message-ID: <136AB40E0C34CF4FB9AE0DD8C22A8D7B7F484D@rothex1.rothamsted.ac.uk>

Get the GFF file from MAKER and the FASTA file ready. I then edit in Geneious and export in EMBL format, but we pay for Geneious, so you may not have it. If you have your final GFF file by whatever means, you can use seqret to convert it too.

https://www.biostars.org/p/72220/

I have not submitted to NCBI because I submit to ENA, and they require a special header for EMBL format, which means I have to edit the file before submitting to them.

Rob

From carsonhh at gmail.com Fri May 15 09:30:47 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 15 May 2015 09:30:47 -0600
Subject: [maker-devel] Non-redundant Reference Human EST Data
In-Reply-To: <5DA2ECDF8921564C9F789B4DB8E4FE35A687@post.omahazoo.org>
References: <5DA2ECDF8921564C9F789B4DB8E4FE35A687@post.omahazoo.org>
Message-ID: <39827E89-0C94-4CC9-B04A-A6FB94E26B36@gmail.com>

Hi Julian,

Using human ESTs on primate contigs would be very inefficient. The human genome is already annotated, so you should instead use either the human proteome or human transcripts as input. ESTs from a species other than the one being annotated should only be used if there is no curated annotation set to use instead.

You may be able to just give the human transcripts to the est= option if the two organisms have not diverged too much in nucleotide sequence. Don't use the alt_est option, since you have human protein annotations. The alt_est option uses tblastx to seed the alignments, which will not be as accurate as the protein= option that seeds via blastx, and it is about 10 times more expensive computationally. So it will take a lot longer and you won't get anything that you couldn't have found using the protein data instead.

Also, scaffolds shorter than about 10 kb will likely be too short to annotate, so you can test out your parameters on just a few of the largest contigs. In addition, don't use SNAP, because it performs poorly on primate genomes (use Augustus instead).
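
The suggestion to test parameters on a few of the largest contigs can be sketched as a small Python helper. This is a minimal illustration, not part of MAKER: it assumes the assembly is a plain multi-FASTA file, the function names are hypothetical, and the 10,000 bp default simply mirrors the "about 10 kb" figure above.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def largest_contigs(records, min_length=10_000, top_n=5):
    """Drop records shorter than min_length and return the top_n longest.

    The result is sorted longest first.
    """
    kept = [(name, seq) for name, seq in records if len(seq) >= min_length]
    kept.sort(key=lambda rec: len(rec[1]), reverse=True)
    return kept[:top_n]


def write_fasta(records, handle, width=60):
    """Write (header, sequence) pairs to an open text handle, wrapped at width."""
    for name, seq in records:
        handle.write(f">{name}\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")
```

A typical use would be to open the assembly, pass the file handle to read_fasta, write the result of largest_contigs to a small test FASTA, and point the genome= option at that file for a trial run.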

Thanks,
Carson

From carsonhh at gmail.com Fri May 15 09:48:20 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 15 May 2015 09:48:20 -0600
Subject: [maker-devel] Genbank submission
In-Reply-To: <55559B7A.2080906@gmail.com>
References: <55559B7A.2080906@gmail.com>
Message-ID: <1248BA7E-FDC5-4CE3-815C-8D6B9B78A33E@gmail.com>

Here is an archived thread on this that might be useful as well ->

https://groups.google.com/forum/#!searchin/maker-devel/genbank/maker-devel/qypkypBXVjs/aJACj38DpxMJ

--Carson

From cjfields at illinois.edu Sat May 16 09:51:49 2015
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Sat, 16 May 2015 15:51:49 +0000
Subject: [maker-devel] Genbank submission
In-Reply-To: <1248BA7E-FDC5-4CE3-815C-8D6B9B78A33E@gmail.com>
References: <55559B7A.2080906@gmail.com> <1248BA7E-FDC5-4CE3-815C-8D6B9B78A33E@gmail.com>
Message-ID: <3A43AEA3-A63A-4AC4-81A3-DA165F048722@illinois.edu>

We've been using GAG (mentioned in this thread), though with some fiddling. I have heard that ENA has a much easier submission process.

chris

From smg283 at gmail.com Sat May 16 22:11:21 2015
From: smg283 at gmail.com (Scott Geib)
Date: Sat, 16 May 2015 18:11:21 -1000
Subject: [maker-devel] maker-devel Digest, Vol 84, Issue 8
In-Reply-To:
References:
Message-ID:

If anyone has bugs or suggestions for GAG, let us know and we can modify it. Right now we are fixing some bugs and applying it to a new dataset, so it is a good time to add anything people might find useful. Email me or Brian (bhall7 at hawaii.edu).

Thanks,
Scott

From julian.egger at omahazoo.com Mon May 18 07:53:40 2015
From: julian.egger at omahazoo.com (Julian Egger)
Date: Mon, 18 May 2015 13:53:40 +0000
Subject: [maker-devel] Non-redundant Reference Human EST Data
In-Reply-To: <39827E89-0C94-4CC9-B04A-A6FB94E26B36@gmail.com>
References: <5DA2ECDF8921564C9F789B4DB8E4FE35A687@post.omahazoo.org> <39827E89-0C94-4CC9-B04A-A6FB94E26B36@gmail.com>
Message-ID: <5DA2ECDF8921564C9F789B4DB8E4FE35A730@post.omahazoo.org>

Hi Carson,

Thank you for the response. I had assumed using human transcripts instead of ESTs might be the way to go, but I had read so much about using ESTs for annotation. You said to use either the human proteome or human transcripts. Would using both in the MAKER setup file be too inefficient? Whichever data type we use, since we are trying to annotate as many genes as possible from our scaffolds, we are looking for a single file to use. For either mRNA transcripts or protein sequences, would data from ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/ be good reference data for our scaffolds? That directory has both protein files and RNA files. I am not sure whether a good option would be to concatenate either the RNA files or the protein files together and use that as the est= or protein= file. Otherwise, is there a better reference source people use for MAKER?

Thanks,

Julian

From dence at genetics.utah.edu Mon May 18 08:38:15 2015
From: dence at genetics.utah.edu (Daniel Ence)
Date: Mon, 18 May 2015 14:38:15 +0000
Subject: [maker-devel] Non-redundant Reference Human EST Data
In-Reply-To: <5DA2ECDF8921564C9F789B4DB8E4FE35A730@post.omahazoo.org>
References: <5DA2ECDF8921564C9F789B4DB8E4FE35A687@post.omahazoo.org> <39827E89-0C94-4CC9-B04A-A6FB94E26B36@gmail.com> <5DA2ECDF8921564C9F789B4DB8E4FE35A730@post.omahazoo.org>
Message-ID: <5B996D8E-8D55-46DF-86BE-1E8FB352CE5F@genetics.utah.edu>

Hi Julian,

The RefSeq NM models would be a good place to start for evidence, since those are curated manually. Don't concatenate the protein and EST files together; putting amino acid sequence in as EST evidence will only give you errors in BLAST, and vice versa.

The number of files you use in your annotation doesn't matter as much as the quantity and breadth of the evidence that you use.

I don't think that supplying the RefSeq models as both EST and protein evidence will make a big difference, but you'd have to put the human ESTs in as alt_ests, which takes longer to blast.

I think a good rule of thumb for the protein evidence is to have protein evidence from two genomes that are about the same distance from your target genome and a third set from a genome that's an outgroup to all three genomes. Another good source for protein evidence is the UniRef database.

Let me know if that helps,
Daniel

From carsonhh at gmail.com Mon May 18 09:08:45 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 18 May 2015 09:08:45 -0600
Subject: [maker-devel] Non-redundant Reference Human EST Data
In-Reply-To: <5B996D8E-8D55-46DF-86BE-1E8FB352CE5F@genetics.utah.edu>
References: <5DA2ECDF8921564C9F789B4DB8E4FE35A687@post.omahazoo.org> <39827E89-0C94-4CC9-B04A-A6FB94E26B36@gmail.com> <5DA2ECDF8921564C9F789B4DB8E4FE35A730@post.omahazoo.org> <5B996D8E-8D55-46DF-86BE-1E8FB352CE5F@genetics.utah.edu>
Message-ID: <175D2039-B0C2-43D8-9C12-49F4124E80F4@gmail.com>

If you decide to use human transcripts because it is a closely related primate, put them into the est= option. As I said in the previous e-mail, they may not align (because of nucleotide divergence), but you don't want to use the alt_est option because you have proteins, which will align better than alt_est and will align much faster. You can use both transcripts and proteins if you want. Don't use human ESTs. There will be no benefit. They contain the same information as the annotated transcripts and proteins but will be noisier. You can download human transcripts and proteins from the RefSeq FTP server. Then add any additional proteomes you choose to use as additional evidence.
Novel genes will only be discoverable if you have ESTs from the species being annotated. But without those, you can still identify orthologs and paralogs from other species.

--Carson
Was trying to look around NCBI, Ensembl, and UCSC, but not sure what the ftp site, subdirectory, and file name would be for something like that. >> >> Thanks >> >> _______________________________________________ >> maker-devel mailing list >> maker-devel at box290.bluehost.com >> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org >> >> _______________________________________________ >> maker-devel mailing list >> maker-devel at box290.bluehost.com >> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Mon May 18 09:16:59 2015 From: carsonhh at gmail.com (Carson Holt) Date: Mon, 18 May 2015 09:16:59 -0600 Subject: [maker-devel] Non-redundant Reference Human EST Data In-Reply-To: <175D2039-B0C2-43D8-9C12-49F4124E80F4@gmail.com> References: <5DA2ECDF8921564C9F789B4DB8E4FE35A687@post.omahazoo.org> <39827E89-0C94-4CC9-B04A-A6FB94E26B36@gmail.com> <5DA2ECDF8921564C9F789B4DB8E4FE35A730@post.omahazoo.org> <5B996D8E-8D55-46DF-86BE-1E8FB352CE5F@genetics.utah.edu> <175D2039-B0C2-43D8-9C12-49F4124E80F4@gmail.com> Message-ID: <322294B9-1E4F-4D3F-A6D8-C735FFE2A2B5@gmail.com> Best sources ?> ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/protein/ ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/RNA/ ?Carson > On May 18, 2015, at 9:08 AM, Carson Holt wrote: > > If you decide to use human transcripts because it is a closely related primate, put them into the est= option. Like I said in the previous e-mail, they may not align (because of nucleotide divergence), but you don?t want to use the alt_est option because you have proteins which will align better than alt_est and will align much faster. You can use both transcripts and proteins if you want. Don?t use human EST?s. There will be no benefit. They contain the same information as the annotated transcripts and proteins but will be noisier. 
You can download human transcripts and proteins from the RefSeq FTP server. Then add any additional proteomes you choose to use as additional evidence. Novel genes will only be discoverable if you have EST?s from the species being annotated. But without those, you can still identify orthologs and paralogs from other species. > > ?Carson > > >> On May 18, 2015, at 8:38 AM, Daniel Ence > wrote: >> >> Hi Julian, >> >> The RefSeq NM models would be a good place to start for evidence, since those are curated manually. Don?t concatenate the protein and EST files together; putting amino acid seq in as EST will only give you errors in blast and vice versa. >> >> The number of files you use in your annotation doesn?t matter as much as the quantity and breadth of the evidence that you use. >> >> I don?t think that putting the RefSeq models as EST and protein evidence will make a big difference, but you?d have to put the human ESTs in as alt_ests, which takes longer to blast. >> >> I think a good rule of thumb for the protein evidence is to have protein evidence from two genome that are about the same distance from your target genome and a third set from a genome that?s an outgroup to all three genomes. Another good source for protein evidence is the UniRef database. >> >> Let me know if that helps, >> Daniel >> >> >> >> >>> On May 18, 2015, at 7:53 AM, Julian Egger > wrote: >>> >>> Hi Carson, >>> >>> Thank you for the response. I had assumed using human transcripts instead of ESTs might be the way to go, but I had just read so much about using ESTs for annotation. You said to use either the human proteome or human transcripts, would using both in the MAKER setup file be too inefficient? As far as using either data type, since we are trying to annotate as many genes as possible from our scaffolds, we are looking for a single file to use. 
For either mRNA transcripts or protein sequences, would using data from ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/ be good reference data for our scaffolds? That directory has both protein files and rna files. I am not sure if a good option would be too concatenate either the rna files or protein files together and use that as the est= or protein= file. Otherwise, is there a better reference source people use for MAKER? >>> >>> Thanks, >>> >>> Julian >>> >>> From: Carson Holt [mailto:carsonhh at gmail.com ] >>> Sent: Friday, May 15, 2015 10:31 AM >>> To: Julian Egger >>> Cc: maker-devel at yandell-lab.org >>> Subject: Re: [maker-devel] Non-redundant Reference Human EST Data >>> >>> Hi Julian, >>> >>> Using Human EST?s on primate contigs would be very inefficient. The human genome is already annotated, so you should instead use either the human proteome or human transcripts as input. Using EST?s from another species other than the one being annotated should only be done if there is not a curated annotation set to use instead. >>> >>> You may be able to just give the human transcripts to the est= option if the two organisms have not diverged too much in nucleotide sequence. Don?t use the alt_est option since you have human protein annotations. The alt_est option uses tblastx to seed the alignments which will not be as accurate as the protein= option that seeds via blastx, and it is about 10 time more expensive computationally. So it will take a lot longer and you won?t get anything that you couldn?t have found using the protein data instead. >>> >>> Also scaffolds shorter than about 10kb will likely be too short to annotate, so you can test out your parameters on a few of only the largest contigs. In addition, don?t use SNAP because it performs poorly on primate genomes (use Augustus instead). 
>>> >>> Thanks, >>> Carson >>> >>> >>> >>> On May 14, 2015, at 12:17 PM, Julian Egger > wrote: >>> >>> We have assembled scaffolds from genomic reads of a primate sample and would like to annotate as many genes as possible with MAKER. Where is the best place to find an EST file to use with MAKER containing all of the non-redundant reference humans ESTs? Was trying to look around NCBI, Ensembl, and UCSC, but not sure what the ftp site, subdirectory, and file name would be for something like that. >>> >>> Thanks >>> >>> _______________________________________________ >>> maker-devel mailing list >>> maker-devel at box290.bluehost.com >>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org >>> >>> _______________________________________________ >>> maker-devel mailing list >>> maker-devel at box290.bluehost.com >>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > -------------- next part -------------- An HTML attachment was scrubbed... URL: From julian.egger at omahazoo.com Mon May 18 09:17:46 2015 From: julian.egger at omahazoo.com (Julian Egger) Date: Mon, 18 May 2015 15:17:46 +0000 Subject: [maker-devel] Non-redundant Reference Human EST Data In-Reply-To: <322294B9-1E4F-4D3F-A6D8-C735FFE2A2B5@gmail.com> References: <5DA2ECDF8921564C9F789B4DB8E4FE35A687@post.omahazoo.org> <39827E89-0C94-4CC9-B04A-A6FB94E26B36@gmail.com> <5DA2ECDF8921564C9F789B4DB8E4FE35A730@post.omahazoo.org> <5B996D8E-8D55-46DF-86BE-1E8FB352CE5F@genetics.utah.edu> <175D2039-B0C2-43D8-9C12-49F4124E80F4@gmail.com>, <322294B9-1E4F-4D3F-A6D8-C735FFE2A2B5@gmail.com> Message-ID: <5DA2ECDF8921564C9F789B4DB8E4FE35A763@post.omahazoo.org> Ok great. I had looked at those files as well. So I could set both est=rna.fa and protein=protein.fa with files from ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/protein/ ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/RNA/ ? Would that be worthwhile or would it just slow things up to a point where it wouldn't be worth it? 

Thanks again,

Julian

From carsonhh at gmail.com Mon May 18 09:31:36 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 18 May 2015 09:31:36 -0600
Subject: [maker-devel] Non-redundant Reference Human EST Data
In-Reply-To: <5DA2ECDF8921564C9F789B4DB8E4FE35A763@post.omahazoo.org>
References: <5DA2ECDF8921564C9F789B4DB8E4FE35A687@post.omahazoo.org> <39827E89-0C94-4CC9-B04A-A6FB94E26B36@gmail.com> <5DA2ECDF8921564C9F789B4DB8E4FE35A730@post.omahazoo.org> <5B996D8E-8D55-46DF-86BE-1E8FB352CE5F@genetics.utah.edu> <175D2039-B0C2-43D8-9C12-49F4124E80F4@gmail.com> <322294B9-1E4F-4D3F-A6D8-C735FFE2A2B5@gmail.com> <5DA2ECDF8921564C9F789B4DB8E4FE35A763@post.omahazoo.org>
Message-ID: <36E62C76-8F05-4BA5-8CEE-91E68A08FB79@gmail.com>

You have to have protein evidence from some source. Preferably at least two somewhat related organisms. Proteins take a while to align (amino acid alignment is computationally intensive), ESTs not so much.

--Carson

From julian.egger at omahazoo.com Tue May 19 13:51:54 2015
From: julian.egger at omahazoo.com (Julian Egger)
Date: Tue, 19 May 2015 19:51:54 +0000
Subject: [maker-devel] Using Augustus with MAKER
Message-ID: <5DA2ECDF8921564C9F789B4DB8E4FE35A807@post.omahazoo.org>

I am trying to use Augustus in MAKER to help with annotating as many genes as possible from genomic reads of a primate sample. I am new to using gene prediction tools such as SNAP and Augustus, but was told Augustus would be better for primates.
I tried using reference mRNAs and protein sequences from NCBI on the sample contig file included with the MAKER software and it ran ok. My question is how do I now use the output to train Augustus iteratively and thus create a file set of annotations from my original input?

After creating the control files with maker -CTL, the only changes I made to maker_opts.ctl were:

genome=data/hsap_contig.fasta # contig file from example data
est=data/mRNAs.fa # RNAs filtered to just mRNAs
protein=data/protein.fa
est2genome=1
protein2genome=1

I will eventually replace the contig file with our scaffolds file from the assembly. I know the output created a gff file along with protein and mRNA files. Do I then need to change the maker_opts file to account for the new files, and if so, how, and what should the maker_opts file look like now? Was Augustus supposed to be set up on the initial maker run, or do I wait until the second run, after est2genome and protein2genome were used to initialize training for Augustus? And how do the configurations change between multiple iterations because I have a solid annotation set?

Sorry for all the questions, newbie here with a lot of data to work with.

Thanks

From michael.s.campbell1 at gmail.com Tue May 19 15:18:43 2015
From: michael.s.campbell1 at gmail.com (Michael Campbell)
Date: Tue, 19 May 2015 15:18:43 -0600
Subject: [maker-devel] Using Augustus with MAKER
In-Reply-To: <5DA2ECDF8921564C9F789B4DB8E4FE35A807@post.omahazoo.org>
References: <5DA2ECDF8921564C9F789B4DB8E4FE35A807@post.omahazoo.org>
Message-ID:

Hi Julian,

Since you are annotating a primate I would use the pre-trained human parameter for augustus.
Here is what I would try first:

genome=data/hsap_contig.fasta # contig file from example data
est=data/mRNAs.fa # RNAs filtered to just mRNAs
protein=data/protein.fa
est2genome=0
protein2genome=0
augustus_species=human

You could also use one of the mammal HMMs packaged with SNAP, or use the output from the above to train SNAP. There are tutorials that walk through these steps here:

http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Main_Page

There is also a Current Protocols in Bioinformatics article on using MAKER that may help you get started as well.

http://onlinelibrary.wiley.com/doi/10.1002/0471250953.bi0411s48/abstract

Good luck,
Mike

--
Michael Campbell MS, RD.
Doctoral Candidate
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330
ph:585-3543

From carsonhh at gmail.com Tue May 19 15:48:29 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 19 May 2015 15:48:29 -0600
Subject: [maker-devel] Using Augustus with MAKER
In-Reply-To:
References: <5DA2ECDF8921564C9F789B4DB8E4FE35A807@post.omahazoo.org>
Message-ID: <6B295FB1-46C8-44B7-A816-66DF6F45D3E0@gmail.com>

A couple of corrections to the previous reply. SNAP doesn't work well on primates, so you probably don't want to use it (the mammal HMM is not a good replacement). This suggestion comes directly from the author of SNAP. There are ways to make it work by splitting the genome into isotigs, but it's a little messy and technical, so just don't use it on primates.

Here's a good website on training Augustus (http://www.molecularevolution.org/molevolfiles/exercises/augustus/training.html). You need some sort of results to train with. You can either use results from a protein2genome run of MAKER or a run where you use human as your species together with other evidence in MAKER (models won't be perfect but will be enough to get training going).
Unless it's really, really close evolutionarily to human, you probably don't want to just stick to the human species file (this is because you're not going to want to use SNAP, so you will need to optimize the one gene predictor you will get to use as much as possible).

You need models to be in GenBank format for training. There is a roundabout way to do this with GFF3 models. First use the script that comes with MAKER for training SNAP (maker2zff). Then follow SNAP's training instructions (in SNAP's README). Basically the following commands (where the first two files came from maker2zff) -->

fathom genome.ann genome.dna -categorize 1000
fathom uni.ann uni.dna -export 1000 -plus

Then, using this script from Jason Stajich, you can convert the export.ann and export.dna files to a GenBank format file -->

https://github.com/hyphaltip/genome-scripts/blob/master/gene_prediction/zff2augustus_gbk.pl

Go ahead and run with human as your species first, so you can review the models and see how models and evidence correlate in a viewer like Apollo or IGV. But I would still recommend training Augustus on your species.

--Carson

From jcornel3 at asu.edu Wed May 27 16:57:35 2015
From: jcornel3 at asu.edu (John Cornelius)
Date: Wed, 27 May 2015 15:57:35 -0700
Subject: [maker-devel] Training Augustus
Message-ID:

Hi all,

I'm trying to train Augustus with a non-model organism. I've run MAKER, then trained and run SNAP twice, and would now like to run Augustus on the results as well. I've seen the Augustus page on training the program, and it mentioned needing a list of 200+ quality gene structures for training. Is there a way that I could filter the SNAP results for the highest quality genes to feed into Augustus?

Thanks.

--
John Cornelius
MCB PhD Candidate
Arizona State University
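
Both Augustus threads above turn on having enough gene models to train with. The following is a minimal sketch, not part of any MAKER, SNAP, or Augustus tool, for counting how many distinct gene models a ZFF annotation file (such as the export.ann written by fathom) contains before converting it to GenBank format. It assumes the short four-column ZFF layout (label, begin, end, model name) under ">" sequence headers. The function name count_zff_models and the sample records are hypothetical and for illustration only, and the sketch does not attempt any quality filtering.

```python
from collections import defaultdict


def count_zff_models(lines):
    """Return {sequence_id: number of distinct gene models}.

    Assumes short-format ZFF: a '>' header line per sequence, followed by
    whitespace-separated feature lines of the form
    label  begin  end  model_name
    """
    models = defaultdict(set)
    seq_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            seq_id = parts[0] if parts else ""
            continue
        fields = line.split()
        if seq_id is None or len(fields) < 4:
            continue  # skip anything that does not look like a feature line
        models[seq_id].add(fields[3])
    return {seq: len(names) for seq, names in models.items()}


# Made-up records, only to show the expected input shape.
sample = """\
>scaffold_1
Einit 201 325 model_1
Eterm 410 560 model_1
Esngl 900 1400 model_2
>scaffold_2
Einit 50 120 model_3
Exon 200 280 model_3
Eterm 350 470 model_3
""".splitlines()

counts = count_zff_models(sample)
print(counts, "total:", sum(counts.values()))
```

For a real file, the same function can be given an open file handle, for example count_zff_models(open("export.ann")), and the total compared against the roughly 200 gene structures mentioned in the Augustus training documentation.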