From chenwenbo1020 at gmail.com Sat Apr 2 18:41:26 2016
From: chenwenbo1020 at gmail.com (陈文博)
Date: Sat, 2 Apr 2016 19:41:26 -0400
Subject: [maker-devel] mapping annotations to a new assembly
Message-ID:

Hi All,

Recently, I updated the genome assembly and want to update the annotation to fit the new genome. I only want to update the gene positions. I used MAKER and changed the maker_opt.ctl file as follows:

genome=$PATH_TO_mygenome
organism_type=eukaryotic
est=$PATH_TO_transcript_seq
est2genome=1
est_forward=1

After running MAKER, some genes were lost. There were 14,146 transcripts as input, but only 13,092 gene models in the output. Does anyone know the reason? Thank you!

Best regards,
Wenbo

From maker-devel at yandell-lab.org Mon Apr 4 04:52:20 2016
From: maker-devel at yandell-lab.org (maker-devel)
Date: Mon, 04 Apr 2016 15:22:20 +0530
Subject: [maker-devel] Photos 2
Message-ID:

Sent from my Galaxy S6 edge+ Orange
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 20160404_327408_resized.zip
Type: application/zip
Size: 2934 bytes
Desc: not available
URL:

From carsonhh at gmail.com Mon Apr 4 11:34:45 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 4 Apr 2016 10:34:45 -0600
Subject: [maker-devel] mapping annotations to a new assembly
In-Reply-To:
References:
Message-ID: <077DBA54-07A3-4A74-8A76-8F7E7EA246E3@gmail.com>

Because the assembly has changed. That means sequence can be different, missing, or altered in a way that breaks a previous CDS. You can try relaxing the filtering parameters in maker_bopts.ctl to recover more partial or incomplete matches. Also adjust the max intron size to allow for really long introns. That might recover a few more.

--Carson

> On Apr 2, 2016, at 5:41 PM, 陈文博
wrote:
>
> Hi All,
>
> Recently, I updated the genome assembly and want to update the annotation to fit the new genome. I only want to update the gene positions. I used MAKER and changed the maker_opt.ctl file as follows:
>
> genome=$PATH_TO_mygenome
> organism_type=eukaryotic
> est=$PATH_TO_transcript_seq
> est2genome=1
> est_forward=1
>
> After running MAKER, some genes were lost. There were 14,146 transcripts as input, but only 13,092 gene models in the output. Does anyone know the reason? Thank you!
>
> Best regards,
> Wenbo
> _______________________________________________
> maker-devel mailing list
> maker-devel at yandell-lab.org
> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org

From chenwenbo1020 at gmail.com Mon Apr 4 11:40:32 2016
From: chenwenbo1020 at gmail.com (陈文博)
Date: Mon, 4 Apr 2016 12:40:32 -0400
Subject: [maker-devel] mapping annotations to a new assembly
In-Reply-To: <077DBA54-07A3-4A74-8A76-8F7E7EA246E3@gmail.com>
References: <077DBA54-07A3-4A74-8A76-8F7E7EA246E3@gmail.com>
Message-ID:

Hi Carson,

Thank you.

Sorry, I forgot to mention that in the new assembly version I only connected some scaffolds into super-scaffolds with Ns.

My annotation question is:

MAKER uses BLAST to anchor the genes. If some genes map to multiple positions (for example, single-exon genes), what will MAKER decide to do?

Thanks!

Best,
Wenbo

2016-04-04 12:34 GMT-04:00 Carson Holt:
> Because the assembly has changed. That means sequence can be different, missing, or altered in a way that breaks a previous CDS. You can try relaxing the filtering parameters in maker_bopts.ctl to recover more partial or incomplete matches. Also adjust the max intron size to allow for really long introns. That might recover a few more.
>
> --Carson
>
>
> > On Apr 2, 2016, at 5:41 PM, 陈文博
wrote:
> >
> > Hi All,
> >
> > Recently, I updated the genome assembly and want to update the annotation to fit the new genome. I only want to update the gene positions. I used MAKER and changed the maker_opt.ctl file as follows:
> >
> > genome=$PATH_TO_mygenome
> > organism_type=eukaryotic
> > est=$PATH_TO_transcript_seq
> > est2genome=1
> > est_forward=1
> >
> > After running MAKER, some genes were lost. There were 14,146 transcripts as input, but only 13,092 gene models in the output. Does anyone know the reason? Thank you!
> >
> > Best regards,
> > Wenbo
> > _______________________________________________
> > maker-devel mailing list
> > maker-devel at yandell-lab.org
> > http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org
>

From carsonhh at gmail.com Mon Apr 4 11:42:58 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 4 Apr 2016 10:42:58 -0600
Subject: [maker-devel] mapping annotations to a new assembly
In-Reply-To:
References: <077DBA54-07A3-4A74-8A76-8F7E7EA246E3@gmail.com>
Message-ID: <2005D161-2359-4836-965D-1007E9BADEA6@gmail.com>

MAKER will report back all positions. The value in the score column (range 0 to 100) shows how well each one matches the original. In the event of a tie, you will need to manually select one or the other. Unfortunately, the process of mapping onto a new assembly is not completely automated; it still requires intervention from the user in those cases.

--Carson

> On Apr 4, 2016, at 10:40 AM, 陈文博 wrote:
>
> Hi Carson,
>
> Thank you.
>
> Sorry, I forgot to mention that in the new assembly version I only connected some scaffolds into super-scaffolds with Ns.
>
> My annotation question is:
>
> MAKER uses BLAST to anchor the genes. If some genes map to multiple positions (for example, single-exon genes), what will MAKER decide to do?
>
> Thanks!
>
> Best,
> Wenbo
>
> 2016-04-04 12:34 GMT-04:00 Carson Holt:
> Because the assembly has changed. That means sequence can be different, missing, or altered in a way that breaks a previous CDS. You can try relaxing the filtering parameters in maker_bopts.ctl to recover more partial or incomplete matches. Also adjust the max intron size to allow for really long introns. That might recover a few more.
>
> --Carson
>
>
> > On Apr 2, 2016, at 5:41 PM, 陈文博 wrote:
> >
> > Hi All,
> >
> > Recently, I updated the genome assembly and want to update the annotation to fit the new genome. I only want to update the gene positions. I used MAKER and changed the maker_opt.ctl file as follows:
> >
> > genome=$PATH_TO_mygenome
> > organism_type=eukaryotic
> > est=$PATH_TO_transcript_seq
> > est2genome=1
> > est_forward=1
> >
> > After running MAKER, some genes were lost. There were 14,146 transcripts as input, but only 13,092 gene models in the output. Does anyone know the reason? Thank you!
> >
> > Best regards,
> > Wenbo
> > _______________________________________________
> > maker-devel mailing list
> > maker-devel at yandell-lab.org
> > http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org
>

From kai.kamm at ecolevol.de Mon Apr 18 08:13:14 2016
From: kai.kamm at ecolevol.de (Kai Kamm)
Date: Mon, 18 Apr 2016 15:13:14 +0200
Subject: [maker-devel] Maker Failed Contigs, Bio::Root::Exception
Message-ID: <5714DD6A.1080309@ecolevol.de>

Hi,

while I have no problem running Maker on my desktop computer (Ubuntu 14.04 LTS), I always get the error below (for all contigs) when I try to run Maker on a server.

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Did not specify a Query End or Query Begin
STACK: Error::throw
STACK: Bio::Root::Root::throw /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Root/Root.pm:449
STACK: Bio::Search::HSP::GenericHSP::_query_seq_feature /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Search/HSP/GenericHSP.pm:1525
STACK: Bio::Search::HSP::GenericHSP::query /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Search/HSP/GenericHSP.pm:956
STACK: Bio::Search::HSP::HSPI::start /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Search/HSP/HSPI.pm:504
STACK: PhatHit_utils::add_offset /homes/biertank/kai/maker/bin/../lib/PhatHit_utils.pm:1462
STACK: GI::parse_abinit_file /homes/biertank/kai/maker/bin/../lib/GI.pm:1199
STACK: Process::MpiChunk::_go /homes/biertank/kai/maker/bin/../lib/Process/MpiChunk.pm:1469
STACK: Process::MpiChunk::run /homes/biertank/kai/maker/bin/../lib/Process/MpiChunk.pm:341
STACK: main::node_thread /homes/biertank/kai/maker/bin/maker:1454
STACK: threads::new /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/forks.pm:799
STACK: /homes/biertank/kai/maker/bin/maker:914
-----------------------------------------------------------
--> rank=2, hostname=bioinf.uni-leipzig.de
ERROR: Failed while gathering ab-init output files
ERROR: Chunk failed at level:1, tier_type:2
FAILED CONTIG:scaffold20_cov246

ERROR: Chunk failed at level:4, tier_type:0
FAILED CONTIG:scaffold20_cov246

examining contents of the fasta file and run log

I have tried to rerun "perl ./Build.PL" and then "./Build install" several times using different versions of Perl. To install the required Perl modules I have used "./Build installdeps" and I also tried installing the dependencies manually via CPAN - to no avail.

Any idea?

Thank you!
Kai

From carsonhh at gmail.com Mon Apr 18 15:30:28 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 18 Apr 2016 14:30:28 -0600
Subject: [maker-devel] Maker Failed Contigs, Bio::Root::Exception
In-Reply-To: <5714DD6A.1080309@ecolevol.de>
References: <5714DD6A.1080309@ecolevol.de>
Message-ID: <5249E98C-9902-4369-9B68-95F3662B61CE@gmail.com>

Try updating BioPerl (use the CPAN version and not the BioPerl-live version, because it will fail). Also use MAKER version 2.31.8 and not the 3.00.0-beta version.

Then make sure there is no error further up. What you are seeing may be a snowball effect of the real error, which could be several screens back in the text. If you are using GFF3 files as input, then your format is probably incorrect.

--Carson

> On Apr 18, 2016, at 7:13 AM, Kai Kamm wrote:
>
> Hi,
>
> while I have no problem running Maker on my desktop computer (Ubuntu 14.04 LTS), I always get the error below (for all contigs) when I try to run Maker on a server.
>
>
> ------------- EXCEPTION: Bio::Root::Exception -------------
> MSG: Did not specify a Query End or Query Begin
> STACK: Error::throw
> STACK: Bio::Root::Root::throw /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Root/Root.pm:449
> STACK: Bio::Search::HSP::GenericHSP::_query_seq_feature /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Search/HSP/GenericHSP.pm:1525
> STACK: Bio::Search::HSP::GenericHSP::query /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Search/HSP/GenericHSP.pm:956
> STACK: Bio::Search::HSP::HSPI::start /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Search/HSP/HSPI.pm:504
> STACK: PhatHit_utils::add_offset /homes/biertank/kai/maker/bin/../lib/PhatHit_utils.pm:1462
> STACK: GI::parse_abinit_file /homes/biertank/kai/maker/bin/../lib/GI.pm:1199
> STACK: Process::MpiChunk::_go /homes/biertank/kai/maker/bin/../lib/Process/MpiChunk.pm:1469
> STACK: Process::MpiChunk::run /homes/biertank/kai/maker/bin/../lib/Process/MpiChunk.pm:341
> STACK: main::node_thread
/homes/biertank/kai/maker/bin/maker:1454
> STACK: threads::new /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/forks.pm:799
> STACK: /homes/biertank/kai/maker/bin/maker:914
> -----------------------------------------------------------
> --> rank=2, hostname=bioinf.uni-leipzig.de
> ERROR: Failed while gathering ab-init output files
> ERROR: Chunk failed at level:1, tier_type:2
> FAILED CONTIG:scaffold20_cov246
>
> ERROR: Chunk failed at level:4, tier_type:0
> FAILED CONTIG:scaffold20_cov246
>
> examining contents of the fasta file and run log
>
>
>
> I have tried to rerun "perl ./Build.PL" and then "./Build install" several times using different versions of Perl. To install the required Perl modules I have used "./Build installdeps" and I also tried installing the dependencies manually via CPAN - to no avail.
>
> Any idea?
>
> Thank you!
> Kai
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at yandell-lab.org
> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org

From fdolze at students.uni-mainz.de Tue Apr 19 07:08:18 2016
From: fdolze at students.uni-mainz.de (Florian)
Date: Tue, 19 Apr 2016 14:08:18 +0200
Subject: [maker-devel] A way to compare 2 annotation runs?
Message-ID: <57161FB2.30901@students.uni-mainz.de>

Hello All,

We ran MAKER on a newly assembled genome for 3 iterations: 2 seems to be the recommended standard, and I ran it a third time while on holiday. Now I want to compare the results of the iterations to see where the annotation (hopefully) improved or changed, but I can't really come up with a clever way to do this.

I reckon this must be an often-solved problem, but I couldn't find a solution apart from an older entry on this mailing list, which wasn't helpful.

So how are people assessing the quality of a MAKER run? How do you say one run was 'better' than another?

best regards & thanks for your input,
Florian

From kai.kamm at ecolevol.de Tue Apr 19 07:36:53 2016
From: kai.kamm at ecolevol.de (Kai Kamm)
Date: Tue, 19 Apr 2016 14:36:53 +0200
Subject: [maker-devel] Maker Failed Contigs, Bio::Root::Exception
In-Reply-To: <5249E98C-9902-4369-9B68-95F3662B61CE@gmail.com>
References: <5714DD6A.1080309@ecolevol.de> <5249E98C-9902-4369-9B68-95F3662B61CE@gmail.com>
Message-ID: <57162665.7070409@ecolevol.de>

Hello,

now it seems to work. I (re)installed BioPerl like so:

------------------------------------------------------------
Find the name of the latest BioPerl package:

cpan>d /bioperl/

....

Distribution CJFIELDS/BioPerl-1.6.901.tar.gz
Distribution CJFIELDS/BioPerl-1.6.922.tar.gz
Distribution CJFIELDS/BioPerl-1.6.924.tar.gz

And install the most recent:

cpan>install CJFIELDS/BioPerl-1.6.924.tar.gz
----------------------------------------------------------------

This produced some error messages during install, but Maker now works.

I just wonder why the BioPerl installation did not work properly with either "./Build installdeps" or cpan>install Bundle::BioPerl.

And why it worked this way on my desktop.

Anyway
Thanks!


On 18.04.2016 at 22:30, Carson Holt wrote:
> Try updating BioPerl (use the CPAN version and not the BioPerl-live version, because it will fail). Also use MAKER version 2.31.8 and not the 3.00.0-beta version.
>
> Then make sure there is no error further up. What you are seeing may be a snowball effect of the real error, which could be several screens back in the text. If you are using GFF3 files as input, then your format is probably incorrect.
>
> --Carson
>
>
>> On Apr 18, 2016, at 7:13 AM, Kai Kamm wrote:
>>
>> Hi,
>>
>> while I have no problem running Maker on my desktop computer (Ubuntu 14.04 LTS), I always get the error below (for all contigs) when I try to run Maker on a server.
>> Kai
>>
>> _______________________________________________
>> maker-devel mailing list
>> maker-devel at yandell-lab.org
>> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org
>

From carsonhh at gmail.com Tue Apr 19 10:08:02 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 19 Apr 2016 09:08:02 -0600
Subject: [maker-devel] A way to compare 2 annotation runs?
In-Reply-To: <57161FB2.30901@students.uni-mainz.de>
References: <57161FB2.30901@students.uni-mainz.de>
Message-ID: <148F7C5C-38D9-4F2C-BA0C-010E133B7F34@gmail.com>

I'm going to ask Michael Campbell to answer this. He wrote a protocols paper that will help.

--Carson

> On Apr 19, 2016, at 6:08 AM, Florian wrote:
>
>
> Hello All,
>
> We ran MAKER on a newly assembled genome for 3 iterations: 2 seems to be the recommended standard, and I ran it a third time while on holiday. Now I want to compare the results of the iterations to see where the annotation (hopefully) improved or changed, but I can't really come up with a clever way to do this.
>
> I reckon this must be an often-solved problem, but I couldn't find a solution apart from an older entry on this mailing list, which wasn't helpful.
>
>
> So how are people assessing the quality of a MAKER run? How do you say one run was 'better' than another?
>
>
> best regards & thanks for your input,
> Florian
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at yandell-lab.org
> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org

From carsonhh at gmail.com Tue Apr 19 10:18:20 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 19 Apr 2016 09:18:20 -0600
Subject: [maker-devel] Maker Failed Contigs, Bio::Root::Exception
In-Reply-To: <57162665.7070409@ecolevol.de>
References: <5714DD6A.1080309@ecolevol.de> <5249E98C-9902-4369-9B68-95F3662B61CE@gmail.com> <57162665.7070409@ecolevol.de>
Message-ID: <8B4352FC-113E-45EC-B7D4-6983B8FF2815@gmail.com>

Install like so:
cpan> install Bio::Perl

But it sounds like you've got a proper version now. Most likely you had a non-CPAN version of BioPerl installed. The version it reported met the ./Build dependency requirements, but it was really a broken version. This happens if you have BioPerl-live installed, for example.

--Carson

> On Apr 19, 2016, at 6:36 AM, Kai Kamm wrote:
>
> Hello,
>
> now it seems to work. I (re)installed BioPerl like so:
>
> ------------------------------------------------------------
> Find the name of the latest BioPerl package:
>
> cpan>d /bioperl/
>
> ....
>
> Distribution CJFIELDS/BioPerl-1.6.901.tar.gz
> Distribution CJFIELDS/BioPerl-1.6.922.tar.gz
> Distribution CJFIELDS/BioPerl-1.6.924.tar.gz
>
> And install the most recent:
>
> cpan>install CJFIELDS/BioPerl-1.6.924.tar.gz
> ----------------------------------------------------------------
>
> This produced some error messages during install, but Maker now works.
>
> I just wonder why the BioPerl installation did not work properly with either "./Build installdeps" or cpan>install Bundle::BioPerl.
>
> And why it worked this way on my desktop.
>
> Anyway
> Thanks!
>
>
> On 18.04.2016 at 22:30, Carson Holt wrote:
>> Try updating BioPerl (use the CPAN version and not the BioPerl-live version because it will fail).
Also use MAKER version 2.31.8 and not the 3.00.0-beta version. >> >> Then make sure there is not error further up. What you are seeing may be a snowball effect of the real error which could be several screens back in the text. If you are using GFF3 files as input then your format is probably incorrect. >> >> ?Carson >> >> >>> On Apr 18, 2016, at 7:13 AM, Kai Kamm wrote: >>> >>> Hi, >>> >>> while I have no problem running Maker on my desktop computer (Ubuntu 14.04 LTS), I always get the error below (for all contigs) when I try to run Maker on a server. >>> >>> >>> ------------- EXCEPTION: Bio::Root::Exception ------------- >>> MSG: Did not specify a Query End or Query Begin >>> STACK: Error::throw >>> STACK: Bio::Root::Root::throw /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Root/Root.pm:449 >>> STACK: Bio::Search::HSP::GenericHSP::_query_seq_feature /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Search/HSP/GenericHSP.pm:1525 >>> STACK: Bio::Search::HSP::GenericHSP::query /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Search/HSP/GenericHSP.pm:956 >>> STACK: Bio::Search::HSP::HSPI::start /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Search/HSP/HSPI.pm:504 >>> STACK: PhatHit_utils::add_offset /homes/biertank/kai/maker/bin/../lib/PhatHit_utils.pm:1462 >>> STACK: GI::parse_abinit_file /homes/biertank/kai/maker/bin/../lib/GI.pm:1199 >>> STACK: Process::MpiChunk::_go /homes/biertank/kai/maker/bin/../lib/Process/MpiChunk.pm:1469 >>> STACK: Process::MpiChunk::run /homes/biertank/kai/maker/bin/../lib/Process/MpiChunk.pm:341 >>> STACK: main::node_thread /homes/biertank/kai/maker/bin/maker:1454 >>> STACK: threads::new /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/forks.pm:799 >>> STACK: /homes/biertank/kai/maker/bin/maker:914 >>> ----------------------------------------------------------- >>> --> rank=2, hostname=bioinf.uni-leipzig.de >>> ERROR: Failed while gathering ab-init output files >>> ERROR: Chunk failed at level:1, tier_type:2 >>> FAILED 
CONTIG:scaffold20_cov246
>>>
>>> ERROR: Chunk failed at level:4, tier_type:0
>>> FAILED CONTIG:scaffold20_cov246
>>>
>>> examining contents of the fasta file and run log
>>>
>>>
>>>
>>> I have tried to rerun "perl ./Build.PL" and then "./Build install" several times using different versions of Perl. To install the required Perl modules I have used "./Build installdeps" and I also tried installing the dependencies manually via CPAN - to no avail.
>>>
>>> Any idea?
>>>
>>> Thank you!
>>> Kai
>>>
>>> _______________________________________________
>>> maker-devel mailing list
>>> maker-devel at yandell-lab.org
>>> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org
>>
>
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at yandell-lab.org
> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org

From carsonhh at gmail.com Tue Apr 19 10:19:10 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 19 Apr 2016 09:19:10 -0600
Subject: [maker-devel] Maker Failed Contigs, Bio::Root::Exception
In-Reply-To: <8B4352FC-113E-45EC-B7D4-6983B8FF2815@gmail.com>
References: <5714DD6A.1080309@ecolevol.de> <5249E98C-9902-4369-9B68-95F3662B61CE@gmail.com> <57162665.7070409@ecolevol.de> <8B4352FC-113E-45EC-B7D4-6983B8FF2815@gmail.com>
Message-ID: <05666A0C-902E-4493-B107-9BE1BAF8A507@gmail.com>

FYI, BioPerl-live is not broken. Rather, it is under active development and as such cannot be considered stable.

--Carson

> On Apr 19, 2016, at 9:18 AM, Carson Holt wrote:
>
> Install like so:
> cpan> install Bio::Perl
>
> But it sounds like you've got a proper version now. Most likely you had a non-CPAN version of BioPerl installed. The version it reported met the ./Build dependency requirements, but it was really a broken version. This happens if you have BioPerl-live installed, for example.
>
> --Carson
>
>
>
>> On Apr 19, 2016, at 6:36 AM, Kai Kamm wrote:
>>
>> Hello,
>>
>> now it seems to work.
I (re)installed BioPerl like so: >> >> ------------------------------------------------------------ >> find the name of the latest BioPerl package: >> >> cpan>d /bioperl/ >> >> .... >> >> Distribution CJFIELDS/BioPerl-1.6.901.tar.gz >> Distribution CJFIELDS/BioPerl-1.6.922.tar.gz >> Distribution CJFIELDS/BioPerl-1.6.924.tar.gz >> >> And install the most recent: >> >> cpan>install CJFIELDS/BioPerl-1.6.924.tar.gz >> ---------------------------------------------------------------- >> >> Produced some error messages during install, but Maker now works. >> >> Just wonder why the BioPerl installation did not work properly with neither "./Build installdeps" nor via cpan>install Bundle::BioPerl. >> >> And why it worked this way on my desktop. >> >> Anyway >> Thanks! >> >> >> Am 18.04.2016 um 22:30 schrieb Carson Holt: >>> Try updating BioPerl (use the CPAN version and not the BioPerl-live version because it will fail). Also use MAKER version 2.31.8 and not the 3.00.0-beta version. >>> >>> Then make sure there is not error further up. What you are seeing may be a snowball effect of the real error which could be several screens back in the text. If you are using GFF3 files as input then your format is probably incorrect. >>> >>> ?Carson >>> >>> >>>> On Apr 18, 2016, at 7:13 AM, Kai Kamm wrote: >>>> >>>> Hi, >>>> >>>> while I have no problem running Maker on my desktop computer (Ubuntu 14.04 LTS), I always get the error below (for all contigs) when I try to run Maker on a server. 
>>>> Kai
>>>>
>>>> _______________________________________________
>>>> maker-devel mailing list
>>>> maker-devel at yandell-lab.org
>>>> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org
>>>
>>
>>
>> _______________________________________________
>> maker-devel mailing list
>> maker-devel at yandell-lab.org
>> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org
>

From cjfields at illinois.edu Tue Apr 19 11:11:06 2016
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Tue, 19 Apr 2016 16:11:06 +0000
Subject: [maker-devel] Maker Failed Contigs, Bio::Root::Exception
In-Reply-To: <05666A0C-902E-4493-B107-9BE1BAF8A507@gmail.com>
References: <5714DD6A.1080309@ecolevol.de> <5249E98C-9902-4369-9B68-95F3662B61CE@gmail.com> <57162665.7070409@ecolevol.de> <8B4352FC-113E-45EC-B7D4-6983B8FF2815@gmail.com> <05666A0C-902E-4493-B107-9BE1BAF8A507@gmail.com>
Message-ID:

Yup. Though Bio-Root has been added back (which IIRC was the main problem with breakage on the master branch).

chris

> On Apr 19, 2016, at 10:19 AM, Carson Holt wrote:
>
> FYI, BioPerl-live is not broken. Rather, it is under active development and as such cannot be considered stable.
>
> --Carson
>
>> On Apr 19, 2016, at 9:18 AM, Carson Holt wrote:
>>
>> Install like so:
>> cpan> install Bio::Perl
>>
>> But it sounds like you've got a proper version now. Most likely you had a non-CPAN version of BioPerl installed. The version it reported met the ./Build dependency requirements, but it was really a broken version. This happens if you have BioPerl-live installed, for example.
>>
>> --Carson
>>
>>
>>
>>> On Apr 19, 2016, at 6:36 AM, Kai Kamm wrote:
>>>
>>> Hello,
>>>
>>> now it seems to work. I (re)installed BioPerl like so:
>>>
>>> ------------------------------------------------------------
>>> Find the name of the latest BioPerl package:
>>>
>>> cpan>d /bioperl/
>>>
>>> ....
>>> >>> Distribution CJFIELDS/BioPerl-1.6.901.tar.gz >>> Distribution CJFIELDS/BioPerl-1.6.922.tar.gz >>> Distribution CJFIELDS/BioPerl-1.6.924.tar.gz >>> >>> And install the most recent: >>> >>> cpan>install CJFIELDS/BioPerl-1.6.924.tar.gz >>> ---------------------------------------------------------------- >>> >>> Produced some error messages during install, but Maker now works. >>> >>> Just wonder why the BioPerl installation did not work properly with neither "./Build installdeps" nor via cpan>install Bundle::BioPerl. >>> >>> And why it worked this way on my desktop. >>> >>> Anyway >>> Thanks! >>> >>> >>> Am 18.04.2016 um 22:30 schrieb Carson Holt: >>>> Try updating BioPerl (use the CPAN version and not the BioPerl-live version because it will fail). Also use MAKER version 2.31.8 and not the 3.00.0-beta version. >>>> >>>> Then make sure there is not error further up. What you are seeing may be a snowball effect of the real error which could be several screens back in the text. If you are using GFF3 files as input then your format is probably incorrect. >>>> >>>> ?Carson >>>> >>>> >>>>> On Apr 18, 2016, at 7:13 AM, Kai Kamm wrote: >>>>> >>>>> Hi, >>>>> >>>>> while I have no problem running Maker on my desktop computer (Ubuntu 14.04 LTS), I always get the error below (for all contigs) when I try to run Maker on a server. 
>>>>> Kai
>>>>>
>>>>> _______________________________________________
>>>>> maker-devel mailing list
>>>>> maker-devel at yandell-lab.org
>>>>> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org
>>>>
>>>
>>>
>>> _______________________________________________
>>> maker-devel mailing list
>>> maker-devel at yandell-lab.org
>>> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org
>>
>
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at yandell-lab.org
> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org

From bmoore at genetics.utah.edu Tue Apr 19 16:36:35 2016
From: bmoore at genetics.utah.edu (Barry Moore)
Date: Tue, 19 Apr 2016 21:36:35 +0000
Subject: [maker-devel] A way to compare 2 annotation runs?
In-Reply-To: <148F7C5C-38D9-4F2C-BA0C-010E133B7F34@gmail.com>
References: <57161FB2.30901@students.uni-mainz.de> <148F7C5C-38D9-4F2C-BA0C-010E133B7F34@gmail.com>
Message-ID:

The Sequence Ontology provides some tools for this:

SOBAcl has some pre-configured reports/graphs with some flexibility to modify their content/layout.
https://github.com/The-Sequence-Ontology/SOBA

This simple example provides a table for two GFF3 files of the count of feature types:

    SOBAcl --columns file --rows type --data type --data_type count \
        data/dmel-all-r5.30_0001000.gff data/dmel-all-r5.30_0010000.gff

More complex examples are available in the test file SOBA/t/sobacl_test.sh

The GAL library is a perl library that works well with MAKER output and other valid GFF3 documents.
It has some scripts that would provide metrics along the lines of what you're looking for, but it is primarily a programming library to make it easy to roll your own:
https://github.com/The-Sequence-Ontology/GAL

If you're OK with a little bit of Perl code, then by modifying the synopsis code in the README a bit you can easily produce the splice complexity metrics described here (http://www.ncbi.nlm.nih.gov/pubmed/19236712):

    use GAL::Annotation;

    # Load the annotation (GFF3) together with its genome sequence (FASTA)
    my $annot    = GAL::Annotation->new(qw(file.gff file.fasta));
    my $features = $annot->features;

    # Iterate over all gene features; print each ID and its splice complexity
    my $genes = $features->search( {type => 'gene'} );
    while (my $gene = $genes->next) {
        print $gene->feature_id . "\t";
        print $gene->splice_complexity . "\n";
    }

Hope that helps,

Barry

On Apr 19, 2016, at 9:08 AM, Carson Holt wrote:

I'm going to ask Michael Campbell to answer this. He wrote a protocols paper that will help.

--Carson

On Apr 19, 2016, at 6:08 AM, Florian wrote:

Hello All,

We ran MAKER on a newly assembled genome for 3 iterations, since 2 seems to be the recommended standard, and while on holiday I just ran it a third time. Now I want to compare the results of the iterations to see where the annotation (hopefully) improved/changed, but I can't really come up with a clever way to do this.

I reckon this has to be an often-solved problem, though I couldn't find a solution except an older entry in this mailing list, and that wasn't helpful.

So how are people assessing the quality of a MAKER run? How do you say one run was 'better' than another?

best regards & thanks for your input,
Florian

_______________________________________________
maker-devel mailing list
maker-devel at yandell-lab.org
http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From MEC at stowers.org Tue Apr 19 16:44:04 2016
From: MEC at stowers.org (Cook, Malcolm)
Date: Tue, 19 Apr 2016 21:44:04 +0000
Subject: [maker-devel] A way to compare 2 annotation runs?
In-Reply-To:
References: <57161FB2.30901@students.uni-mainz.de> <148F7C5C-38D9-4F2C-BA0C-010E133B7F34@gmail.com>
Message-ID:

Just a quick thought:

The smallest summary of what you're after might be the Jaccard statistic (intersection over union) between your annotations, as computed by bedtools:
http://bedtools.readthedocs.org/en/latest/content/tools/jaccard.html

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From fdolze at students.uni-mainz.de Mon Apr 25 10:05:58 2016
From: fdolze at students.uni-mainz.de (Florian)
Date: Mon, 25 Apr 2016 17:05:58 +0200
Subject: [maker-devel] A way to compare 2 annotation runs?
In-Reply-To: <3F81B34E-B7FC-4E37-AFA2-514AB2A397F1@cshl.edu>
References: <57161FB2.30901@students.uni-mainz.de> <148F7C5C-38D9-4F2C-BA0C-010E133B7F34@gmail.com> <571E05B8.5080508@students.uni-mainz.de> <3F81B34E-B7FC-4E37-AFA2-514AB2A397F1@cshl.edu>
Message-ID: <571E3256.90705@students.uni-mainz.de>

Hi Mike,

We have run MAKER with keep_preds=0. For completeness I attached the options file we used. We used a SNAP model trained on CEGMA data, GeneMark and Augustus trained with their webservice for the first run and then iterated on the results.

We expect around 17,000-18,000 genes, but our annotation contains ~12.5k according to SOBAcl. If I remove the ~40% with AED values of 1, I will be left with very few compared to the expected number.

type X file type (count)
=========================================================================================================
|                        |../v2_second_round_functional_blast.gff|../v2_third_round_functional_blast.gff|
=========================================================================================================
|CDS                     |                                  63953|                                 65160|
+------------------------+---------------------------------------+--------------------------------------+
|contig                  |                                   5292|                                  5292|
+------------------------+---------------------------------------+--------------------------------------+
|exon                    |                                  60381|                                 61233|
+------------------------+---------------------------------------+--------------------------------------+
|expressed_sequence_match|                                 275160|                                275160|
+------------------------+---------------------------------------+--------------------------------------+
|five_prime_UTR          |                                   9424|                                  8764|
+------------------------+---------------------------------------+--------------------------------------+
|gene                    |                                  12654|                                 12235|
+------------------------+---------------------------------------+--------------------------------------+
|mRNA                    |                                  13698|                                 13137|
+------------------------+---------------------------------------+--------------------------------------+
|match                   |                                 146111|                                136852|
+------------------------+---------------------------------------+--------------------------------------+
|match_part              |                                1704978|                               1697601|
+------------------------+---------------------------------------+--------------------------------------+
|protein_match           |                                 421814|                                421814|
+------------------------+---------------------------------------+--------------------------------------+
|three_prime_UTR         |                                   6894|                                  6325|
---------------------------------------------------------------------------------------------------------

regards,
Florian

On 25.04.2016 16:16, Campbell, Michael wrote:
> Hi Florian,
>
> You're not off topic here. I've attached the paper.
>
> Looking at the plot you sent, I'm guessing that there is a red dot right underneath the turquoise dot at (1,1); that would be consistent with the compare annotation script output. Do you have keep_preds=1 set in the maker_opts.ctl file? If so, that would explain the abundance of AED=1 annotations. When keep_preds is set to 1, all of the gene predictions are reported as gene models; when keep_preds is set to 0, only the models with evidence support are reported. Also, how many genes are you expecting and how many are you getting?
>
> The paper I attached goes over different approaches to building final gene sets. The plot attached suggests to me that you have a bunch of unsupported gene models that need to be cleaned out. I will commonly filter out any gene model with an AED of 1 unless it has a protein family domain. This will almost certainly bring the fraction of annotated gene models with an AED <0.5 up to around 90% or more.
>
> As annotations improve you do usually see fewer total genes but they are longer.
>
> One of the best ways to get a feel for annotation quality is to load the annotations into a browser like Apollo or JBrowse and look at a few of your favorite genes.
>
> Thanks,
> Mike
>
>
> On Apr 25, 2016, at 7:55 AM, Florian wrote:
>
> Hello All,
>
> First off, thank you all for your input! I took a look at all your suggestions and have some questions:
>
> The SOBAcl tool is nice, but I can't seem to find a way to get to the AED values MAKER produces. For example, here is a line from my GFF file:
>
> scaffold2278_size3634 maker mRNA 124 2128 . - . ID=CRIP_012390-RA;Parent=CRIP_012390;Name=CRIP_012390-RA;Alias=maker-scaffold2278_size3634-augustus-gene-0.3-mRNA-1;_AED=0.16;_QI=0|1|0.33|1|1|1|3|0|574;_eAED=0.16;Note=Similar to Tbc1d25: TBC1 domain family member 25 (Mus musculus);
>
> Notice the _AED entry is in the 9th "field", combined with all the other descriptive data. Is there a way to get to this? The information about the number and mean/distribution of length of genes, while certainly valuable, is hard for me to interpret. How would one classify improvement? More genes annotated? Fewer genes but longer averages?
>
> For the moment I will take a look at GAL, though Perl is not my strongest language.
>
>
> For the scripts Michael provided I have attached the results. It would be great if you could send me a pdf version of the paper you mentioned.
>
> The comparison script lists SN/SP/AC with >98%, which indicates there should be no big changes between annotations, right? But the cumulative AED graph shows a LOT of entries have an AED value of 1, which would indicate the exact opposite?
>
> You said 95% with less than 0.5 AED would be pretty good, so only ~55% would mean this is a pretty bad annotation?
>
> I am not sure if this is maybe too far off topic for the maker mailing list, but thank you for any clarification / input.
>
>
>
> kind regards,
> Florian
>
> On 20.04.2016 15:16, Campbell, Michael wrote:
>
> I suspect the Jaccard distance would let you see the annotation sets converging over iterations. The distance between run one and run three should be greater than the distance between run one and two or run two and three.
>
> MAKER calculates a modified Jaccard distance between the MAKER generated gene models and the aligned evidence called Annotation Edit Distance or AED. Comparing the distribution of AEDs between annotations is a way to tell which annotation set matches the evidence the best. As a rule of thumb an annotation set is pretty good if greater than ~95% of the annotations have an AED less than 0.5.
>
> There is an accessory script in the MAKER bin called AED_cdf_generator.pl that helps in comparing AED scores. This script is mentioned in the protocols paper Carson mentioned. This paper also describes using protein family domains and homology to manually curated proteins in swissprot as quality metrics. Here is a link to the paper. Let me know if you need me to send you a pdf.
> http://onlinelibrary.wiley.com/doi/10.1002/0471250953.bi0411s48/abstract
>
> I also have a "use at your own risk" script on github that I use to compare MAKER runs two at a time. The script is called compare_annotations_3.2.pl. This particular script has had a long evolution, so it is a little hard to follow the code, but it might be helpful.
> https://github.com/mscampbell/Genome_annotation
>
> The SOBA tool that Barry mentioned is a lot more flexible, and if you are familiar with Perl the GAL library does a lot of heavy lifting for you.
>
> Mike
> On Apr 19, 2016, at 5:44 PM, Cook, Malcolm wrote:
>
> Just a quick thought:
>
> The smallest summary of what you're after might be the Jaccard statistic (intersection over union) between your annotations, as computed by bedtools:
> http://bedtools.readthedocs.org/en/latest/content/tools/jaccard.html
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts_run2.log
Type: text/x-log
Size: 4937 bytes
Desc: not available
URL:

From carsonhh at gmail.com Mon Apr 25 10:30:24 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 25 Apr 2016 09:30:24 -0600
Subject: [maker-devel] A way to compare 2 annotation runs?
In-Reply-To: <571E3256.90705@students.uni-mainz.de>
References: <57161FB2.30901@students.uni-mainz.de> <148F7C5C-38D9-4F2C-BA0C-010E133B7F34@gmail.com> <571E05B8.5080508@students.uni-mainz.de> <3F81B34E-B7FC-4E37-AFA2-514AB2A397F1@cshl.edu> <571E3256.90705@students.uni-mainz.de>
Message-ID: <8287C01C-93C2-4BCA-9483-4EEE0E584ACD@gmail.com>

If you're running with keep_preds=0, then either you passed in models with model_gff (always kept, even without evidence support), or you are parsing out the AED of features that are not gene models from the GFF3 when building your CDF graph. If that is the case, make sure you only pull AED off of features labeled as mRNA in column 3 (the type column) of the GFF3 and not off match features.

--Carson

From fdolze at students.uni-mainz.de Mon Apr 25 12:00:15 2016
From: fdolze at students.uni-mainz.de (Dolze, Florian)
Date: Mon, 25 Apr 2016 17:00:15 +0000
Subject: [maker-devel] A way to compare 2 annotation runs?
In-Reply-To: <8287C01C-93C2-4BCA-9483-4EEE0E584ACD@gmail.com>
References: <57161FB2.30901@students.uni-mainz.de> <148F7C5C-38D9-4F2C-BA0C-010E133B7F34@gmail.com> <571E05B8.5080508@students.uni-mainz.de> <3F81B34E-B7FC-4E37-AFA2-514AB2A397F1@cshl.edu> <571E3256.90705@students.uni-mainz.de>, <8287C01C-93C2-4BCA-9483-4EEE0E584ACD@gmail.com>
Message-ID: <1B00E9C3-490A-4C06-A188-1F9EBC02F680@students.uni-mainz.de>

This might be the case; I simply used the script on my complete output GFF with all features in it, without manually filtering for only mRNA.

On a side note regarding keep_preds: if I wanted to call genes somewhat less stringently because I expect to find more, would I set this to e.g. 0.5 to increase the number of genes called?
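
The filtering Carson describes can be sketched in a few lines. This is only a minimal illustration in Python (not one of the tools named in the thread, and not how AED_cdf_generator.pl is implemented): it keeps rows whose type column (column 3) is mRNA, reads _AED from the column 9 attributes, and reports the fraction of transcripts below a cutoff. The function names are made up for this example, the mRNA sample row is abbreviated from the GFF line quoted in the thread, and the match row is hypothetical.

```python
def parse_attributes(field):
    """Split a GFF3 column-9 string into a dict of key -> value."""
    attrs = {}
    for pair in field.strip().strip(";").split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def mrna_aeds(lines):
    """Collect _AED values from mRNA features only (type is column 3)."""
    aeds = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue  # skip match features, genes, exons, FASTA lines, ...
        attrs = parse_attributes(cols[8])
        if "_AED" in attrs:
            aeds.append(float(attrs["_AED"]))
    return aeds


def fraction_below(aeds, cutoff):
    """Fraction of AED values strictly below the cutoff (0.0 if empty)."""
    if not aeds:
        return 0.0
    return sum(1 for a in aeds if a < cutoff) / len(aeds)


# Sample rows: the mRNA row is abbreviated from the thread's example,
# the match row is a hypothetical unsupported prediction.
example = [
    "##gff-version 3",
    "scaffold2278_size3634\tmaker\tgene\t124\t2128\t.\t-\t.\tID=CRIP_012390",
    "scaffold2278_size3634\tmaker\tmRNA\t124\t2128\t.\t-\t.\t"
    "ID=CRIP_012390-RA;Parent=CRIP_012390;_AED=0.16;_eAED=0.16;",
    "scaffold2278_size3634\tsnap_masked\tmatch\t200\t900\t.\t-\t.\t"
    "ID=hypothetical_match_1;_AED=1.00",
]

aeds = mrna_aeds(example)
print(aeds)                       # [0.16]
print(fraction_below(aeds, 0.5))  # 1.0
```

With a real MAKER GFF3 one would pass an open file handle instead of the example list; the match row above shows why reading _AED from every feature inflates the count of AED=1 entries.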
-Florian

> On 25.04.2016 at 17:30, Carson Holt wrote:
>
> If you're running with keep_preds=0, then you either passed in models with model_gff (always kept even without evidence support), or you are parsing out the AED of non-gene reference models from the GFF3 when building your CDF graph. If that is the case, make sure you only pull AED off of features labeled as mRNA in column 3 of the GFF3 and not match features.
>
> --Carson
>
>> On Apr 25, 2016, at 9:05 AM, Florian wrote:
>>
>> Hi Mike,
>>
>> We have run MAKER with keep_preds=0. For completeness I attached the options file we used. We used a SNAP model trained on CEGMA data, GeneMark and Augustus trained with their webservice for the first run and then iterated on the results.
>>
>> We expect around 17,000-18,000 genes, but our annotation contains ~12.5k according to SOBAcl. If I remove ~40% with AED values of 1 I will be left with very few compared to the expected number.
>>
>> type X file type (count)
>> =========================================================================================================
>> |                        |../v2_second_round_functional_blast.gff|../v2_third_round_functional_blast.gff|
>> =========================================================================================================
>> |CDS                     |                                  63953|                                 65160|
>> +------------------------+---------------------------------------+--------------------------------------+
>> |contig                  |                                   5292|                                  5292|
>> +------------------------+---------------------------------------+--------------------------------------+
>> |exon                    |                                  60381|                                 61233|
>> +------------------------+---------------------------------------+--------------------------------------+
>> |expressed_sequence_match|                                 275160|                                275160|
>> +------------------------+---------------------------------------+--------------------------------------+
>> |five_prime_UTR          |                                   9424|                                  8764|
>> +------------------------+---------------------------------------+--------------------------------------+
>> |gene                    |                                  12654|                                 12235|
>> +------------------------+---------------------------------------+--------------------------------------+
>> |mRNA                    |                                  13698|                                 13137|
>> +------------------------+---------------------------------------+--------------------------------------+
>> |match                   |                                 146111|                                136852|
>> +------------------------+---------------------------------------+--------------------------------------+
>> |match_part              |                                1704978|                               1697601|
>> +------------------------+---------------------------------------+--------------------------------------+
>> |protein_match           |                                 421814|                                421814|
>> +------------------------+---------------------------------------+--------------------------------------+
>> |three_prime_UTR         |                                   6894|                                  6325|
>> ---------------------------------------------------------------------------------------------------------
>>
>> regards,
>> Florian
>>
>>> On 25.04.2016 16:16, Campbell, Michael wrote:
>>> Hi Florian,
>>>
>>> You're not off topic here. I've attached the paper.
>>>
>>> Looking at the plot you sent, I'm guessing that there is a red dot right underneath the turquoise dot at (1,1); that would be consistent with the compare annotation script output. Do you have keep_preds=1 set in the maker_opts.ctl file? If so, that would explain the abundance of AED=1 annotations. When keep_preds is set to 1, all of the gene predictions are reported as gene models; when keep_preds is set to 0, only the models with evidence support are reported. Also, how many genes are you expecting and how many are you getting?
>>>
>>> The paper I attached goes over different approaches to building final gene sets. The plot attached suggests to me that you have a bunch of unsupported gene models that need to be cleaned out. I will commonly filter out any gene model with an AED of 1 unless it has a protein family domain. This will almost certainly bring the fraction of annotated gene models with an AED <0.5 up to around 90% or more.
>>>
>>> As annotations improve you do usually see fewer total genes, but they are longer.
>>>
>>> One of the best ways to get a feel for annotation quality is to load the annotations into a browser like Apollo or JBrowse and look at a few of your favorite genes.
>>>
>>> Thanks,
>>> Mike
>>>
>>> On Apr 25, 2016, at 7:55 AM, Florian wrote:
>>>
>>> Hello All,
>>>
>>> First off, thank you all for your input! I took a look at all your suggestions and have some questions:
>>>
>>> The SOBAcl tool is nice, but I can't seem to find a way to get to the AED values MAKER produces. For example, here is a line from my GFF file:
>>>
>>> scaffold2278_size3634 maker mRNA 124 2128 . - . ID=CRIP_012390-RA;Parent=CRIP_012390;Name=CRIP_012390-RA;Alias=maker-scaffold2278_size3634-augustus-gene-0.3-mRNA-1;_AED=0.16;_QI=0|1|0.33|1|1|1|3|0|574;_eAED=0.16;Note=Similar to Tbc1d25: TBC1 domain family member 25 (Mus musculus);
>>>
>>> Notice the _AED entry is in the 9th "field", combined with all the other descriptive data. Is there a way to get to this? The information about the number and mean/distribution of gene lengths, while certainly valuable, is hard for me to interpret. How would one classify improvement? More genes annotated? Fewer genes but longer averages?
>>>
>>> For the moment I will take a look at GAL, though Perl is not my strongest language.
>>>
>>> For the scripts Michael provided I have attached the results. It would be great if you could send me a PDF version of the paper you mentioned.
>>>
>>> The comparison script lists SN/SP/AC with >98%, which indicates there should be no big changes between annotations, right? But the cumulative AED graph shows that a LOT of entries have an AED value of 1, which would indicate the exact opposite?
>>>
>>> You said 95% with less than 0.5 AED would be pretty good, so only ~55% would mean this is a pretty bad annotation?
>>>
>>> I am not sure if this is maybe too far off topic for the maker mailing list, but thank you for any clarification / input.
>>>
>>> kind regards,
>>> Florian
>>>
>>> On 20.04.2016 15:16, Campbell, Michael wrote:
>>>
>>> I suspect the Jaccard distance would let you see the annotation sets converging over iterations. The distance between run one and run three should be greater than the distance between run one and two or run two and three.
>>>
>>> MAKER calculates a modified Jaccard distance between the MAKER-generated gene models and the aligned evidence, called Annotation Edit Distance or AED. Comparing the distribution of AEDs between annotations is a way to tell which annotation set matches the evidence the best. As a rule of thumb, an annotation set is pretty good if greater than ~95% of the annotations have an AED less than 0.5.
>>>
>>> There is an accessory script in the MAKER bin called AED_cdf_generator.pl that helps in comparing AED scores. This script is mentioned in the protocols paper Carson mentioned. This paper also describes using protein family domains and homology to manually curated proteins in SwissProt as quality metrics. Here is a link to the paper. Let me know if you need me to send you a PDF.
>>> http://onlinelibrary.wiley.com/doi/10.1002/0471250953.bi0411s48/abstract
>>>
>>> I also have a "use at your own risk" script on GitHub that I use to compare MAKER runs two at a time. The script is called compare_annotations_3.2.pl. This particular script has had a long evolution, so it is a little hard to follow the code, but it might be helpful.
>>> https://github.com/mscampbell/Genome_annotation
>>>
>>> The SOBA tool that Barry mentioned is a lot more flexible, and if you are familiar with Perl the GAL library does a lot of heavy lifting for you.
>>>
>>> Mike
>>>
>>> On Apr 19, 2016, at 5:44 PM, Cook, Malcolm wrote:
>>>
>>> Just a quick thought
>>>
>>> The smallest summary of what you're after might be the Jaccard difference between your annotations as computed by bedtools: http://bedtools.readthedocs.org/en/latest/content/tools/jaccard.html
>>>
>>>
>>> From: maker-devel [mailto:maker-devel-bounces at yandell-lab.org] On Behalf Of Barry Moore
>>> Sent: Tuesday, April 19, 2016 4:37 PM
>>> To: Florian; maker-devel
>>> Cc: Campbell, Michael
>>> Subject: Re: [maker-devel] A way to compare 2 annotation runs?
>>>
>>> The Sequence Ontology provides some tools for this:
>>>
>>> SOBAcl has some pre-configured reports/graphs with some flexibility to modify their content/layout.
>>> https://github.com/The-Sequence-Ontology/SOBA
>>>
>>> This simple example provides a table for two GFF3 files of the count of feature types:
>>>
>>> SOBAcl --columns file --rows type --data type --data_type count \
>>>     data/dmel-all-r5.30_0001000.gff data/dmel-all-r5.30_0010000.gff
>>>
>>> More complex examples are available in the test file SOBA/t/sobacl_test.sh
>>>
>>> The GAL library is a Perl library that works well with MAKER output and other valid GFF3 documents. It has some scripts that would provide metrics along the lines of what you're looking for, but it is primarily a programming library to make it easy to roll your own.
>>> https://github.com/The-Sequence-Ontology/GAL
>>>
>>> If you're OK with a little bit of Perl code, modifying the synopsis code in the README a bit makes the splice complexity metrics described here (http://www.ncbi.nlm.nih.gov/pubmed/19236712) easy to produce:
>>>
>>> use GAL::Annotation;
>>>
>>> # Load the annotation (GFF3 plus the matching FASTA).
>>> my $annot = GAL::Annotation->new(qw(file.gff file.fasta));
>>> my $features = $annot->features;
>>>
>>> # Print one line per gene: feature ID, tab, splice complexity.
>>> my $genes = $features->search( {type => 'gene'} );
>>> while (my $gene = $genes->next) {
>>>     print $gene->feature_id . "\t";
>>>     print $gene->splice_complexity . "\n";
>>> }
>>>
>>> Hope that helps,
>>>
>>> Barry
>>>
>>> On Apr 19, 2016, at 9:08 AM, Carson Holt wrote:
>>>
>>> I'm going to ask Michael Campbell to answer this. He wrote a protocols paper that will help.
>>>
>>> --Carson
>>>
>>> On Apr 19, 2016, at 6:08 AM, Florian wrote:
>>>
>>> Hello All,
>>>
>>> We ran MAKER on a newly assembled genome for 3 iterations, since 2 seems to be the recommended standard and while on holiday I just ran it a third time. Now I want to compare the results of the iterations to see where the annotation (hopefully) improved/changed, but I can't really come up with a clever way to do this.
>>>
>>> I reckon this has to be an often-solved problem, though I couldn't find a solution except an older entry in this mailing list, and that wasn't helpful.
>>>
>>> So how are people assessing the quality of a MAKER run? How do you say one run was 'better' than another?
>>>
>>> best regards & thanks for your input,
>>> Florian
>>>
>>> _______________________________________________
>>> maker-devel mailing list
>>> maker-devel at yandell-lab.org
>>> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org
>

From carsonhh at gmail.com Mon Apr 25 12:03:32 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 25 Apr 2016 11:03:32 -0600
Subject: [maker-devel] A way to compare 2 annotation runs?
In-Reply-To: <1B00E9C3-490A-4C06-A188-1F9EBC02F680@students.uni-mainz.de>
References: <57161FB2.30901@students.uni-mainz.de> <148F7C5C-38D9-4F2C-BA0C-010E133B7F34@gmail.com> <571E05B8.5080508@students.uni-mainz.de> <3F81B34E-B7FC-4E37-AFA2-514AB2A397F1@cshl.edu> <571E3256.90705@students.uni-mainz.de> <8287C01C-93C2-4BCA-9483-4EEE0E584ACD@gmail.com> <1B00E9C3-490A-4C06-A188-1F9EBC02F680@students.uni-mainz.de>
Message-ID:

keep_preds can be set to 0 or 1 right now. By definition anything not kept has an AED of 1, so you really only turn it on or off. There had been discussion about doing something more complex for when multiple gene predictors are present and support each other. But for now it is an on/off parameter.

--Carson

> On Apr 25, 2016, at 11:00 AM, Dolze, Florian wrote:
>
> This might be the case; I simply ran the script on my complete output GFF, with all features in it, without manually filtering for mRNA only.
>
> On a side note regarding keep_preds: if I wanted to call genes somewhat less stringently because I am expecting to find more, would I set this to e.g. 0.5 to increase the number of genes called?
>
> -Florian

From bmoore at genetics.utah.edu Mon Apr 25 14:46:23 2016
From: bmoore at genetics.utah.edu (Barry Moore)
Date: Mon, 25 Apr 2016 19:46:23 +0000
Subject: [maker-devel] A way to compare 2 annotation runs?
In-Reply-To: <571E05B8.5080508@students.uni-mainz.de>
References: <57161FB2.30901@students.uni-mainz.de> <148F7C5C-38D9-4F2C-BA0C-010E133B7F34@gmail.com> <571E05B8.5080508@students.uni-mainz.de>
Message-ID: <349A414A-BA65-420E-9A39-5B3583993AB9@genetics.utah.edu>

Hi Florian,

Something like this should work:

SOBAcl --data +_AED t/data/refseq_short.gff3 --data_type mean

Sorry, this feature was undocumented, and I discovered a bug in it while I was looking at it just now, so you'll need to pull an update from git for it to work correctly. Basically, if you add a '+' to the value passed to --data, SOBAcl will treat the --data value (with the + removed) as a key to look up the value in the attributes from column 9, so if +_AED is given on the command line then the value of the _AED attribute will be used for the summary statistics. Note that if the attribute is missing for a given feature then 0 is used as the value (which is of course different from treating it as NULL). Also note that if --data_type is count, then features that have the given attribute are counted regardless of the value of the attribute.

Just FYI, grabbing those values with a GAL script would look like this (untested):

use GAL::Annotation;

# Load the annotation from the GFF3 file.
my $annot = GAL::Annotation->new(qw(file.gff));
my $features = $annot->features;

# Print one line per mRNA: feature ID, tab, value of the _AED attribute.
my $mRNAs = $features->search( {type => 'mRNA'} );
while (my $mRNA = $mRNAs->next) {
    print $mRNA->feature_id;
    print "\t";
    print $mRNA->attribute_value('_AED');
    print "\n";
}

B
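
For readers who would rather not use Perl, the same attribute lookup can be sketched in Python. This is only a minimal sketch under stated assumptions and is not part of SOBA, GAL, or MAKER: it assumes a tab-separated GFF3 in which MAKER writes _AED into the attribute column (column 9), keeps only features whose type (column 3) is mRNA, and reports the cumulative fraction of AED values at or below a cutoff. The function names and the sample match line are made up for illustration; only the mRNA attribute keys come from the GFF line quoted in this thread.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column (key=value;key=value) into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def mrna_aeds(gff_lines):
    """Return the _AED values of mRNA features only (type is GFF3 column 3)."""
    aeds = []
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # the embedded sequence section follows; no more features
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        attrs = parse_attributes(cols[8])
        # Assumption: features without _AED are skipped rather than counted as 0.
        if "_AED" in attrs:
            aeds.append(float(attrs["_AED"]))
    return aeds


def cumulative_fraction(aeds, cutoff):
    """Fraction of AED values at or below the cutoff (one point of the CDF)."""
    if not aeds:
        return 0.0
    return sum(1 for aed in aeds if aed <= cutoff) / len(aeds)


# Illustrative input: one mRNA line and one hypothetical match line.
demo = [
    "##gff-version 3\n",
    "\t".join(["scaffold2278_size3634", "maker", "mRNA", "124", "2128", ".",
               "-", ".", "ID=CRIP_012390-RA;Parent=CRIP_012390;_AED=0.16"]) + "\n",
    "\t".join(["scaffold2278_size3634", "blastn", "expressed_sequence_match",
               "124", "500", ".", "-", ".", "ID=hypothetical_match;_AED=1.00"]) + "\n",
]
values = mrna_aeds(demo)
print(values, cumulative_fraction(values, 0.5))
```

Running mrna_aeds on the GFF3 of each MAKER round and comparing cumulative_fraction at 0.5 gives a rough check against the ~95% rule of thumb from earlier in the thread; note that the rule is phrased as "less than 0.5", while this sketch counts values at or below the cutoff.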
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From bmoore at genetics.utah.edu Mon Apr 25 22:04:23 2016
From: bmoore at genetics.utah.edu (Barry Moore)
Date: Tue, 26 Apr 2016 03:04:23 +0000
Subject: [maker-devel] BUSCO
References:
Message-ID: <6C6AD04A-CAC0-40CA-B3C3-C42E2D11945A@genetics.utah.edu>

I'm posting this message to the mailing list on behalf of Ian Misner. Ian, sorry your message and subscription request hasn't gone through.
The ISP that supports all of our mailing lists, including maker, is having issues with the mailman software that they can't seem to resolve, so we currently can't approve held messages or add new subscribers. We're in the process of working out a new mailing list option. Thanks for your patience!

Begin forwarded message:

Hello,

Are there any guidelines for using BUSCO to help train MAKER? CEGMA has been discontinued, but I used to use the cegma2zff.pl steps to use those proteins as a training step. BUSCO seems to train Augustus, but I'm not sure what file to pass from BUSCO to MAKER for this to be properly utilized. I didn't see anything specific about this in the archives.

-----
Ian Misner, Ph.D.
Computational Genomics Specialist
Contractor, Medical Science and Computing, Inc.
Bioinformatics and Computational Biosciences Branch (BCBB)
NIH/NIAID/OD/OSMO/OCICB
5601 Fishers Lane, Room 4A59
Rockville, MD 20892
Office: 301-761-6208
Mobile: 301-704-0151
Email: ian.misner at nih.gov
Web: BCBB Home Page
Twitter: @NIAIDBioIT

Disclaimer: The information in this e-mail and any of its attachments is confidential and may contain sensitive information. It should not be used by anyone who is not the original intended recipient. If you have received this e-mail in error please inform the sender and delete it from your mailbox or any other storage devices. National Institute of Allergy and Infectious Diseases shall not accept liability for any statements made that are sender's own and not expressly made on behalf of the NIAID by one of its representatives.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From bmoore at genetics.utah.edu Mon Apr 25 22:12:15 2016
From: bmoore at genetics.utah.edu (Barry Moore)
Date: Tue, 26 Apr 2016 03:12:15 +0000
Subject: [maker-devel] maker-revel mailing list problems
Message-ID: <7157D2ED-8F5A-4B62-BA71-6DF43831FC60@genetics.utah.edu>

Hi all,

Just wanted to give everyone a heads up that we're experiencing problems with our mailing list server. Our mailing lists are supplied by an external ISP, and the lists and support have been great for years, but lately the admin/moderator interface won't allow us to approve any messages flagged for moderation or approve any new subscribers. This won't affect most of you receiving this, as all non-moderated traffic seems to be unaffected, but if you notice problems please let one of the moderators know directly:

Carson Holt
Michael Campbell
Barry Moore

We're in the process of finding and migrating to a new mailing list server. We'll do our best to minimize disruption and let you know as soon as we have a new system in place.

Thanks for your patience.

Barry Moore

From xvazquezc at gmail.com Mon Apr 25 22:17:46 2016
From: xvazquezc at gmail.com (Xabier Vázquez Campos)
Date: Tue, 26 Apr 2016 13:17:46 +1000
Subject: [maker-devel] BUSCO
In-Reply-To: <6C6AD04A-CAC0-40CA-B3C3-C42E2D11945A@genetics.utah.edu>
References: <6C6AD04A-CAC0-40CA-B3C3-C42E2D11945A@genetics.utah.edu>
Message-ID:

Once Augustus is installed, BUSCO will generate the training files in the Augustus species folder. Afterwards you only need to indicate the species profile in the Maker config file as usual.

The BUSCO developers say that the long run produces a better profile and should be used if you run the program to train Augustus.

This is the command I used:

python3 BUSCO_v1.1b1.py -f -c 8 --long -o Genus_species -in /PATH/TO/ASSEMBLY/contigs.fa -l /PATH/TO/PROFILE/fungi -m genome

--
Xabier Vázquez-Campos, *PhD*
*Research Associate*
Water Research Centre
School of Civil and Environmental Engineering
The University of New South Wales
Sydney NSW 2052 AUSTRALIA
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From dence at genetics.utah.edu Wed Apr 27 13:16:28 2016
From: dence at genetics.utah.edu (Daniel Ence)
Date: Wed, 27 Apr 2016 18:16:28 +0000
Subject: [maker-devel] Maker example data for 2013 GMOD summer school
In-Reply-To: <8F27CEB4-B16B-4BDC-BA11-5FCCBD05BC3C@ucr.edu>
References: <1772AAA1-C6ED-4FCA-B4C9-39F522D3D076@genetics.utah.edu> <8F27CEB4-B16B-4BDC-BA11-5FCCBD05BC3C@ucr.edu>
Message-ID: <2572DB54-6C29-483E-AAAB-7626FEE76DFC@genetics.utah.edu>

Hi Qihua,

In the maker_opts.ctl file there is an option "cpus" that lets you tell blast to use more than 1 cpu. The comment for the line says that you should not set this higher than 1 when using MPI. I believe the reason is that each MPI thread runs blast on its own, so the number of cpus used will be the number of MPI threads times the number of cpus for blast, which can quickly get larger than the number of cpus available.

At the same time, it's usually not advisable to use tblastx to align large datasets because of the increased amount of time it takes. Are these RNAseq datasets from another species that you're using tblastx for?

Daniel Ence
Graduate Student
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330

> On Apr 27, 2016, at 12:06 PM, Qihua Liang wrote:
>
> Hi, Daniel
>
> I am using Maker to annotate cowpea genome for a while but now I am wondering if I could use multi-threads instead of single one? It has been running tblastx for such a long time using single thread.
> But I couldn't find such settings in documentations to assign multi-threads to run Maker. Is there such an option?
>
> Thank you
> Qihua
>
>> On Mar 30, 2016, at 2:17 PM, Daniel Ence wrote:
>>
>> Hi Qihua,
>>
>> I believe that most of the data we used in the tutorials are available in the maker/data directory, which is included in all maker distributions. Please let me know if that isn't the case.
>>
>> ~Daniel
>>
>> Daniel Ence
>> Graduate Student
>> Eccles Institute of Human Genetics
>> University of Utah
>> 15 North 2030 East, Room 2100
>> Salt Lake City, UT 84112-5330
>>
>>> On Mar 30, 2016, at 3:10 PM, Qihua Liang wrote:
>>>
>>> Hi Michael and Daniel,
>>>
>>> I am a graduate student in UC Riverside, and recently I am learning to use Maker for genome annotation. I was trying to find some tutorials to follow and practice on example data, and I found out that you were giving a talk on Maker during 2013 GMOD summer school and the tutorial of that is very detailed. Nice job!
>>>
>>> But example data under the folder you mentioned as ./maker/maker_course is not provided on the website and I am wondering if they are available to the public or not. If yes, could you send me those materials so that I could follow your tutorial to practice using Maker?
>>>
>>> Thank you
>>> Best
>>> Qihua

From carsonhh at gmail.com Wed Apr 27 13:17:22 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 27 Apr 2016 12:17:22 -0600
Subject: [maker-devel] Maker example data for 2013 GMOD summer school
In-Reply-To: <8F27CEB4-B16B-4BDC-BA11-5FCCBD05BC3C@ucr.edu>
References: <1772AAA1-C6ED-4FCA-B4C9-39F522D3D076@genetics.utah.edu> <8F27CEB4-B16B-4BDC-BA11-5FCCBD05BC3C@ucr.edu>
Message-ID: <5ED1E884-9203-4409-8298-39F1D19C0CC0@gmail.com>

Use maker with MPI. MPI does not just have to be on a cluster, it can be installed on a local computer or server (you probably already have it installed and don't realize it).
Instructions on how to set up MAKER with MPI are in the README and INSTALL files in the download.

Example command (on a single machine 16 core server):

mpiexec -n maker
mpiexec -n 16 maker

Run across multiple machines (ten 16 core servers):

mpiexec -hostfile -n maker
mpiexec -hostfile ip_list -n 160 maker

The second option requires a network mounted working directory accessible to all machines.

--Carson

From hcma at uci.edu Wed Apr 27 20:04:29 2016
From: hcma at uci.edu (hcma)
Date: Wed, 27 Apr 2016 18:04:29 -0700
Subject: [maker-devel] Augustus training for new species
Message-ID: <4c7e0e58e9b55798bd255238f8ff9ae2@uci.edu>

Hi,

I would like to use Maker to generate a set for training Augustus for a new species. The steps for training SNAP are well documented, but I am still confused as to how to train Augustus using the AugustusWeb.

I have used fathom and forge to generate 'export.ann' and 'export.dna'. So what I need to do next is to run zff2augustus_gbk.pl in the directory that has the export.ann and export.dna files?

Then I feed the train.gb file to AugustusWeb?

Please advise.

Thanks
Karen

From xvazquezc at gmail.com Wed Apr 27 20:14:35 2016
From: xvazquezc at gmail.com (Xabier Vázquez Campos)
Date: Thu, 28 Apr 2016 11:14:35 +1000
Subject: [maker-devel] Augustus training for new species
In-Reply-To: <4c7e0e58e9b55798bd255238f8ff9ae2@uci.edu>
References: <4c7e0e58e9b55798bd255238f8ff9ae2@uci.edu>
Message-ID:

Is it a plant genome?

If it isn't, use BUSCO. It will do the whole training in a single step: it takes your assembly fasta file and generates the species profile in the Augustus species folder. See previous thread: https://groups.google.com/forum/#!topic/maker-devel/vp8R06VVQGQ

If you have a plant genome, use the "zff2augustus_gbk.pl". I have this in my files:

This will take the export.dna generated by fathom and generate a *.gb file that will be used as "training gene structure file" in a new training submission in WebAugustus, but remember to give it a new name in the submission, e.g.
MYGENOME_v2, or Maker won't see the difference (same name)*:

perl PATH/TO/SCRIPT/zff2augustus_gbk.pl > MYGENOME.train.gb

*this applies if you do a re-run of Augustus within Maker

--
Xabier Vázquez-Campos, *PhD*
*Research Associate*
Water Research Centre
School of Civil and Environmental Engineering
The University of New South Wales
Sydney NSW 2052 AUSTRALIA
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From xvazquezc at gmail.com Wed Apr 27 20:55:13 2016
From: xvazquezc at gmail.com (Xabier Vázquez Campos)
Date: Thu, 28 Apr 2016 11:55:13 +1000
Subject: [maker-devel] error with ipr_update_gff ?
Message-ID:

Hi,

I'm following the steps in the post-processing of annotations from the 2014 GMOD tutorial, but when using ipr_update_gff I get a load of errors such as those below:

Use of uninitialized value $method in string eq at /share/apps/maker/2.31.6/bin/ipr_update_gff line 190, <$IN> line 228738.
Use of uninitialized value $gene_id in hash element at /share/apps/maker/2.31.6/bin/ipr_update_gff line 203, <$IN> line 228738.

Is this normal?

Thanks,
Xabier

--
Xabier Vázquez-Campos, *PhD*
*Research Associate*
Water Research Centre
School of Civil and Environmental Engineering
The University of New South Wales
Sydney NSW 2052 AUSTRALIA
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From jacqueline.atkins at nih.gov Thu Apr 28 13:55:30 2016
From: jacqueline.atkins at nih.gov (Atkins, Jacqueline (NIH/NIAID) [C])
Date: Thu, 28 Apr 2016 18:55:30 +0000
Subject: [maker-devel] Segmenation Error
Message-ID:

Hi Everyone,

I have a user who is reporting a segmentation fault. I am not really sure where to start, or whether this is related to config issues or to the way in which the software is being executed. Any advice would be greatly appreciated.

Here is the command:

mpiexec -n 50 maker maker_opts_run1.ctl maker_bopts.ctl maker_exe.ctl

--Next Contig--
examining contents of the fasta file and run log
examining contents of the fasta file and run log
[ai-hpcn063:99111] *** Process received signal ***
[ai-hpcn063:99111] Signal: Segmentation fault (11)
[ai-hpcn063:99111] Signal code: Address not mapped (1)
[ai-hpcn063:99111] Failing at address: (nil)
examining contents of the fasta file and run log
[ai-hpcn053:119610] *** Process received signal ***
[ai-hpcn053:119610] Signal: Segmentation fault (11)
[ai-hpcn053:119610] Signal code: Address not mapped (1)
[ai-hpcn053:119610] Failing at address: (nil)
[ai-hpcn053:119610] [ 0] /lib64/libc.so.6(+0x35a00)[0x2aaaab85ca00]
[ai-hpcn053:119610] *** End of error message ***
examining contents of the fasta file and run log
examining contents of the fasta file and run log
examining contents of the fasta file and run log
examining contents of the fasta file and run log
examining contents of the fasta file and run log
[ai-hpcn063:99111] [ 0] /lib64/libc.so.6(+0x35a00)[0x2aaaab85ca00]
[ai-hpcn063:99111] *** End of error message ***
examining contents of the fasta file and run log
examining contents of the fasta file and run log
examining contents of the fasta file and run log
examining contents of the fasta file and run log
examining contents of the fasta file and run log
examining contents of the fasta file and run log
examining contents of the fasta file and run log
examining contents of the fasta file and run log
examining contents of the fasta file and run log
examining contents of the fasta file and run log
examining contents of the fasta file and run log

___________________________________________
Jacqueline Atkins, Contractor
Sr. HPC Engineer
National Institute of Allergy and Infectious Diseases
SRA International Inc., A CSRA Company
office 301-451-9644, mobile 301-767-7110
5601 Fishers Lane, 6A60, Bethesda, MD 20852

Disclaimer: The information in this e-mail and any of its attachments is confidential and may contain sensitive information. It should not be used by anyone who is not the original intended recipient. If you have received this e-mail in error please inform the sender and delete it from your mailbox or any other storage devices. National Institute of Allergy and Infectious Diseases shall not accept liability for any statements made that are sender's own and not expressly made on behalf of the NIAID by one of its representatives.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From maker-devel at yandell-lab.org Fri Apr 29 13:54:07 2016
From: maker-devel at yandell-lab.org (maker-devel)
Date: Sat, 30 Apr 2016 00:24:07 +0530
Subject: [maker-devel] hi prnt
Message-ID:

A non-text attachment was scrubbed...
Name: not available
Type: multipart/alternative
Size: 1 bytes
Desc: not available
URL:
-------------- next part --------------
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From simon.blanchoud at otago.ac.nz Tue Apr 5 19:15:14 2016
From: simon.blanchoud at otago.ac.nz (Simon Blanchoud)
Date: Wed, 06 Apr 2016 00:15:14 -0000
Subject: [maker-devel] ncRNA predictions
Message-ID: <5704550C.8010602@otago.ac.nz>

Hi all,

I have been annotating ab initio my de novo assembly of the Botrylloides leachi genome with MAKER 2.31.8 for some time now (3rd round running as I write). For this last round, I also wanted to get some predictions for non-coding RNAs as mentioned in the maker_opts.ctl. Now that this (seems to) work properly, I thought I should share with you a few issues I faced.

First of all, both tRNAscan-SE and snoscan have really really limited documentation (which I know is none of your business), which makes things a bit trickier.

Second, snoscan requires an rRNA file to work (not very obvious from maker_opts.ctl), and it turns out that there is a hard-coded limit in snoscan of 100 sequences for that rRNA file (not that the error message is helpful either). Overall, this was not exactly practical as I'm assembling a de novo genome, and thus do not have these rRNA sequences. What I did (and it seems to work okay) was to pull out the closest sequences I could find from the Rfam database. By combining the information from their website on the RF families, the taxonomy.txt file and the corresponding fasta files (all from their FTP site), I extracted (for a eukaryotic organism, that is) one complete sequence for each subunit, i.e. RF00001, RF00002, RF01960 and RF02543. It turns out that pooling more than one sequence per family makes it extremely slow to run. You might know a better approach for getting such an rRNA file, but it does look like a pretty sound approach to me, and might deserve a comment in maker_opts.ctl.

Third, once snoscan was running, I ran into the same issue as https://groups.google.com/d/topic/maker-devel/E6BKjXx2ra0/discussion, i.e. the parsing of the snoscan output crashed. After (quite) some debugging, I found out that there is an issue in the creation of the hash table containing the hits. As I am not sure how you wanted to organize them originally, I made a wild guess and re-wrote this section of the Widget. So it might not group the hits as you wanted, but at least it now runs properly (and the output appears quite correct to me). I've attached the Widget.

Otherwise, thanks heaps for all the hard work, it's an amazing tool and it does work great!

Cheers,
Simon
-------------- next part --------------
A non-text attachment was scrubbed...
Name: snoscan.pm
Type: text/x-perl-script
Size: 8128 bytes
Desc: not available
URL:

From wangyugui.wei at gmail.com Sat Apr 9 10:35:22 2016
From: wangyugui.wei at gmail.com (Yugui Wang)
Date: Sat, 09 Apr 2016 15:35:22 -0000
Subject: [maker-devel] Segmentation fault of MKAER with openmpi on CentOS 7.2
Message-ID:

Hi.

Segmentation fault of MAKER with openmpi on CentOS 7.2. Both MAKER 2.31.8 and 3.00.0 beta have the same error.

$ mpirun -mca btl ^openib -n 4 maker
STATUS: Parsing control files...
STATUS: Processing and indexing input FASTA files...
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 39507 on node T620 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

$ file core.39505
core.39505: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/bin/perl /bio/hpc-bio/maker-3.00.0/bin/make

$ gdb /usr/bin/perl core.39505
(gdb) where
#0 0x00007f0e4a7d2060 in ?? ()
#1
#2 0x00007f0e4a7d2060 in ?? ()
#3
#4 0x00007f0e4bdfba50 in mca_btl_vader_component_progress () from /usr/lib64/openmpi/lib/openmpi/mca_btl_vader.so
#5 0x00007f0e63ec8eda in opal_progress () from /usr/lib64/openmpi/lib/libopen-pal.so.13
#6 0x00007f0e4a191ac5 in mca_pml_ob1_probe () from /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so
#7 0x00007f0e65b0dc06 in PMPI_Probe () from /usr/lib64/openmpi/lib/libmpi.so
#8 0x00007f0e59007020 in C_MPI_Recv (buf=buf@entry=0x4146b30, source=source@entry=-1, tag=tag@entry=1111) at MPI.xs:56
#9 0x00007f0e590071e3 in XS_Parallel__Application__MPI_C_MPI_Recv (my_perl=, cv=) at MPI.c:391
#10 0x00007f0e657ce39f in Perl_pp_entersub () from /usr/lib64/perl5/CORE/libperl.so
#11 0x00007f0e657c6b16 in Perl_runops_standard () from /usr/lib64/perl5/CORE/libperl.so
#12 0x00007f0e65763925 in perl_run () from /usr/lib64/perl5/CORE/libperl.so
#13 0x0000000000400d99 in main ()

$ echo $LD_PRELOAD
/usr/lib64/openmpi/lib/libmpi.so:
$ echo $OMPI_MCA_mpi_warn_on_fork
0
$ rpm -qa openmpi
openmpi-1.10.0-10.el7.x86_64
$ uname -a
Linux T620 3.10.0-327.13.1.el7.x86_64 #1 SMP Thu Mar 31 16:04:38 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
$ ulimit -a
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 1029973
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 102400
cpu time (seconds, -t) unlimited
max user processes (-u) 4096
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
$ mpiexec --version
mpiexec (OpenRTE) 1.10.0

Report bugs to http://www.open-mpi.org/community/help/
$

From h.lee12 at uq.edu.au Tue Apr 12 22:05:12 2016
From: h.lee12 at uq.edu.au (Jenny Lee)
Date: Wed, 13 Apr 2016 03:05:12 -0000
Subject: [maker-devel] Reformat maker gff3
Message-ID: <1460516670248.1644@uq.edu.au>

Hi all,

I would like to
update my maker gff3 file to only contain the genes I've decided to keep: all maker genes, plus a subset of ab initio genes (those which have interproscan hits). I would also like to exclude the repeats information and only retain the CDS, gene, exon and mRNA features, like the format we usually see in published data. I've been trying to do this manually and it gets messy. Any ideas?

Thanks a lot.

Regards,
Jenny Lee
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From mcampbel at cshl.edu Wed Apr 20 08:16:43 2016
From: mcampbel at cshl.edu (Campbell, Michael)
Date: Wed, 20 Apr 2016 13:16:43 -0000
Subject: [maker-devel] A way to compare 2 annotation runs?
In-Reply-To:
References: <57161FB2.30901@students.uni-mainz.de> <148F7C5C-38D9-4F2C-BA0C-010E133B7F34@gmail.com>
Message-ID:

I suspect the Jaccard distance would let you see the annotation sets converging over iterations. The distance between run one and run three should be greater than the distance between run one and two or run two and three.

MAKER calculates a modified Jaccard distance between the MAKER generated gene models and the aligned evidence, called Annotation Edit Distance or AED. Comparing the distribution of AEDs between annotations is a way to tell which annotation set matches the evidence the best. As a rule of thumb, an annotation set is pretty good if greater than ~95% of the annotations have an AED less than 0.5. There is an accessory script in the MAKER bin called AED_cdf_generator.pl that helps in comparing AED scores. This script is mentioned in the protocols paper Carson mentioned. This paper also describes using protein family domains and homology to manually curated proteins in swissprot as quality metrics. Here is a link to the paper. Let me know if you need me to send you a pdf.

http://onlinelibrary.wiley.com/doi/10.1002/0471250953.bi0411s48/abstract

I also have a "use at your own risk" script on github that I use to compare MAKER runs two at a time. The script is called compare_annotations_3.2.pl. This particular script has had a long evolution, so it is a little hard to follow the code, but it might be helpful.

https://github.com/mscampbell/Genome_annotation

The SOBA tool that Barry mentioned is a lot more flexible, and if you are familiar with perl the GAL library does a lot of heavy lifting for you.

Mike

On Apr 19, 2016, at 5:44 PM, Cook, Malcolm wrote:

Just a quick thought

The smallest summary of what you're after might be the jaccard difference between your annotations as computed by bedtools: http://bedtools.readthedocs.org/en/latest/content/tools/jaccard.html

From: maker-devel [mailto:maker-devel-bounces at yandell-lab.org] On Behalf Of Barry Moore
Sent: Tuesday, April 19, 2016 4:37 PM
To: Florian; maker-devel
Cc: Campbell, Michael
Subject: Re: [maker-devel] A way to compare 2 annotation runs?

The Sequence Ontology provides some tools for this:

SOBAcl has some pre-configured reports/graphs with some flexibility to modify their content/layout.

https://github.com/The-Sequence-Ontology/SOBA

This simple example provides a table for two GFF3 files of the count of feature types:

SOBAcl --columns file --rows type --data type --data_type count \
    data/dmel-all-r5.30_0001000.gff data/dmel-all-r5.30_0010000.gff

More complex examples are available in the test file SOBA/t/sobacl_test.sh

The GAL library is a perl library that works well with MAKER output and other valid GFF3 documents. It has some scripts that would provide metrics along the lines of what you're looking for, but it is primarily a programming library to make it easy to roll your own.

https://github.com/The-Sequence-Ontology/GAL

If you're OK with a little bit of perl code, then by modifying the synopsis code in the README a bit, the splice complexity metrics described here (http://www.ncbi.nlm.nih.gov/pubmed/19236712) are easy to produce:

    use GAL::Annotation;

    # Load the annotation (GFF3) together with its genome sequence (FASTA)
    my $annot = GAL::Annotation->new(qw(file.gff file.fasta));
    my $features = $annot->features;

    # Iterate over all gene features and print one line per gene
    my $genes = $features->search( {type => 'gene'} );
    while (my $gene = $genes->next) {
        print $gene->feature_id . "\t";
        print $gene->splice_complexity . "\n";
    }

Hope that helps,

Barry

On Apr 19, 2016, at 9:08 AM, Carson Holt wrote:

I'm going to ask Michael Campbell to answer this. He wrote a protocols paper that will help.

--Carson

On Apr 19, 2016, at 6:08 AM, Florian wrote:

Hello All,

We ran MAKER on a newly assembled genome for 3 iterations, since 2 seems to be the recommended standard and while on holiday I just ran it a third time. Now I want to compare the results of the iterations to see where the annotation (hopefully) improved/changed, but I can't really come up with a clever way to do this. I reckon this has to be an often-solved problem, though I couldn't find a solution except an older entry in this mailing list, and that wasn't helpful.

So how are people assessing the quality of a maker run? How do you say one run was 'better' than another?

best regards & thanks for your input,
Florian

_______________________________________________
maker-devel mailing list
maker-devel at yandell-lab.org
http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org

From mcampbel at cshl.edu Mon Apr 25 09:16:42 2016
From: mcampbel at cshl.edu (Campbell, Michael)
Date: Mon, 25 Apr 2016 14:16:42 -0000
Subject: [maker-devel] A way to compare 2 annotation runs?
In-Reply-To: <571E05B8.5080508@students.uni-mainz.de>
References: <57161FB2.30901@students.uni-mainz.de> <148F7C5C-38D9-4F2C-BA0C-010E133B7F34@gmail.com> <571E05B8.5080508@students.uni-mainz.de>
Message-ID: <3F81B34E-B7FC-4E37-AFA2-514AB2A397F1@cshl.edu>

Hi Florian,

You're not off topic here. I've attached the paper.

Looking at the plot you sent, I'm guessing that there is a red dot right underneath the turquoise dot at (1,1); that would be consistent with the compare annotation script output. Do you have keep_preds=1 set in the maker_opts.ctl file? If so, that would explain the abundance of AED=1 annotations. When keep_preds is set to 1, all of the gene predictions are reported as gene models; when keep_preds is set to 0, only the models with evidence support are reported. Also, how many genes are you expecting and how many are you getting?

The paper I attached goes over different approaches to building final gene sets. The plot attached suggests to me that you have a bunch of unsupported gene models that need to be cleaned out. I will commonly filter out any gene model with an AED of 1 unless it has a protein family domain.
This will almost certainly bring the fraction of annotated gene models with an AED <0.5 up to around 90% or more.

As annotations improve you do usually see fewer total genes, but they are longer. One of the best ways to get a feel for annotation quality is to load the annotations into a browser like apollo or jbrowse and look at a few of your favorite genes.

Thanks,
Mike

On Apr 25, 2016, at 7:55 AM, Florian wrote:

Hello All,

First off, thank you all for your input! I took a look at all your suggestions and have some questions:

The SOBAcl tool is nice, but I can't seem to find a way to get to the AED values MAKER produces. For example, here is a line from my GFF file:

scaffold2278_size3634 maker mRNA 124 2128 . - . ID=CRIP_012390-RA;Parent=CRIP_012390;Name=CRIP_012390-RA;Alias=maker-scaffold2278_size3634-augustus-gene-0.3-mRNA-1;_AED=0.16;_QI=0|1|0.33|1|1|1|3|0|574;_eAED=0.16;Note=Similar to Tbc1d25: TBC1 domain family member 25 (Mus musculus);

Notice the _AED entry is in the 9th "field", combined with all the other descriptive data. Is there a way to get to this? The information about number and mean/distribution of length of genes, while certainly valuable, is hard for me to interpret. How would one classify improvement? More genes annotated? Fewer genes but longer averages?

For the moment I will take a look at GAL, though perl is not my strongest language.

For the scripts Michael provided, I have attached the results. It would be great if you could send me a pdf version of the paper you mentioned. The comparison script lists SN/SP/AC with >98%, which indicates there should be no big changes between annotations, right? But the cumulative AED graph shows that a LOT of entries have an AED value of 1, which would indicate the exact opposite? You said 95% with less than 0.5 AED would be pretty good, so only ~55% would mean this is a pretty bad annotation?

I am not sure if this is maybe too far off topic for the maker mailing list, but thank you for any clarification / input.
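
On the _AED question above: the value sits in column 9 as one of the `key=value` pairs separated by `;`, so it can be pulled out with a short script. What follows is only a minimal sketch in Python, assuming a tab-separated, nine-column MAKER GFF3 with `_AED` on the mRNA lines as in the example line; the function names and the second example record are made up for illustration and are not part of MAKER, GAL or SOBA.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column (key=value pairs joined by ';') into a dict."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def mrna_aeds(gff_lines):
    """Yield (mRNA ID, AED) for every mRNA line that carries an _AED tag."""
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # an embedded sequence section, if present, ends the features
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "mRNA":
            continue
        attrs = parse_attributes(fields[8])
        if "_AED" in attrs:
            yield attrs.get("ID", ""), float(attrs["_AED"])


def fraction_below(aeds, cutoff=0.5):
    """Fraction of AED values strictly below the cutoff (0.0 for an empty input)."""
    aeds = list(aeds)
    if not aeds:
        return 0.0
    return sum(1 for a in aeds if a < cutoff) / len(aeds)


# Small in-memory example; the second mRNA is a hypothetical unsupported model.
example = [
    "##gff-version 3\n",
    "\t".join(["scaffold2278_size3634", "maker", "gene", "124", "2128", ".", "-", ".",
               "ID=CRIP_012390;Name=CRIP_012390"]) + "\n",
    "\t".join(["scaffold2278_size3634", "maker", "mRNA", "124", "2128", ".", "-", ".",
               "ID=CRIP_012390-RA;Parent=CRIP_012390;_AED=0.16;_eAED=0.16"]) + "\n",
    "\t".join(["scaffold1", "maker", "mRNA", "10", "500", ".", "+", ".",
               "ID=hypothetical-RA;Parent=hypothetical;_AED=1.00"]) + "\n",
]

pairs = list(mrna_aeds(example))
print(pairs)
print(fraction_below(aed for _, aed in pairs))
```

For a real run, the list can be replaced by an open file handle on the merged GFF3, and the resulting fraction compared between iterations against the ~95% below 0.5 rule of thumb quoted earlier in the thread; AED_cdf_generator.pl, mentioned above, gives the full cumulative distribution rather than a single number.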
kind regards, Florian On 20.04.2016 15:16, Campbell, Michael wrote: I suspect the Jaccard distance would let you see the annotation sets converging over iterations. The distance between run one and run three should be greater than the distance between run one and two or run two and three. MAKER calculates a modified Jaccard distance between the MAKER generated gene models and the aligned evidence called Annotation Edit Distance or AED. Comparing the distribution of AEDs between annotations is a way to tell which annotation set matches the evidence the best. As a rule of thumb an annotation set is pretty good if greater than ~95% of the annotations have an AED less than 0.5. There is an accessory script in the MAKER bin called AED_cdf_generator.pl that helps in comparing AED scores. This script is mentioned in the protocols paper Carson mentioned. This paper also describes using protein family domains and homology to manually curated proteins in swissprot as quality metrics. Here is a link to the paper. Let me know if you need me to send you a pdf. http://onlinelibrary.wiley.com/doi/10.1002/0471250953.bi0411s48/abstract I also have a "use at your own risk" script on github that I use to compare MAKER runs two at a time. the script is called compare_annotations_3.2.pl. This particular script has had a long evolution, so it is a little hard to follow the code, but it might be helpful. https://github.com/mscampbell/Genome_annotation The SOBA tool that Barry mentioned is a lot more flexible and if you are familiar with perl the GAL library does a lot of heavy lifting for you. Mike On Apr 19, 2016, at 5:44 PM, Cook, Malcolm > wrote: Just a quick thought The smallest summary of what you?re after might be the jaccard difference between you annotation as computed by bedtoolshttp://bedtools.readthedocs.org/en/latest/content/tools/jaccard.html ?? 
From: maker-devel [mailto:maker-devel-bounces at yandell-lab.org] On Behalf Of Barry Moore
Sent: Tuesday, April 19, 2016 4:37 PM
To: Florian; maker-devel
Cc: Campbell, Michael
Subject: Re: [maker-devel] A way to compare 2 annotation runs?

The Sequence Ontology provides some tools for this:

SOBAcl has some pre-configured reports/graphs with some flexibility to modify their content/layout.
https://github.com/The-Sequence-Ontology/SOBA

This simple example provides a table for two GFF3 files of the count of feature types:

    SOBAcl --columns file --rows type --data type --data_type count \
        data/dmel-all-r5.30_0001000.gff data/dmel-all-r5.30_0010000.gff

More complex examples are available in the test file SOBA/t/sobacl_test.sh

The GAL library is a Perl library that works well with MAKER output and other valid GFF3 documents. It has some scripts that would provide metrics along the lines of what you're looking for, but it is primarily a programming library to make it easy to roll your own.
https://github.com/The-Sequence-Ontology/GAL

If you're OK with a little bit of Perl code, then by modifying the synopsis code in the README a bit you can generate the splice complexity metrics described here (http://www.ncbi.nlm.nih.gov/pubmed/19236712):

    use GAL::Annotation;

    # Load the annotation (GFF3) together with its genome sequence
    my $annot    = GAL::Annotation->new(qw(file.gff file.fasta));
    my $features = $annot->features;

    # Iterate over every gene and print its ID and splice complexity
    my $genes = $features->search( {type => 'gene'} );
    while (my $gene = $genes->next) {
        print $gene->feature_id . "\t";
        print $gene->splice_complexity . "\n";
    }

Hope that helps,

Barry

On Apr 19, 2016, at 9:08 AM, Carson Holt wrote:

I'm going to ask Michael Campbell to answer this. He wrote a protocols paper that will help.

--Carson

On Apr 19, 2016, at 6:08 AM, Florian wrote:

Hello All,

We ran MAKER on a newly assembled genome for 3 iterations, since 2 seems to be the recommended standard and while on holiday I just ran it a third time.
Now I want to compare the results of the iterations to see where the annotation (hopefully) improved/changed, but I can't really come up with a clever way to do this.

I reckon this has to be an often solved problem, though I couldn't find a solution except an older entry in this mailing list, but that wasn't helpful.

So how are people assessing quality of a maker run? How do you say one run was 'better' than another?

best regards & thanks for your input,
Florian

_______________________________________________
maker-devel mailing list
maker-devel at yandell-lab.org
http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org
-------------- next part --------------
A non-text attachment was scrubbed...
Name: bi0411 (1).pdf
Type: application/pdf
Size: 484329 bytes
Desc: bi0411 (1).pdf
URL:

From ian.misner at nih.gov Mon Apr 25 11:20:44 2016
From: ian.misner at nih.gov (Misner, Ian (NIH/NIAID) [C])
Date: Mon, 25 Apr 2016 16:20:44 -0000
Subject: [maker-devel] BUSCO
Message-ID:

Hello,

Are there any guidelines for using BUSCO to help train MAKER? CEGMA has been discontinued, but I used to use the cegma2zff.pl steps to use those proteins as a training step. BUSCO seems to train Augustus, but I'm not sure what file to pass from BUSCO to MAKER for this to be properly utilized. I didn't see anything specific about this in the archives.

-----
Ian Misner, Ph.D.
Computational Genomics Specialist
Contractor, Medical Science and Computing, Inc.
Bioinformatics and Computational Biosciences Branch (BCBB)
NIH/NIAID/OD/OSMO/OCICB
5601 Fishers Lane, Room 4A59
Rockville, MD 20892
Office: 301-761-6208
Mobile: 301-704-0151
Email: ian.misner at nih.gov
Web: BCBB Home Page
Twitter: @NIAIDBioIT

Disclaimer: The information in this e-mail and any of its attachments is confidential and may contain sensitive information. It should not be used by anyone who is not the original intended recipient. If you have received this e-mail in error please inform the sender and delete it from your mailbox or any other storage devices. National Institute of Allergy and Infectious Diseases shall not accept liability for any statements made that are sender's own and not expressly made on behalf of the NIAID by one of its representatives.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From mcampbel at cshl.edu Mon Apr 25 12:29:46 2016
From: mcampbel at cshl.edu (Campbell, Michael)
Date: Mon, 25 Apr 2016 17:29:46 -0000
Subject: [maker-devel] A way to compare 2 annotation runs?
In-Reply-To: <1B00E9C3-490A-4C06-A188-1F9EBC02F680@students.uni-mainz.de>
References: <57161FB2.30901@students.uni-mainz.de> <148F7C5C-38D9-4F2C-BA0C-010E133B7F34@gmail.com> <571E05B8.5080508@students.uni-mainz.de> <3F81B34E-B7FC-4E37-AFA2-514AB2A397F1@cshl.edu> <571E3256.90705@students.uni-mainz.de> <8287C01C-93C2-4BCA-9483-4EEE0E584ACD@gmail.com> <1B00E9C3-490A-4C06-A188-1F9EBC02F680@students.uni-mainz.de>
Message-ID:

Hi Florian,

I just looked at the code for the AED_cdf_generator.pl script and it is probably grabbing the AEDs off of the raw gene predictions. I'll modify the code so it only grabs the mRNA lines. In the meantime you can use gff3_merge with the -g flag and it will output the MAKER genes only.
The command would look like this:

    gff3_merge -g all.gff -o genes_only.gff

Mike

> On Apr 25, 2016, at 1:00 PM, Dolze, Florian wrote:
>
> This might be the case, I simply used the script on my complete output gff with all features in it without manually filtering for only mRNA.
>
> On a side note regarding keep_preds, if I wanted to call genes somewhat less stringently because I am expecting to find more, would I set this to e.g. 0.5 to increase the number of genes called?
>
> -Florian
>
>> On 25.04.2016 at 17:30, Carson Holt wrote:
>>
>> If you're running with keep_preds=0, then you either passed in models with model_gff (always kept even without evidence support), or you are parsing out the AED of non-gene reference models from the GFF3 when building your CDF graph. If that is the case, make sure you only pull AED off of features labeled as mRNA in column 3 of the GFF3 and not match features.
>>
>> --Carson
>>
>>> On Apr 25, 2016, at 9:05 AM, Florian wrote:
>>>
>>> Hi Mike,
>>>
>>> We have run MAKER with keep_preds=0. For completeness I attached the options file we used. We used a SNAP model trained on CEGMA data, GeneMark and Augustus trained with their webservice for the first run and then iterated on the results.
>>>
>>> We expect around 17,000-18,000 genes, but our annotation contains ~12.5k according to SOBAcl. If I remove ~40% with AED values of 1 I will be left with very few compared to the expected number.
>>>
>>> type X file type (count)
>>> =========================================================================================================
>>> |                        |../v2_second_round_functional_blast.gff|../v2_third_round_functional_blast.gff|
>>> =========================================================================================================
>>> |CDS                     |                 63953                 |                65160                 |
>>> +------------------------+---------------------------------------+--------------------------------------+
>>> |contig                  |                  5292                 |                 5292                 |
>>> +------------------------+---------------------------------------+--------------------------------------+
>>> |exon                    |                 60381                 |                61233                 |
>>> +------------------------+---------------------------------------+--------------------------------------+
>>> |expressed_sequence_match|                 275160                |                275160                |
>>> +------------------------+---------------------------------------+--------------------------------------+
>>> |five_prime_UTR          |                  9424                 |                 8764                 |
>>> +------------------------+---------------------------------------+--------------------------------------+
>>> |gene                    |                 12654                 |                12235                 |
>>> +------------------------+---------------------------------------+--------------------------------------+
>>> |mRNA                    |                 13698                 |                13137                 |
>>> +------------------------+---------------------------------------+--------------------------------------+
>>> |match                   |                 146111                |                136852                |
>>> +------------------------+---------------------------------------+--------------------------------------+
>>> |match_part              |                1704978                |                1697601               |
>>> +------------------------+---------------------------------------+--------------------------------------+
>>> |protein_match           |                 421814                |                421814                |
>>> +------------------------+---------------------------------------+--------------------------------------+
>>> |three_prime_UTR         |                  6894                 |                 6325                 |
>>> ---------------------------------------------------------------------------------------------------------
>>>
>>> regards,
>>> Florian

From mcampbel at cshl.edu Mon Apr 25 12:43:50 2016
From: mcampbel at cshl.edu (Campbell, Michael)
Date: Mon, 25 Apr 2016 17:43:50 -0000
Subject: [maker-devel] A way to compare 2 annotation runs?
In-Reply-To:
References: <57161FB2.30901@students.uni-mainz.de> <148F7C5C-38D9-4F2C-BA0C-010E133B7F34@gmail.com> <571E05B8.5080508@students.uni-mainz.de> <3F81B34E-B7FC-4E37-AFA2-514AB2A397F1@cshl.edu> <571E3256.90705@students.uni-mainz.de> <8287C01C-93C2-4BCA-9483-4EEE0E584ACD@gmail.com> <1B00E9C3-490A-4C06-A188-1F9EBC02F680@students.uni-mainz.de>
Message-ID: <23F95F61-E0DD-4F55-B3F0-499FC725627D@cshl.edu>

I updated the AED_cdf_generator.pl script on github so it only looks at mRNA lines. The only time that it would get AEDs from the gene predictions is if pred_stats was set to 1. Was pred_stats=1 set in the maker_opts.ctl file?
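
For anyone who wants to pull the values out by hand, the sketch below shows one way to read _AED only from mRNA rows of a MAKER GFF3 file, so that AEDs attached to other feature types are ignored. It is not the code of AED_cdf_generator.pl; the function names are made up for illustration, and it assumes tab-separated GFF3 with the feature type in column 3 and key=value pairs in column 9.

```python
def parse_attributes(column9):
    """Turn the 'key=value;key=value' text of GFF3 column 9 into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def mrna_aeds(gff_lines):
    """Yield (feature ID, AED) for every mRNA row that carries an _AED tag."""
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence follows; no more feature rows
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue  # column 3 is the feature type
        attrs = parse_attributes(cols[8])
        if "_AED" in attrs:
            yield attrs.get("ID", ""), float(attrs["_AED"])


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python mrna_aeds.py genes_only.gff
    with open(sys.argv[1]) as handle:
        for feature_id, aed in mrna_aeds(handle):
            print(f"{feature_id}\t{aed}")
```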
Thanks,
Mike

From mcampbel at cshl.edu Tue Apr 26 09:48:10 2016
From: mcampbel at cshl.edu (Campbell, Michael)
Date: Tue, 26 Apr 2016 14:48:10 -0000
Subject: [maker-devel] A way to compare 2 annotation runs?
In-Reply-To: <571F4E29.9080103@students.uni-mainz.de>
References: <57161FB2.30901@students.uni-mainz.de> <148F7C5C-38D9-4F2C-BA0C-010E133B7F34@gmail.com> <571E05B8.5080508@students.uni-mainz.de> <349A414A-BA65-420E-9A39-5B3583993AB9@genetics.utah.edu> <571F4E29.9080103@students.uni-mainz.de>
Message-ID: <7D300E49-AEF6-424B-912D-78F9551A14B8@cshl.edu>

Glad to hear it.
Good luck, Mike On Apr 26, 2016, at 7:16 AM, Florian > wrote: Hello all, With the updated scripts things look much better. I get 95% of the mRNA features with <= 0.5 AED now and SOBAcl gave me a mean AED value of 0.17 / 0.16 for run 2/3. I think thats an OK result for a newly assembled genome? Thank you all for the great help, Florian On 25.04.2016 21:46, Barry Moore wrote: Hi Florian, SomethinmRNA like this should work: SOBAcl -data +_AED t/data/refseq_short.gff3 --data_type mean Sorry this feature was undocumented and I discovered a bug in it while I was looking at it just now, so you?ll need to pull an update from git for it to work correctly. Basically if you add a ?+? to the valued passed to ?data SOBAcl will treat the ?data value (with the + removed) as a key to look up the value in the attributes from column 9, so if +_AED is given on the command line then the value of the _AED attribute will be used for the summary statistics. Note if the attribute is missing for a given feature then 0 is used as the value (which is of course different than treating it as NULL). Also note if ?data_type is count then feature that have the given attribute are counted regardless of the value of the attribute. Just FYI, grabbing those values with a GAL script would look like this (untested): use GAL::Annotation; my $annot = GAL::Annotation->new(qw(file.gff); my $features = $annot->features; my $mRNAs = $features->search( {type => ?mRNA'} ); while (my $mRNA = $mRNAs->next) { print $mRNA->feature_id; print ?\t"; print $mRNA->attribute_value(?_AED?); print ?\n?; } } B On Apr 25, 2016, at 5:55 AM, Florian <fdolze at students.uni-mainz.de> wrote: Hello All, First off, thank you all for your input! I took a look at all your suggestions and have some questions: The SOBAcl tool is nice but I cant seem to find a way to get to the AED values MAKER produces. For example here is a line from my GFF file: scaffold2278_size3634 maker mRNA 124 2128 . - . 
ID=CRIP_012390-RA;Parent=CRIP_012390;Name=CRIP_012390-RA;Alias=maker-scaffold2278_size3634-augustus-gene-0.3-mRNA-1;_AED=0.16;_QI=0|1|0.33|1|1|1|3|0|574;_eAED=0.16;Note=Similar to Tbc1d25: TBC1 domain family member 25 (Mus musculus);

Notice the _AED entry is in the 9th "field", combined with all the other descriptive data. Is there a way to get to this?

The information about the number and the mean/distribution of gene lengths, while certainly valuable, is hard for me to interpret. How would one classify improvement? More genes annotated? Fewer genes but longer averages?

For the moment I will take a look at GAL, though Perl is not my strongest language.

For the scripts Michael provided, I have attached the results. It would be great if you could send me a PDF version of the paper you mentioned. The comparison script lists SN/SP/AC at >98%, which indicates there should be no big changes between annotations, right? But the cumulative AED graph shows that a LOT of entries have an AED value of 1, which would indicate the exact opposite? You said 95% with less than 0.5 AED would be pretty good, so only ~55% would mean this is a pretty bad annotation?

I am not sure if this is maybe too far off topic for the maker mailing list, but thank you for any clarification / input.

kind regards,
Florian

On 20.04.2016 15:16, Campbell, Michael wrote:

I suspect the Jaccard distance would let you see the annotation sets converging over iterations. The distance between run one and run three should be greater than the distance between run one and two or run two and three. MAKER calculates a modified Jaccard distance between the MAKER-generated gene models and the aligned evidence, called Annotation Edit Distance or AED. Comparing the distribution of AEDs between annotations is a way to tell which annotation set matches the evidence best. As a rule of thumb, an annotation set is pretty good if more than ~95% of the annotations have an AED less than 0.5.
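The rule of thumb above can be checked directly against a MAKER GFF3 file. What follows is only a minimal Python sketch, not part of MAKER, SOBA or GAL: the function names `read_aed_values` and `fraction_below` are made up for illustration, and it assumes tab-separated GFF3 with the `_AED` key stored in column 9, as in the example line quoted earlier in this thread.

```python
import re

# Matches the _AED key in column 9; the leading anchor keeps it from
# picking up other keys that merely end in "AED".
AED_RE = re.compile(r"(?:^|;)_AED=([0-9.]+)")


def read_aed_values(gff_lines, feature_type="mRNA"):
    """Return the _AED value of every feature of the given type."""
    values = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives and blank lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9 or columns[2] != feature_type:
            continue  # not a feature line of the requested type
        match = AED_RE.search(columns[8])
        if match:
            values.append(float(match.group(1)))
    return values


def fraction_below(values, threshold=0.5):
    """Fraction of values strictly below the threshold (0.0 if empty)."""
    if not values:
        return 0.0
    return sum(1 for v in values if v < threshold) / len(values)


if __name__ == "__main__":
    # Hypothetical records; real input would be the lines of a MAKER GFF3 file.
    example = [
        "##gff-version 3",
        "ctg1\tmaker\tmRNA\t1\t900\t.\t+\t.\tID=g1-RA;Parent=g1;_AED=0.16;_eAED=0.16",
        "ctg1\tmaker\tmRNA\t1\t700\t.\t+\t.\tID=g1-RB;Parent=g1;_AED=0.62;_eAED=0.60",
        "ctg2\tmaker\tmRNA\t5\t300\t.\t-\t.\tID=g2-RA;Parent=g2;_AED=1.00;_eAED=1.00",
    ]
    aeds = read_aed_values(example)
    print(f"mRNAs with an _AED value: {len(aeds)}")
    print(f"mean AED: {sum(aeds) / len(aeds):.2f}")
    print(f"fraction with AED < 0.5: {fraction_below(aeds):.2f}")
```

Running the same two functions on the GFF3 output of each iteration gives one number per run that can be compared against the ~95% guideline. Features without an `_AED` attribute are skipped here rather than counted as 0, which differs from the SOBAcl behaviour described above.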
There is an accessory script in the MAKER bin called AED_cdf_generator.pl that helps in comparing AED scores. This script is mentioned in the protocols paper Carson mentioned. This paper also describes using protein family domains and homology to manually curated proteins in SwissProt as quality metrics. Here is a link to the paper. Let me know if you need me to send you a PDF.

http://onlinelibrary.wiley.com/doi/10.1002/0471250953.bi0411s48/abstract

I also have a "use at your own risk" script on GitHub that I use to compare MAKER runs two at a time. The script is called compare_annotations_3.2.pl. This particular script has had a long evolution, so the code is a little hard to follow, but it might be helpful.

https://github.com/mscampbell/Genome_annotation

The SOBA tool that Barry mentioned is a lot more flexible, and if you are familiar with Perl, the GAL library does a lot of the heavy lifting for you.

Mike

On Apr 19, 2016, at 5:44 PM, Cook, Malcolm wrote:

Just a quick thought.

The smallest summary of what you're after might be the Jaccard difference between your annotations as computed by bedtools:
http://bedtools.readthedocs.org/en/latest/content/tools/jaccard.html

From: maker-devel [mailto:maker-devel-bounces at yandell-lab.org] On Behalf Of Barry Moore
Sent: Tuesday, April 19, 2016 4:37 PM
To: Florian; maker-devel
Cc: Campbell, Michael
Subject: Re: [maker-devel] A way to compare 2 annotation runs?

The Sequence Ontology provides some tools for this:

SOBAcl has some pre-configured reports/graphs with some flexibility to modify their content/layout.
https://github.com/The-Sequence-Ontology/SOBA

This simple example provides a table for two GFF3 files of the count of feature types:

    SOBAcl --columns file --rows type --data type --data_type count \
        data/dmel-all-r5.30_0001000.gff data/dmel-all-r5.30_0010000.gff

More complex examples are available in the test file SOBA/t/sobacl_test.sh

The GAL library is a Perl library that works well with MAKER output and other valid GFF3 documents. It has some scripts that would provide metrics along the lines of what you're looking for, but it is primarily a programming library to make it easy to roll your own.

https://github.com/The-Sequence-Ontology/GAL

If you're OK with a little bit of Perl code, then by modifying the synopsis code in the README a bit, the splice complexity metrics described here (http://www.ncbi.nlm.nih.gov/pubmed/19236712) are easy to produce:

    use GAL::Annotation;

    my $annot    = GAL::Annotation->new(qw(file.gff file.fasta));
    my $features = $annot->features;

    # Print the ID and the splice complexity of every gene
    my $genes = $features->search( {type => 'gene'} );
    while (my $gene = $genes->next) {
        print $gene->feature_id . "\t";
        print $gene->splice_complexity . "\n";
    }

Hope that helps,
Barry

On Apr 19, 2016, at 9:08 AM, Carson Holt wrote:

I'm going to ask Michael Campbell to answer this. He wrote a protocols paper that will help.

--Carson

On Apr 19, 2016, at 6:08 AM, Florian wrote:

Hello All,

We ran MAKER on a newly assembled genome for 3 iterations, since 2 seems to be the recommended standard and while on holiday I just ran it a third time. Now I want to compare the results of the iterations to see where the annotation (hopefully) improved/changed, but I can't really come up with a clever way to do this.

I reckon this has to be an often-solved problem, though I couldn't find a solution except an older entry in this mailing list, and that wasn't helpful.

So how are people assessing the quality of a MAKER run? How do you say one run was 'better' than another?
best regards & thanks for your input,
Florian

_______________________________________________
maker-devel mailing list
maker-devel at yandell-lab.org
http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org

From qlian003 at ucr.edu Wed Apr 27 13:06:48 2016
From: qlian003 at ucr.edu (Qihua Liang)
Date: Wed, 27 Apr 2016 18:06:48 -0000
Subject: [maker-devel] Maker example data for 2013 GMOD summer school
In-Reply-To: <1772AAA1-C6ED-4FCA-B4C9-39F522D3D076@genetics.utah.edu>
References: <1772AAA1-C6ED-4FCA-B4C9-39F522D3D076@genetics.utah.edu>
Message-ID: <8F27CEB4-B16B-4BDC-BA11-5FCCBD05BC3C@ucr.edu>

Hi Daniel,

I have been using MAKER to annotate the cowpea genome for a while, and now I am wondering whether I could use multiple threads instead of a single one. It has been running tblastx for a very long time on a single thread, but I couldn't find a setting in the documentation for assigning multiple threads to a MAKER run. Is there such an option?

Thank you
Qihua

> On Mar 30, 2016, at 2:17 PM, Daniel Ence wrote:
>
> Hi Qihua,
>
> I believe that most of the data we used in the tutorials are available in the maker/data directory, which is included in all maker distributions. Please let me know if that isn't the case.
>
> ~Daniel
>
> Daniel Ence
> Graduate Student
> Eccles Institute of Human Genetics
> University of Utah
> 15 North 2030 East, Room 2100
> Salt Lake City, UT 84112-5330
>
>> On Mar 30, 2016, at 3:10 PM, Qihua Liang wrote:
>>
>> Hi Michael and Daniel,
>>
>> I am a graduate student in UC Riverside, and recently I am learning to use Maker for genome annotation.
I was trying to find some tutorials to follow and practice on example data, and I found out that you gave a talk on Maker during the 2013 GMOD summer school, and the tutorial for it is very detailed. Nice job!
>>
>> But the example data under the folder you mentioned, ./maker/maker_course, is not provided on the website, and I am wondering whether it is available to the public. If yes, could you send me those materials so that I can follow your tutorial to practice using Maker?
>>
>> Thank you
>> Best
>> Qihua

From qlian003 at ucr.edu Wed Apr 27 13:35:09 2016
From: qlian003 at ucr.edu (Qihua Liang)
Date: Wed, 27 Apr 2016 18:35:09 -0000
Subject: [maker-devel] Maker example data for 2013 GMOD summer school
In-Reply-To: <2572DB54-6C29-483E-AAAB-7626FEE76DFC@genetics.utah.edu>
References: <1772AAA1-C6ED-4FCA-B4C9-39F522D3D076@genetics.utah.edu> <8F27CEB4-B16B-4BDC-BA11-5FCCBD05BC3C@ucr.edu> <2572DB54-6C29-483E-AAAB-7626FEE76DFC@genetics.utah.edu>
Message-ID:

Hi Daniel,

Actually, I'm blasting with both cowpea RNA-seq and common bean RNA-seq. And yes, the datasets are large, so it has already taken a couple of weeks and it's still running. Do you have any advice on speeding up this process?

Thanks
Qihua

> On Apr 27, 2016, at 11:16 AM, Daniel Ence wrote:
>
> Hi Qihua,
>
> In the maker_opts.ctl file there is an option "cpus" which allows you to tell blast to use more than 1 cpu. The comment for the line says that you should not set this higher than 1 when using MPI. I believe that the reason for this is that each thread runs blast on its own, so the number of cpus used will be the number of MPI threads X the number of cpus for blast, which can quickly get larger than the number of cpus available.
>
> At the same time, it's usually not advisable to use tblastx to align large datasets because of the increased amount of time it takes. Are these RNAseq datasets from another species that you're using tblastx for?
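The arithmetic behind the warning about combining the `cpus` option with MPI can be sketched in a few lines of Python. This is only an illustration of the multiplication described above; the function names and the example numbers are hypothetical and nothing here is part of MAKER.

```python
def total_blast_cpus(mpi_processes, cpus_per_process):
    """CPUs requested when every MPI process runs its own multi-threaded blast."""
    return mpi_processes * cpus_per_process


def is_oversubscribed(mpi_processes, cpus_per_process, available_cpus):
    """True if the combined request exceeds the CPUs actually available."""
    return total_blast_cpus(mpi_processes, cpus_per_process) > available_cpus


if __name__ == "__main__":
    # Hypothetical machine with 32 cores, 16 MPI processes, cpus=4 in maker_opts.ctl
    requested = total_blast_cpus(16, 4)
    print(f"requested CPUs: {requested}")
    print(f"oversubscribed on 32 cores: {is_oversubscribed(16, 4, 32)}")
    # With cpus=1 the request equals the number of MPI processes
    print(f"oversubscribed with cpus=1: {is_oversubscribed(16, 1, 32)}")
```

Keeping `cpus` at 1 under MPI, as the comment in the control file advises, means the request grows only with the number of MPI processes.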
>
> Daniel Ence
> Graduate Student
> Eccles Institute of Human Genetics
> University of Utah
> 15 North 2030 East, Room 2100
> Salt Lake City, UT 84112-5330

From chenwenbo1020 at gmail.com Sat Apr 2 17:41:26 2016
From: chenwenbo1020 at gmail.com (=?UTF-8?B?6ZmI5paH5Y2a?=)
Date: Sat, 2 Apr 2016 19:41:26 -0400
Subject: [maker-devel] mapping annotations to a new assembly
Message-ID:

Hi All,

Recently I updated the genome assembly, and I want to update the annotation to fit the new genome; I only want to update the gene positions. I used MAKER and changed the maker_opts.ctl file as follows:

    genome=$PATH_TO_mygenome
    organism_type=eukaryotic
    est=$PATH_TO_transcript_seq
    est2genome=1
    est_forward=1

After running MAKER, some genes were lost. There are 14,146 transcripts as input, but only 13,092 gene models were in the output. Does anyone know the reason? Thank you!

Best regards,
Wenbo

From carsonhh at gmail.com Mon Apr 4 10:34:45 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 4 Apr 2016 10:34:45 -0600
Subject: [maker-devel] mapping annotations to a new assembly
In-Reply-To:
References:
Message-ID: <077DBA54-07A3-4A74-8A76-8F7E7EA246E3@gmail.com>

Because the assembly has changed. That means that sequence can be different, missing, or altered to break a previous CDS. You can try relaxing the filtering parameters in maker_bopts.ctl to recover more partial or incomplete matches. Also adjust the max intron size to allow for really long introns. That might recover a few more.

--Carson

From chenwenbo1020 at gmail.com Mon Apr 4 10:40:32 2016
From: chenwenbo1020 at gmail.com (=?UTF-8?B?6ZmI5paH5Y2a?=)
Date: Mon, 4 Apr 2016 12:40:32 -0400
Subject: [maker-devel] mapping annotations to a new assembly
In-Reply-To: <077DBA54-07A3-4A74-8A76-8F7E7EA246E3@gmail.com>
References: <077DBA54-07A3-4A74-8A76-8F7E7EA246E3@gmail.com>
Message-ID:

Hi Carson,

Thank you.

Sorry, I forgot to mention that in the new version of the assembly I only connected some scaffolds into super-scaffolds with Ns.

Another question:

MAKER uses blast to anchor the gene. If some genes were mapped to multiple positions (for example single-exon genes), what will MAKER decide to do?

Thanks!

Best,
Wenbo

From carsonhh at gmail.com Mon Apr 4 10:42:58 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 4 Apr 2016 10:42:58 -0600
Subject: [maker-devel] mapping annotations to a new assembly
In-Reply-To:
References: <077DBA54-07A3-4A74-8A76-8F7E7EA246E3@gmail.com>
Message-ID: <2005D161-2359-4836-965D-1007E9BADEA6@gmail.com>

MAKER will report back all positions. The value in the score column can be used to see how well they match the original (range between 0 and 100). In the event of a tie, you will need to manually select one or the other. The process of mapping onto a new assembly is unfortunately not completely automated. It still requires intervention from the user in those cases.

--Carson

From kai.kamm at ecolevol.de Mon Apr 18 07:13:14 2016
From: kai.kamm at ecolevol.de (Kai Kamm)
Date: Mon, 18 Apr 2016 15:13:14 +0200
Subject: [maker-devel] Maker Failed Contigs, Bio::Root::Exception
Message-ID: <5714DD6A.1080309@ecolevol.de>

Hi,

while I have no problem running Maker on my desktop computer (Ubuntu 14.04 LTS), I always get the error below (for all contigs) when I try to run Maker on a server.
------------- EXCEPTION: Bio::Root::Exception ------------- MSG: Did not specify a Query End or Query Begin STACK: Error::throw STACK: Bio::Root::Root::throw /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Root/Root.pm:449 STACK: Bio::Search::HSP::GenericHSP::_query_seq_feature /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Search/HSP/GenericHSP.pm:1525 STACK: Bio::Search::HSP::GenericHSP::query /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Search/HSP/GenericHSP.pm:956 STACK: Bio::Search::HSP::HSPI::start /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Search/HSP/HSPI.pm:504 STACK: PhatHit_utils::add_offset /homes/biertank/kai/maker/bin/../lib/PhatHit_utils.pm:1462 STACK: GI::parse_abinit_file /homes/biertank/kai/maker/bin/../lib/GI.pm:1199 STACK: Process::MpiChunk::_go /homes/biertank/kai/maker/bin/../lib/Process/MpiChunk.pm:1469 STACK: Process::MpiChunk::run /homes/biertank/kai/maker/bin/../lib/Process/MpiChunk.pm:341 STACK: main::node_thread /homes/biertank/kai/maker/bin/maker:1454 STACK: threads::new /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/forks.pm:799 STACK: /homes/biertank/kai/maker/bin/maker:914 ----------------------------------------------------------- --> rank=2, hostname=bioinf.uni-leipzig.de ERROR: Failed while gathering ab-init output files ERROR: Chunk failed at level:1, tier_type:2 FAILED CONTIG:scaffold20_cov246 ERROR: Chunk failed at level:4, tier_type:0 FAILED CONTIG:scaffold20_cov246 examining contents of the fasta file and run log I have tried to rerun "perl ./Build.PL" and then "./Build install" several times using different versions of Perl. To install the required Perl modules I have used "./Build installdeps" and I also tried installing the dependencies manually via CPAN - to no avail. Any idea? Thank you! 
Kai From carsonhh at gmail.com Mon Apr 18 14:30:28 2016 From: carsonhh at gmail.com (Carson Holt) Date: Mon, 18 Apr 2016 14:30:28 -0600 Subject: [maker-devel] Maker Failed Contigs, Bio::Root::Exception In-Reply-To: <5714DD6A.1080309@ecolevol.de> References: <5714DD6A.1080309@ecolevol.de> Message-ID: <5249E98C-9902-4369-9B68-95F3662B61CE@gmail.com> Try updating BioPerl (use the CPAN version and not the BioPerl-live version because it will fail). Also use MAKER version 2.31.8 and not the 3.00.0-beta version. Then make sure there is not error further up. What you are seeing may be a snowball effect of the real error which could be several screens back in the text. If you are using GFF3 files as input then your format is probably incorrect. ?Carson > On Apr 18, 2016, at 7:13 AM, Kai Kamm wrote: > > Hi, > > while I have no problem running Maker on my desktop computer (Ubuntu 14.04 LTS), I always get the error below (for all contigs) when I try to run Maker on a server. > > > ------------- EXCEPTION: Bio::Root::Exception ------------- > MSG: Did not specify a Query End or Query Begin > STACK: Error::throw > STACK: Bio::Root::Root::throw /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Root/Root.pm:449 > STACK: Bio::Search::HSP::GenericHSP::_query_seq_feature /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Search/HSP/GenericHSP.pm:1525 > STACK: Bio::Search::HSP::GenericHSP::query /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Search/HSP/GenericHSP.pm:956 > STACK: Bio::Search::HSP::HSPI::start /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Search/HSP/HSPI.pm:504 > STACK: PhatHit_utils::add_offset /homes/biertank/kai/maker/bin/../lib/PhatHit_utils.pm:1462 > STACK: GI::parse_abinit_file /homes/biertank/kai/maker/bin/../lib/GI.pm:1199 > STACK: Process::MpiChunk::_go /homes/biertank/kai/maker/bin/../lib/Process/MpiChunk.pm:1469 > STACK: Process::MpiChunk::run /homes/biertank/kai/maker/bin/../lib/Process/MpiChunk.pm:341 > STACK: main::node_thread 
/homes/biertank/kai/maker/bin/maker:1454 > STACK: threads::new /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/forks.pm:799 > STACK: /homes/biertank/kai/maker/bin/maker:914 > ----------------------------------------------------------- > --> rank=2, hostname=bioinf.uni-leipzig.de > ERROR: Failed while gathering ab-init output files > ERROR: Chunk failed at level:1, tier_type:2 > FAILED CONTIG:scaffold20_cov246 > > ERROR: Chunk failed at level:4, tier_type:0 > FAILED CONTIG:scaffold20_cov246 > > examining contents of the fasta file and run log > > > > I have tried to rerun "perl ./Build.PL" and then "./Build install" several times using different versions of Perl. To install the required Perl modules I have used "./Build installdeps" and I also tried installing the dependencies manually via CPAN - to no avail. > > Any idea? > > Thank you! > Kai > > _______________________________________________ > maker-devel mailing list > maker-devel at yandell-lab.org > http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org From fdolze at students.uni-mainz.de Tue Apr 19 06:08:18 2016 From: fdolze at students.uni-mainz.de (Florian) Date: Tue, 19 Apr 2016 14:08:18 +0200 Subject: [maker-devel] A way to compare 2 annotation runs? Message-ID: <57161FB2.30901@students.uni-mainz.de> Hello All, We ran MAKER on a newly assembled genome for 3 iterations, since 2 seems to be the recommended standard and while on holiday I just ran it a third time. Now I want to compare the results of the iterations to see where the annotation (hopefully) improved/changed but I cant really come up with a clever way to this. I reckon this has to be an often solved problem though I couldnt find a solution except an older entry in this mail-list but that wasnt helpful. So how are people assessing quality of a maker run? How do you say one run was 'better' than another? 
best regards & thanks for your input, Florian From kai.kamm at ecolevol.de Tue Apr 19 06:36:53 2016 From: kai.kamm at ecolevol.de (Kai Kamm) Date: Tue, 19 Apr 2016 14:36:53 +0200 Subject: [maker-devel] Maker Failed Contigs, Bio::Root::Exception In-Reply-To: <5249E98C-9902-4369-9B68-95F3662B61CE@gmail.com> References: <5714DD6A.1080309@ecolevol.de> <5249E98C-9902-4369-9B68-95F3662B61CE@gmail.com> Message-ID: <57162665.7070409@ecolevol.de> Hello, now it seems to work. I (re)installed BioPerl like so: ------------------------------------------------------------ find the name of the latest BioPerl package: cpan>d /bioperl/ .... Distribution CJFIELDS/BioPerl-1.6.901.tar.gz Distribution CJFIELDS/BioPerl-1.6.922.tar.gz Distribution CJFIELDS/BioPerl-1.6.924.tar.gz And install the most recent: cpan>install CJFIELDS/BioPerl-1.6.924.tar.gz ---------------------------------------------------------------- Produced some error messages during install, but Maker now works. Just wonder why the BioPerl installation did not work properly with neither "./Build installdeps" nor via cpan>install Bundle::BioPerl. And why it worked this way on my desktop. Anyway Thanks! Am 18.04.2016 um 22:30 schrieb Carson Holt: > Try updating BioPerl (use the CPAN version and not the BioPerl-live version because it will fail). Also use MAKER version 2.31.8 and not the 3.00.0-beta version. > > Then make sure there is not error further up. What you are seeing may be a snowball effect of the real error which could be several screens back in the text. If you are using GFF3 files as input then your format is probably incorrect. > > ?Carson > > >> On Apr 18, 2016, at 7:13 AM, Kai Kamm wrote: >> >> Hi, >> >> while I have no problem running Maker on my desktop computer (Ubuntu 14.04 LTS), I always get the error below (for all contigs) when I try to run Maker on a server. 
>> >> >> ------------- EXCEPTION: Bio::Root::Exception ------------- >> MSG: Did not specify a Query End or Query Begin >> STACK: Error::throw >> STACK: Bio::Root::Root::throw /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Root/Root.pm:449 >> STACK: Bio::Search::HSP::GenericHSP::_query_seq_feature /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Search/HSP/GenericHSP.pm:1525 >> STACK: Bio::Search::HSP::GenericHSP::query /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Search/HSP/GenericHSP.pm:956 >> STACK: Bio::Search::HSP::HSPI::start /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Search/HSP/HSPI.pm:504 >> STACK: PhatHit_utils::add_offset /homes/biertank/kai/maker/bin/../lib/PhatHit_utils.pm:1462 >> STACK: GI::parse_abinit_file /homes/biertank/kai/maker/bin/../lib/GI.pm:1199 >> STACK: Process::MpiChunk::_go /homes/biertank/kai/maker/bin/../lib/Process/MpiChunk.pm:1469 >> STACK: Process::MpiChunk::run /homes/biertank/kai/maker/bin/../lib/Process/MpiChunk.pm:341 >> STACK: main::node_thread /homes/biertank/kai/maker/bin/maker:1454 >> STACK: threads::new /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/forks.pm:799 >> STACK: /homes/biertank/kai/maker/bin/maker:914 >> ----------------------------------------------------------- >> --> rank=2, hostname=bioinf.uni-leipzig.de >> ERROR: Failed while gathering ab-init output files >> ERROR: Chunk failed at level:1, tier_type:2 >> FAILED CONTIG:scaffold20_cov246 >> >> ERROR: Chunk failed at level:4, tier_type:0 >> FAILED CONTIG:scaffold20_cov246 >> >> examining contents of the fasta file and run log >> >> >> >> I have tried to rerun "perl ./Build.PL" and then "./Build install" several times using different versions of Perl. To install the required Perl modules I have used "./Build installdeps" and I also tried installing the dependencies manually via CPAN - to no avail. >> >> Any idea? >> >> Thank you! 
>> Kai >> >> _______________________________________________ >> maker-devel mailing list >> maker-devel at yandell-lab.org >> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org > From carsonhh at gmail.com Tue Apr 19 09:08:02 2016 From: carsonhh at gmail.com (Carson Holt) Date: Tue, 19 Apr 2016 09:08:02 -0600 Subject: [maker-devel] A way to compare 2 annotation runs? In-Reply-To: <57161FB2.30901@students.uni-mainz.de> References: <57161FB2.30901@students.uni-mainz.de> Message-ID: <148F7C5C-38D9-4F2C-BA0C-010E133B7F34@gmail.com> I?m going to ask Michael Campbell to answer this. He wrote a protocols paper that will help. ?Carson > On Apr 19, 2016, at 6:08 AM, Florian wrote: > > > Hello All, > > We ran MAKER on a newly assembled genome for 3 iterations, since 2 seems to be the recommended standard and while on holiday I just ran it a third time. Now I want to compare the results of the iterations to see where the annotation (hopefully) improved/changed but I cant really come up with a clever way to this. > > I reckon this has to be an often solved problem though I couldnt find a solution except an older entry in this mail-list but that wasnt helpful. > > > So how are people assessing quality of a maker run? How do you say one run was 'better' than another? 
> > > best regards & thanks for your input, > Florian > > _______________________________________________ > maker-devel mailing list > maker-devel at yandell-lab.org > http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org From carsonhh at gmail.com Tue Apr 19 09:18:20 2016 From: carsonhh at gmail.com (Carson Holt) Date: Tue, 19 Apr 2016 09:18:20 -0600 Subject: [maker-devel] Maker Failed Contigs, Bio::Root::Exception In-Reply-To: <57162665.7070409@ecolevol.de> References: <5714DD6A.1080309@ecolevol.de> <5249E98C-9902-4369-9B68-95F3662B61CE@gmail.com> <57162665.7070409@ecolevol.de> Message-ID: <8B4352FC-113E-45EC-B7D4-6983B8FF2815@gmail.com> Intall as so ?> cpan> install Bio::Perl But it sounds like you?ve got a proper version now. Most likely you had a non-cpan version of BioPerl installed. The version it gave met the ./Build dependency requirements, but it was really a broke version. This happens if you have BioPerl-live installed for example. ?Carson > On Apr 19, 2016, at 6:36 AM, Kai Kamm wrote: > > Hello, > > now it seems to work. I (re)installed BioPerl like so: > > ------------------------------------------------------------ > find the name of the latest BioPerl package: > > cpan>d /bioperl/ > > .... > > Distribution CJFIELDS/BioPerl-1.6.901.tar.gz > Distribution CJFIELDS/BioPerl-1.6.922.tar.gz > Distribution CJFIELDS/BioPerl-1.6.924.tar.gz > > And install the most recent: > > cpan>install CJFIELDS/BioPerl-1.6.924.tar.gz > ---------------------------------------------------------------- > > Produced some error messages during install, but Maker now works. > > Just wonder why the BioPerl installation did not work properly with neither "./Build installdeps" nor via cpan>install Bundle::BioPerl. > > And why it worked this way on my desktop. > > Anyway > Thanks! > > > Am 18.04.2016 um 22:30 schrieb Carson Holt: >> Try updating BioPerl (use the CPAN version and not the BioPerl-live version because it will fail). 
Also use MAKER version 2.31.8 and not the 3.00.0-beta version. >> >> Then make sure there is not error further up. What you are seeing may be a snowball effect of the real error which could be several screens back in the text. If you are using GFF3 files as input then your format is probably incorrect. >> >> ?Carson >> >> >>> On Apr 18, 2016, at 7:13 AM, Kai Kamm wrote: >>> >>> Hi, >>> >>> while I have no problem running Maker on my desktop computer (Ubuntu 14.04 LTS), I always get the error below (for all contigs) when I try to run Maker on a server. >>> >>> >>> ------------- EXCEPTION: Bio::Root::Exception ------------- >>> MSG: Did not specify a Query End or Query Begin >>> STACK: Error::throw >>> STACK: Bio::Root::Root::throw /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Root/Root.pm:449 >>> STACK: Bio::Search::HSP::GenericHSP::_query_seq_feature /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Search/HSP/GenericHSP.pm:1525 >>> STACK: Bio::Search::HSP::GenericHSP::query /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Search/HSP/GenericHSP.pm:956 >>> STACK: Bio::Search::HSP::HSPI::start /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Search/HSP/HSPI.pm:504 >>> STACK: PhatHit_utils::add_offset /homes/biertank/kai/maker/bin/../lib/PhatHit_utils.pm:1462 >>> STACK: GI::parse_abinit_file /homes/biertank/kai/maker/bin/../lib/GI.pm:1199 >>> STACK: Process::MpiChunk::_go /homes/biertank/kai/maker/bin/../lib/Process/MpiChunk.pm:1469 >>> STACK: Process::MpiChunk::run /homes/biertank/kai/maker/bin/../lib/Process/MpiChunk.pm:341 >>> STACK: main::node_thread /homes/biertank/kai/maker/bin/maker:1454 >>> STACK: threads::new /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/forks.pm:799 >>> STACK: /homes/biertank/kai/maker/bin/maker:914 >>> ----------------------------------------------------------- >>> --> rank=2, hostname=bioinf.uni-leipzig.de >>> ERROR: Failed while gathering ab-init output files >>> ERROR: Chunk failed at level:1, tier_type:2 >>> FAILED 
CONTIG:scaffold20_cov246
>>>
>>> ERROR: Chunk failed at level:4, tier_type:0
>>> FAILED CONTIG:scaffold20_cov246
>>>
>>> examining contents of the fasta file and run log
>>>
>>> I have tried to rerun "perl ./Build.PL" and then "./Build install" several times using different versions of Perl. To install the required Perl modules I have used "./Build installdeps" and I also tried installing the dependencies manually via CPAN - to no avail.
>>>
>>> Any idea?
>>>
>>> Thank you!
>>> Kai

From carsonhh at gmail.com Tue Apr 19 09:19:10 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 19 Apr 2016 09:19:10 -0600
Subject: [maker-devel] Maker Failed Contigs, Bio::Root::Exception
In-Reply-To: <8B4352FC-113E-45EC-B7D4-6983B8FF2815@gmail.com>
References: <5714DD6A.1080309@ecolevol.de> <5249E98C-9902-4369-9B68-95F3662B61CE@gmail.com> <57162665.7070409@ecolevol.de> <8B4352FC-113E-45EC-B7D4-6983B8FF2815@gmail.com>
Message-ID: <05666A0C-902E-4493-B107-9BE1BAF8A507@gmail.com>

FYI. BioPerl-live is not broken. Rather, it is under active development and as such cannot be considered stable.

--Carson

> On Apr 19, 2016, at 9:18 AM, Carson Holt wrote:
>
> Install like so:
> cpan> install Bio::Perl
>
> But it sounds like you've got a proper version now. Most likely you had a non-CPAN version of BioPerl installed. The version it gave met the ./Build dependency requirements, but it was really a broken version. This happens if you have BioPerl-live installed, for example.
>
> --Carson
>
>> On Apr 19, 2016, at 6:36 AM, Kai Kamm wrote:
>>
>> Hello,
>>
>> now it seems to work.
I (re)installed BioPerl like so:
>>
>> ------------------------------------------------------------
>> find the name of the latest BioPerl package:
>>
>> cpan>d /bioperl/
>>
>> ....
>>
>> Distribution CJFIELDS/BioPerl-1.6.901.tar.gz
>> Distribution CJFIELDS/BioPerl-1.6.922.tar.gz
>> Distribution CJFIELDS/BioPerl-1.6.924.tar.gz
>>
>> And install the most recent:
>>
>> cpan>install CJFIELDS/BioPerl-1.6.924.tar.gz
>> ----------------------------------------------------------------
>>
>> This produced some error messages during install, but Maker now works.
>>
>> I just wonder why the BioPerl installation worked properly with neither "./Build installdeps" nor cpan>install Bundle::BioPerl.
>>
>> And why it worked this way on my desktop.
>>
>> Anyway
>> Thanks!
>>
>> On 18.04.2016 at 22:30, Carson Holt wrote:
>>> Try updating BioPerl (use the CPAN version and not the BioPerl-live version because it will fail). Also use MAKER version 2.31.8 and not the 3.00.0-beta version.
>>>
>>> Then make sure there is no error further up. What you are seeing may be a snowball effect of the real error, which could be several screens back in the text. If you are using GFF3 files as input, then your format is probably incorrect.
>>>
>>> --Carson
>>>
>>>> On Apr 18, 2016, at 7:13 AM, Kai Kamm wrote:
>>>>
>>>> Hi,
>>>>
>>>> while I have no problem running Maker on my desktop computer (Ubuntu 14.04 LTS), I always get the error below (for all contigs) when I try to run Maker on a server.
>>>> >>>> >>>> ------------- EXCEPTION: Bio::Root::Exception ------------- >>>> MSG: Did not specify a Query End or Query Begin >>>> STACK: Error::throw >>>> STACK: Bio::Root::Root::throw /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Root/Root.pm:449 >>>> STACK: Bio::Search::HSP::GenericHSP::_query_seq_feature /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Search/HSP/GenericHSP.pm:1525 >>>> STACK: Bio::Search::HSP::GenericHSP::query /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Search/HSP/GenericHSP.pm:956 >>>> STACK: Bio::Search::HSP::HSPI::start /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Search/HSP/HSPI.pm:504 >>>> STACK: PhatHit_utils::add_offset /homes/biertank/kai/maker/bin/../lib/PhatHit_utils.pm:1462 >>>> STACK: GI::parse_abinit_file /homes/biertank/kai/maker/bin/../lib/GI.pm:1199 >>>> STACK: Process::MpiChunk::_go /homes/biertank/kai/maker/bin/../lib/Process/MpiChunk.pm:1469 >>>> STACK: Process::MpiChunk::run /homes/biertank/kai/maker/bin/../lib/Process/MpiChunk.pm:341 >>>> STACK: main::node_thread /homes/biertank/kai/maker/bin/maker:1454 >>>> STACK: threads::new /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/forks.pm:799 >>>> STACK: /homes/biertank/kai/maker/bin/maker:914 >>>> ----------------------------------------------------------- >>>> --> rank=2, hostname=bioinf.uni-leipzig.de >>>> ERROR: Failed while gathering ab-init output files >>>> ERROR: Chunk failed at level:1, tier_type:2 >>>> FAILED CONTIG:scaffold20_cov246 >>>> >>>> ERROR: Chunk failed at level:4, tier_type:0 >>>> FAILED CONTIG:scaffold20_cov246 >>>> >>>> examining contents of the fasta file and run log >>>> >>>> >>>> >>>> I have tried to rerun "perl ./Build.PL" and then "./Build install" several times using different versions of Perl. To install the required Perl modules I have used "./Build installdeps" and I also tried installing the dependencies manually via CPAN - to no avail. >>>> >>>> Any idea? >>>> >>>> Thank you! 
>>>> Kai

From cjfields at illinois.edu Tue Apr 19 10:11:06 2016
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Tue, 19 Apr 2016 16:11:06 +0000
Subject: [maker-devel] Maker Failed Contigs, Bio::Root::Exception
In-Reply-To: <05666A0C-902E-4493-B107-9BE1BAF8A507@gmail.com>
References: <5714DD6A.1080309@ecolevol.de> <5249E98C-9902-4369-9B68-95F3662B61CE@gmail.com> <57162665.7070409@ecolevol.de> <8B4352FC-113E-45EC-B7D4-6983B8FF2815@gmail.com> <05666A0C-902E-4493-B107-9BE1BAF8A507@gmail.com>
Message-ID:

Yup. Though Bio-Root has been added back (which IIRC was the main problem with breakage on the master branch).

chris

> On Apr 19, 2016, at 10:19 AM, Carson Holt wrote:
>
> FYI. BioPerl-live is not broken. Rather, it is under active development and as such cannot be considered stable.
>
> --Carson
>
>> On Apr 19, 2016, at 9:18 AM, Carson Holt wrote:
>>
>> Install like so:
>> cpan> install Bio::Perl
>>
>> But it sounds like you've got a proper version now. Most likely you had a non-CPAN version of BioPerl installed. The version it gave met the ./Build dependency requirements, but it was really a broken version. This happens if you have BioPerl-live installed, for example.
>>
>> --Carson
>>
>>> On Apr 19, 2016, at 6:36 AM, Kai Kamm wrote:
>>>
>>> Hello,
>>>
>>> now it seems to work. I (re)installed BioPerl like so:
>>>
>>> ------------------------------------------------------------
>>> find the name of the latest BioPerl package:
>>>
>>> cpan>d /bioperl/
>>>
>>> ....
>>>
>>> Distribution CJFIELDS/BioPerl-1.6.901.tar.gz
>>> Distribution CJFIELDS/BioPerl-1.6.922.tar.gz
>>> Distribution CJFIELDS/BioPerl-1.6.924.tar.gz
>>>
>>> And install the most recent:
>>>
>>> cpan>install CJFIELDS/BioPerl-1.6.924.tar.gz
>>> ----------------------------------------------------------------
>>>
>>> This produced some error messages during install, but Maker now works.
>>>
>>> I just wonder why the BioPerl installation worked properly with neither "./Build installdeps" nor cpan>install Bundle::BioPerl.
>>>
>>> And why it worked this way on my desktop.
>>>
>>> Anyway
>>> Thanks!
>>>
>>> On 18.04.2016 at 22:30, Carson Holt wrote:
>>>> Try updating BioPerl (use the CPAN version and not the BioPerl-live version because it will fail). Also use MAKER version 2.31.8 and not the 3.00.0-beta version.
>>>>
>>>> Then make sure there is no error further up. What you are seeing may be a snowball effect of the real error, which could be several screens back in the text. If you are using GFF3 files as input, then your format is probably incorrect.
>>>>
>>>> --Carson
>>>>
>>>>> On Apr 18, 2016, at 7:13 AM, Kai Kamm wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> while I have no problem running Maker on my desktop computer (Ubuntu 14.04 LTS), I always get the error below (for all contigs) when I try to run Maker on a server.
>>>>> >>>>> >>>>> ------------- EXCEPTION: Bio::Root::Exception ------------- >>>>> MSG: Did not specify a Query End or Query Begin >>>>> STACK: Error::throw >>>>> STACK: Bio::Root::Root::throw /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Root/Root.pm:449 >>>>> STACK: Bio::Search::HSP::GenericHSP::_query_seq_feature /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Search/HSP/GenericHSP.pm:1525 >>>>> STACK: Bio::Search::HSP::GenericHSP::query /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Search/HSP/GenericHSP.pm:956 >>>>> STACK: Bio::Search::HSP::HSPI::start /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Search/HSP/HSPI.pm:504 >>>>> STACK: PhatHit_utils::add_offset /homes/biertank/kai/maker/bin/../lib/PhatHit_utils.pm:1462 >>>>> STACK: GI::parse_abinit_file /homes/biertank/kai/maker/bin/../lib/GI.pm:1199 >>>>> STACK: Process::MpiChunk::_go /homes/biertank/kai/maker/bin/../lib/Process/MpiChunk.pm:1469 >>>>> STACK: Process::MpiChunk::run /homes/biertank/kai/maker/bin/../lib/Process/MpiChunk.pm:341 >>>>> STACK: main::node_thread /homes/biertank/kai/maker/bin/maker:1454 >>>>> STACK: threads::new /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/forks.pm:799 >>>>> STACK: /homes/biertank/kai/maker/bin/maker:914 >>>>> ----------------------------------------------------------- >>>>> --> rank=2, hostname=bioinf.uni-leipzig.de >>>>> ERROR: Failed while gathering ab-init output files >>>>> ERROR: Chunk failed at level:1, tier_type:2 >>>>> FAILED CONTIG:scaffold20_cov246 >>>>> >>>>> ERROR: Chunk failed at level:4, tier_type:0 >>>>> FAILED CONTIG:scaffold20_cov246 >>>>> >>>>> examining contents of the fasta file and run log >>>>> >>>>> >>>>> >>>>> I have tried to rerun "perl ./Build.PL" and then "./Build install" several times using different versions of Perl. To install the required Perl modules I have used "./Build installdeps" and I also tried installing the dependencies manually via CPAN - to no avail. >>>>> >>>>> Any idea? >>>>> >>>>> Thank you! 
>>>>> Kai

From bmoore at genetics.utah.edu Tue Apr 19 15:36:35 2016
From: bmoore at genetics.utah.edu (Barry Moore)
Date: Tue, 19 Apr 2016 21:36:35 +0000
Subject: [maker-devel] A way to compare 2 annotation runs?
In-Reply-To: <148F7C5C-38D9-4F2C-BA0C-010E133B7F34@gmail.com>
References: <57161FB2.30901@students.uni-mainz.de> <148F7C5C-38D9-4F2C-BA0C-010E133B7F34@gmail.com>
Message-ID:

The Sequence Ontology provides some tools for this:

SOBAcl has some pre-configured reports/graphs with some flexibility to modify their content/layout.
https://github.com/The-Sequence-Ontology/SOBA

This simple example provides a table for two GFF3 files of the count of feature types:

    SOBAcl --columns file --rows type --data type --data_type count \
        data/dmel-all-r5.30_0001000.gff data/dmel-all-r5.30_0010000.gff

More complex examples are available in the test file SOBA/t/sobacl_test.sh

The GAL library is a Perl library that works well with MAKER output and other valid GFF3 documents.
It has some scripts that would provide metrics along the lines of what you're looking for, but it is primarily a programming library to make it easy to roll your own.
https://github.com/The-Sequence-Ontology/GAL

If you're OK with a little bit of Perl code, you can modify the synopsis code in the README a bit to generate the splice complexity metrics described here (http://www.ncbi.nlm.nih.gov/pubmed/19236712):

    use GAL::Annotation;
    my $annot = GAL::Annotation->new(qw(file.gff file.fasta));
    my $features = $annot->features;

    my $genes = $features->search( {type => 'gene'} );
    while (my $gene = $genes->next) {
        print $gene->feature_id . "\t";
        print $gene->splice_complexity . "\n";
    }

Hope that helps,

Barry

On Apr 19, 2016, at 9:08 AM, Carson Holt wrote:

I'm going to ask Michael Campbell to answer this. He wrote a protocols paper that will help.

--Carson

On Apr 19, 2016, at 6:08 AM, Florian wrote:

Hello All,

We ran MAKER on a newly assembled genome for 3 iterations, since 2 seems to be the recommended standard and while on holiday I just ran it a third time. Now I want to compare the results of the iterations to see where the annotation (hopefully) improved/changed, but I can't really come up with a clever way to do this.

I reckon this has to be an often-solved problem, though I couldn't find a solution except an older entry in this mailing list, and that wasn't helpful.

So how are people assessing the quality of a MAKER run? How do you say one run was 'better' than another?

best regards & thanks for your input,
Florian
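As a rough illustration of the kind of table the SOBAcl example above produces (feature-type counts per GFF3 file), here is a minimal Python sketch. It is not SOBAcl or GAL code; the function names and the tiny in-memory inputs are hypothetical, and it only assumes standard nine-column, tab-separated GFF3.

```python
from collections import Counter


def count_feature_types(gff3_lines):
    """Count features per type (GFF3 column 3), ignoring comments and embedded FASTA."""
    counts = Counter()
    for line in gff3_lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # anything after this directive is sequence, not features
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) == 9:
            counts[fields[2]] += 1
    return counts


def type_by_file_table(counts_by_file):
    """Render a tab-separated table: one row per feature type, one column per file."""
    labels = list(counts_by_file)
    types = sorted({t for counts in counts_by_file.values() for t in counts})
    rows = ["\t".join(["type"] + labels)]
    for t in types:
        rows.append("\t".join([t] + [str(counts_by_file[label][t]) for label in labels]))
    return "\n".join(rows)


# Hypothetical miniature inputs standing in for two MAKER GFF3 files.
run_a = [
    "##gff-version 3",
    "scaf1\tmaker\tgene\t1\t900\t.\t+\t.\tID=g1",
    "scaf1\tmaker\tmRNA\t1\t900\t.\t+\t.\tID=g1-RA;Parent=g1",
    "scaf1\tmaker\texon\t1\t900\t.\t+\t.\tID=g1-RA:1;Parent=g1-RA",
]
run_b = run_a + ["scaf1\tmaker\texon\t950\t1200\t.\t+\t.\tID=g1-RA:2;Parent=g1-RA"]

table = type_by_file_table({
    "run_a": count_feature_types(run_a),
    "run_b": count_feature_types(run_b),
})
print(table)
```

For real files, an open file handle can be passed in place of the lists, since the function only iterates over lines.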
From MEC at stowers.org Tue Apr 19 15:44:04 2016
From: MEC at stowers.org (Cook, Malcolm)
Date: Tue, 19 Apr 2016 21:44:04 +0000
Subject: [maker-devel] A way to compare 2 annotation runs?
In-Reply-To:
References: <57161FB2.30901@students.uni-mainz.de> <148F7C5C-38D9-4F2C-BA0C-010E133B7F34@gmail.com>
Message-ID:

Just a quick thought.

The smallest summary of what you're after might be the Jaccard statistic between your two annotation sets, as computed by bedtools:
http://bedtools.readthedocs.org/en/latest/content/tools/jaccard.html

From: maker-devel [mailto:maker-devel-bounces at yandell-lab.org] On Behalf Of Barry Moore
Sent: Tuesday, April 19, 2016 4:37 PM
To: Florian; maker-devel
Cc: Campbell, Michael
Subject: Re: [maker-devel] A way to compare 2 annotation runs?

The Sequence Ontology provides some tools for this:

SOBAcl has some pre-configured reports/graphs with some flexibility to modify their content/layout.
https://github.com/The-Sequence-Ontology/SOBA

This simple example provides a table for two GFF3 files of the count of feature types:

    SOBAcl --columns file --rows type --data type --data_type count \
        data/dmel-all-r5.30_0001000.gff data/dmel-all-r5.30_0010000.gff

More complex examples are available in the test file SOBA/t/sobacl_test.sh

The GAL library is a Perl library that works well with MAKER output and other valid GFF3 documents.
It has some scripts that would provide metrics along the lines of what you're looking for, but it is primarily a programming library to make it easy to roll your own.
https://github.com/The-Sequence-Ontology/GAL

If you're OK with a little bit of Perl code, you can modify the synopsis code in the README a bit to generate the splice complexity metrics described here (http://www.ncbi.nlm.nih.gov/pubmed/19236712):

    use GAL::Annotation;
    my $annot = GAL::Annotation->new(qw(file.gff file.fasta));
    my $features = $annot->features;

    my $genes = $features->search( {type => 'gene'} );
    while (my $gene = $genes->next) {
        print $gene->feature_id . "\t";
        print $gene->splice_complexity . "\n";
    }

Hope that helps,

Barry

On Apr 19, 2016, at 9:08 AM, Carson Holt wrote:

I'm going to ask Michael Campbell to answer this. He wrote a protocols paper that will help.

--Carson

On Apr 19, 2016, at 6:08 AM, Florian wrote:

Hello All,

We ran MAKER on a newly assembled genome for 3 iterations, since 2 seems to be the recommended standard and while on holiday I just ran it a third time. Now I want to compare the results of the iterations to see where the annotation (hopefully) improved/changed, but I can't really come up with a clever way to do this.

I reckon this has to be an often-solved problem, though I couldn't find a solution except an older entry in this mailing list, and that wasn't helpful.

So how are people assessing the quality of a MAKER run? How do you say one run was 'better' than another?

best regards & thanks for your input,
Florian
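To make the Jaccard suggestion in this message concrete, the sketch below computes the statistic for two sets of genomic intervals as base pairs in the intersection divided by base pairs in the union. It is a plain-Python illustration of the idea, not bedtools' implementation; the interval tuples `(seqid, start, end)` are assumed to be half-open, and the example coordinates are made up.

```python
def covered_bases(intervals):
    """Total bases covered by (seqid, start, end) half-open intervals, overlaps counted once."""
    total = 0
    current = None  # [seqid, start, end] of the merged interval being extended
    for seqid, start, end in sorted(intervals):
        if current and seqid == current[0] and start <= current[2]:
            current[2] = max(current[2], end)  # overlapping or adjacent: extend
        else:
            if current:
                total += current[2] - current[1]
            current = [seqid, start, end]
    if current:
        total += current[2] - current[1]
    return total


def jaccard(a, b):
    """Jaccard statistic: intersecting bases / bases in the union (0.0 if both sets are empty)."""
    union = covered_bases(list(a) + list(b))
    if union == 0:
        return 0.0
    intersection = covered_bases(a) + covered_bases(b) - union
    return intersection / union


# Made-up exon coordinates for two annotation runs on one scaffold.
run_a = [("scaf1", 0, 100), ("scaf1", 50, 150)]   # covers 0-150
run_b = [("scaf1", 100, 200)]                     # covers 100-200

similarity = jaccard(run_a, run_b)
print(similarity)        # intersection 50 bp, union 200 bp
print(1 - similarity)    # the corresponding Jaccard distance
```

A value near 1 means the two runs annotate nearly the same bases; comparing run 1 vs. 2 and run 2 vs. 3 this way gives a single number per pair of iterations.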
From fdolze at students.uni-mainz.de Mon Apr 25 09:05:58 2016
From: fdolze at students.uni-mainz.de (Florian)
Date: Mon, 25 Apr 2016 17:05:58 +0200
Subject: [maker-devel] A way to compare 2 annotation runs?
In-Reply-To: <3F81B34E-B7FC-4E37-AFA2-514AB2A397F1@cshl.edu>
References: <57161FB2.30901@students.uni-mainz.de> <148F7C5C-38D9-4F2C-BA0C-010E133B7F34@gmail.com> <571E05B8.5080508@students.uni-mainz.de> <3F81B34E-B7FC-4E37-AFA2-514AB2A397F1@cshl.edu>
Message-ID: <571E3256.90705@students.uni-mainz.de>

Hi Mike,

We have run MAKER with keep_preds=0. For completeness I attached the options file we used. For the first run we used a SNAP model trained on CEGMA data, plus GeneMark and Augustus trained with their web service, and then iterated on the results.

We expect around 17,000-18,000 genes, but our annotation contains ~12.5k according to SOBAcl. If I remove the ~40% with AED values of 1, I will be left with very few compared to the expected number.

type X file type (count)
=========================================================================================================
|                        |../v2_second_round_functional_blast.gff|../v2_third_round_functional_blast.gff|
=========================================================================================================
|CDS                     |  63953                                |  65160                               |
+------------------------+---------------------------------------+--------------------------------------+
|contig                  |   5292                                |   5292                               |
+------------------------+---------------------------------------+--------------------------------------+
|exon                    |  60381                                |  61233                               |
+------------------------+---------------------------------------+--------------------------------------+
|expressed_sequence_match| 275160                                | 275160                               |
+------------------------+---------------------------------------+--------------------------------------+
|five_prime_UTR          |   9424                                |   8764                               |
+------------------------+---------------------------------------+--------------------------------------+
|gene                    |  12654                                |  12235                               |
+------------------------+---------------------------------------+--------------------------------------+
|mRNA                    |  13698                                |  13137                               |
+------------------------+---------------------------------------+--------------------------------------+
|match                   | 146111                                | 136852                               |
+------------------------+---------------------------------------+--------------------------------------+
|match_part              |1704978                                |1697601                               |
+------------------------+---------------------------------------+--------------------------------------+
|protein_match           | 421814                                | 421814                               |
+------------------------+---------------------------------------+--------------------------------------+
|three_prime_UTR         |   6894                                |   6325                               |
---------------------------------------------------------------------------------------------------------

regards,
Florian

On 25.04.2016 16:16, Campbell, Michael wrote:
> Hi Florian,
>
> You're not off topic here. I've attached the paper.
>
> Looking at the plot you sent, I'm guessing that there is a red dot right underneath the turquoise dot at (1,1); that would be consistent with the compare annotation script output. Do you have keep_preds=1 set in the maker_opts.ctl file? If so, that would explain the abundance of AED=1 annotations. When keep_preds is set to 1, all of the gene predictions are reported as gene models; when keep_preds is set to 0, only the models with evidence support are reported. Also, how many genes are you expecting and how many are you getting?
>
> The paper I attached goes over different approaches to building final gene sets. The plot attached suggests to me that you have a bunch of unsupported gene models that need to be cleaned out. I will commonly filter out any gene model with an AED of 1 unless it has a protein family domain. This will almost certainly bring the fraction of annotated gene models with an AED <0.5 up to around 90% or more.
>
> As annotations improve you do usually see fewer total genes, but they are longer.
>
> One of the best ways to get a feel for annotation quality is to load the annotations into a browser like Apollo or JBrowse and look at a few of your favorite genes.
>
> Thanks,
> Mike
>
> On Apr 25, 2016, at 7:55 AM, Florian wrote:
>
> Hello All,
>
> First off, thank you all for your input! I took a look at all your suggestions and have some questions:
>
> The SOBAcl tool is nice, but I can't seem to find a way to get to the AED values MAKER produces. For example, here is a line from my GFF file:
>
> scaffold2278_size3634 maker mRNA 124 2128 . - . ID=CRIP_012390-RA;Parent=CRIP_012390;Name=CRIP_012390-RA;Alias=maker-scaffold2278_size3634-augustus-gene-0.3-mRNA-1;_AED=0.16;_QI=0|1|0.33|1|1|1|3|0|574;_eAED=0.16;Note=Similar to Tbc1d25: TBC1 domain family member 25 (Mus musculus);
> Notice the _AED entry is in the 9th "field", combined with all the other descriptive data. Is there a way to get to this? The information about number and mean/distribution of length of genes, while certainly valuable, is hard for me to interpret. How would one classify improvement? More genes annotated? Fewer genes but longer averages?
>
> For the moment I will take a look at GAL, though Perl is not my strongest language.
>
> For the scripts Michael provided I have attached the results. It would be great if you could send me a pdf version of the paper you mentioned.
>
> The comparison script lists SN/SP/AC with >98%, which indicates there should be no big changes between annotations, right? But the cumulative AED graph shows a LOT of entries have an AED value of 1, which would indicate the exact opposite?
>
> You said 95% with less than 0.5 AED would be pretty good, so only ~55% would mean this is a pretty bad annotation?
>
> I am not sure if this is maybe too far off topic for the MAKER mailing list, but thank you for any clarification / input.
>
> kind regards,
> Florian
>
> On 20.04.2016 15:16, Campbell, Michael wrote:
>
> I suspect the Jaccard distance would let you see the annotation sets converging over iterations. The distance between run one and run three should be greater than the distance between run one and two or run two and three.
>
> MAKER calculates a modified Jaccard distance between the MAKER-generated gene models and the aligned evidence, called Annotation Edit Distance or AED. Comparing the distribution of AEDs between annotations is a way to tell which annotation set matches the evidence the best. As a rule of thumb, an annotation set is pretty good if greater than ~95% of the annotations have an AED less than 0.5.
>
> There is an accessory script in the MAKER bin called AED_cdf_generator.pl that helps in comparing AED scores. This script is mentioned in the protocols paper Carson mentioned. This paper also describes using protein family domains and homology to manually curated proteins in Swiss-Prot as quality metrics. Here is a link to the paper. Let me know if you need me to send you a pdf.
> http://onlinelibrary.wiley.com/doi/10.1002/0471250953.bi0411s48/abstract
>
> I also have a "use at your own risk" script on GitHub that I use to compare MAKER runs two at a time. The script is called compare_annotations_3.2.pl. This particular script has had a long evolution, so it is a little hard to follow the code, but it might be helpful.
> https://github.com/mscampbell/Genome_annotation
>
> The SOBA tool that Barry mentioned is a lot more flexible, and if you are familiar with Perl the GAL library does a lot of heavy lifting for you.
>
> Mike
> On Apr 19, 2016, at 5:44 PM, Cook, Malcolm wrote:
>
> Just a quick thought.
>
> The smallest summary of what you're after might be the Jaccard statistic between your two annotation sets, as computed by bedtools:
> http://bedtools.readthedocs.org/en/latest/content/tools/jaccard.html
>
> From: maker-devel [mailto:maker-devel-bounces at yandell-lab.org] On Behalf Of Barry Moore
> Sent: Tuesday, April 19, 2016 4:37 PM
> To: Florian; maker-devel
> Cc: Campbell, Michael
> Subject: Re: [maker-devel] A way to compare 2 annotation runs?
>
> The Sequence Ontology provides some tools for this:
>
> SOBAcl has some pre-configured reports/graphs with some flexibility to modify their content/layout.
> https://github.com/The-Sequence-Ontology/SOBA
>
> This simple example provides a table for two GFF3 files of the count of feature types:
>
>     SOBAcl --columns file --rows type --data type --data_type count \
>         data/dmel-all-r5.30_0001000.gff data/dmel-all-r5.30_0010000.gff
>
> More complex examples are available in the test file SOBA/t/sobacl_test.sh
>
> The GAL library is a Perl library that works well with MAKER output and other valid GFF3 documents. It has some scripts that would provide metrics along the lines of what you're looking for, but it is primarily a programming library to make it easy to roll your own.
> https://github.com/The-Sequence-Ontology/GAL
>
> If you're OK with a little bit of Perl code, you can modify the synopsis code in the README a bit to generate the splice complexity metrics described here (http://www.ncbi.nlm.nih.gov/pubmed/19236712):
>
>     use GAL::Annotation;
>     my $annot = GAL::Annotation->new(qw(file.gff file.fasta));
>     my $features = $annot->features;
>
>     my $genes = $features->search( {type => 'gene'} );
>     while (my $gene = $genes->next) {
>         print $gene->feature_id . "\t";
>         print $gene->splice_complexity . "\n";
>     }
>
> Hope that helps,
>
> Barry
>
> On Apr 19, 2016, at 9:08 AM, Carson Holt wrote:
>
> I'm going to ask Michael Campbell to answer this. He wrote a protocols paper that will help.
>
> --Carson
>
> On Apr 19, 2016, at 6:08 AM, Florian wrote:
>
> Hello All,
>
> We ran MAKER on a newly assembled genome for 3 iterations, since 2 seems to be the recommended standard and while on holiday I just ran it a third time. Now I want to compare the results of the iterations to see where the annotation (hopefully) improved/changed, but I can't really come up with a clever way to do this.
>
> I reckon this has to be an often-solved problem, though I couldn't find a solution except an older entry in this mailing list, and that wasn't helpful.
>
> So how are people assessing the quality of a MAKER run? How do you say one run was 'better' than another?
>
> best regards & thanks for your input,
> Florian

-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts_run2.log
Type: text/x-log
Size: 4937 bytes
Desc: not available
URL:

From carsonhh at gmail.com Mon Apr 25 09:30:24 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 25 Apr 2016 09:30:24 -0600
Subject: [maker-devel] A way to compare 2 annotation runs?
In-Reply-To: <571E3256.90705@students.uni-mainz.de>
References: <57161FB2.30901@students.uni-mainz.de> <148F7C5C-38D9-4F2C-BA0C-010E133B7F34@gmail.com> <571E05B8.5080508@students.uni-mainz.de> <3F81B34E-B7FC-4E37-AFA2-514AB2A397F1@cshl.edu> <571E3256.90705@students.uni-mainz.de>
Message-ID: <8287C01C-93C2-4BCA-9483-4EEE0E584ACD@gmail.com>

If you're running with keep_preds=0, then you either passed in models with model_gff (always kept even without evidence support), or you are parsing out the AED of non-gene reference models from the GFF3 when building your CDF graph. If that is the case, make sure you only pull AED off of features labeled as mRNA in column 3 (the type column) of the GFF3 and not off of match features.

--Carson

> On Apr 25, 2016, at 9:05 AM, Florian wrote:
>
> Hi Mike,
>
> We have run MAKER with keep_preds=0. For completeness I attached the options file we used. For the first run we used a SNAP model trained on CEGMA data, plus GeneMark and Augustus trained with their web service, and then iterated on the results.
>
> We expect around 17,000-18,000 genes, but our annotation contains ~12.5k according to SOBAcl. If I remove the ~40% with AED values of 1, I will be left with very few compared to the expected number.
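Carson's advice in this message (read AED only from mRNA features, never from match features) can be sketched in a few lines of Python. This is an illustrative sketch, not a MAKER script: the function names are hypothetical, the first example line is an abridged version of the GFF3 line Florian quoted in this thread (feature type `mRNA` in the third tab-separated column, source `maker` in the second), and the other two example lines are invented to show the filtering.

```python
def mrna_aeds(gff3_lines):
    """Yield the _AED attribute of mRNA features only (feature type is GFF3 column 3)."""
    for line in gff3_lines:
        if line.startswith("##FASTA"):
            break  # the rest of the file is sequence
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "mRNA":
            continue  # skips gene, exon, match, match_part, ...
        for attribute in fields[8].split(";"):
            key, sep, value = attribute.partition("=")
            if sep and key.strip() == "_AED":  # exact key, so _eAED is not picked up
                yield float(value)
                break


def fraction_below(aeds, cutoff=0.5):
    """Fraction of AED values strictly below the cutoff (0.0 for an empty input)."""
    aeds = list(aeds)
    return sum(a < cutoff for a in aeds) / len(aeds) if aeds else 0.0


demo = [
    "##gff-version 3",
    # Abridged from the line quoted in this thread:
    "scaffold2278_size3634\tmaker\tmRNA\t124\t2128\t.\t-\t.\t"
    "ID=CRIP_012390-RA;Parent=CRIP_012390;_AED=0.16;_eAED=0.16",
    # Invented lines: an unsupported mRNA and a non-mRNA feature that must be ignored.
    "scaffold1\tmaker\tmRNA\t10\t500\t.\t+\t.\tID=demo-RA;Parent=demo;_AED=1.00;_eAED=1.00",
    "scaffold1\tsnap_masked\tmatch\t10\t500\t.\t+\t.\tID=demo-match;_AED=1.00",
]

aeds = list(mrna_aeds(demo))
print(aeds, fraction_below(aeds))
```

The resulting fraction is the number to hold against the "~95% of annotations with AED below 0.5" rule of thumb mentioned elsewhere in this thread; if match features were counted too, the share of AED=1 entries would be inflated.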
>
> type X file type (count)
> =========================================================================================================
> |                        |../v2_second_round_functional_blast.gff|../v2_third_round_functional_blast.gff|
> =========================================================================================================
> |CDS                     |  63953                                |  65160                               |
> +------------------------+---------------------------------------+--------------------------------------+
> |contig                  |   5292                                |   5292                               |
> +------------------------+---------------------------------------+--------------------------------------+
> |exon                    |  60381                                |  61233                               |
> +------------------------+---------------------------------------+--------------------------------------+
> |expressed_sequence_match| 275160                                | 275160                               |
> +------------------------+---------------------------------------+--------------------------------------+
> |five_prime_UTR          |   9424                                |   8764                               |
> +------------------------+---------------------------------------+--------------------------------------+
> |gene                    |  12654                                |  12235                               |
> +------------------------+---------------------------------------+--------------------------------------+
> |mRNA                    |  13698                                |  13137                               |
> +------------------------+---------------------------------------+--------------------------------------+
> |match                   | 146111                                | 136852                               |
> +------------------------+---------------------------------------+--------------------------------------+
> |match_part              |1704978                                |1697601                               |
> +------------------------+---------------------------------------+--------------------------------------+
> |protein_match           | 421814                                | 421814                               |
> +------------------------+---------------------------------------+--------------------------------------+
> |three_prime_UTR         |   6894                                |   6325                               |
> ---------------------------------------------------------------------------------------------------------
>
> regards,
> Florian
>
> On 25.04.2016 16:16, Campbell, Michael wrote:
>> Hi Florian,
>>
>> You're not off topic here.
I've attached the paper.
>>
>> Looking at the plot you sent, I'm guessing that there is a red dot right underneath the turquoise dot at (1,1); that would be consistent with the compare annotation script output. Do you have keep_preds=1 set in the maker_opts.ctl file? If so, that would explain the abundance of AED=1 annotations. When keep_preds is set to 1, all of the gene predictions are reported as gene models; when keep_preds is set to 0, only the models with evidence support are reported. Also, how many genes are you expecting and how many are you getting?
>>
>> The paper I attached goes over different approaches to building final gene sets. The plot attached suggests to me that you have a bunch of unsupported gene models that need to be cleaned out. I will commonly filter out any gene model with an AED of 1 unless it has a protein family domain. This will almost certainly bring the fraction of annotated gene models with an AED <0.5 up to around 90% or more.
>>
>> As annotations improve you do usually see fewer total genes, but they are longer.
>>
>> One of the best ways to get a feel for annotation quality is to load the annotations into a browser like Apollo or JBrowse and look at a few of your favorite genes.
>>
>> Thanks,
>> Mike
>>
>> On Apr 25, 2016, at 7:55 AM, Florian wrote:
>>
>> Hello All,
>>
>> First off, thank you all for your input! I took a look at all your suggestions and have some questions:
>>
>> The SOBAcl tool is nice, but I can't seem to find a way to get to the AED values MAKER produces. For example, here is a line from my GFF file:
>>
>> scaffold2278_size3634 maker mRNA 124 2128 . - . ID=CRIP_012390-RA;Parent=CRIP_012390;Name=CRIP_012390-RA;Alias=maker-scaffold2278_size3634-augustus-gene-0.3-mRNA-1;_AED=0.16;_QI=0|1|0.33|1|1|1|3|0|574;_eAED=0.16;Note=Similar to Tbc1d25: TBC1 domain family member 25 (Mus musculus);
>> Notice the _AED entry is in the 9th "field", combined with all the other descriptive data. Is there a way to get to this?
The information about number and mean/distribution of length of genes, while certainly valuable, is hard to interpret for me. How would one classify improvement? More genes annotated? Less genes but longer averages? >> >> For the moment I will take a look at GAL, though perl is not my strongest language. >> >> >> For the scripts Michael provided I have attached the results. It would be great if you could send me a pdf version of the paper you mentioned. >> >> The comparison script lists SN/SP/AC with >98% which indicates there should be no big changes between annotations right? But the cumulative AED graph shows a LOT entries have an AED value of 1 which would indicate the exact opposite? >> >> You said 95% with less than 0.5 AED would be pretty good, soo only ~55% would mean this is a pretty bad annotation? >> >> I am not sure if this is maybe to far off topic for the maker mailing list, but thank you for any clarification / input. >> >> >> >> kind regards, >> Florian >> >> On 20.04.2016 15:16, Campbell, Michael wrote: >> >> I suspect the Jaccard distance would let you see the annotation sets converging over iterations. The distance between run one and run three should be greater than the distance between run one and two or run two and three. >> >> MAKER calculates a modified Jaccard distance between the MAKER generated gene models and the aligned evidence called Annotation Edit Distance or AED. Comparing the distribution of AEDs between annotations is a way to tell which annotation set matches the evidence the best. As a rule of thumb an annotation set is pretty good if greater than ~95% of the annotations have an AED less than 0.5. >> >> There is an accessory script in the MAKER bin called AED_cdf_generator.pl that helps in comparing AED scores. This script is mentioned in the protocols paper Carson mentioned. This paper also describes using protein family domains and homology to manually curated proteins in swissprot as quality metrics. 
Here is a link to the paper. Let me know if you need me to send you a pdf. >> http://onlinelibrary.wiley.com/doi/10.1002/0471250953.bi0411s48/abstract >> >> I also have a "use at your own risk" script on github that I use to compare MAKER runs two at a time. the script is called compare_annotations_3.2.pl. This particular script has had a long evolution, so it is a little hard to follow the code, but it might be helpful. >> https://github.com/mscampbell/Genome_annotation >> >> The SOBA tool that Barry mentioned is a lot more flexible and if you are familiar with perl the GAL library does a lot of heavy lifting for you. >> >> Mike >> On Apr 19, 2016, at 5:44 PM, Cook, Malcolm > wrote: >> >> Just a quick thought >> >> The smallest summary of what you?re after might be the jaccard difference between you annotation as computed by bedtoolshttp://bedtools.readthedocs.org/en/latest/content/tools/jaccard.html >> >> ?? >> >> >> From: maker-devel [mailto:maker-devel-bounces at yandell-lab.org] On Behalf Of Barry Moore >> Sent: Tuesday, April 19, 2016 4:37 PM >> To: Florian >; maker-devel > >> Cc: Campbell, Michael > >> Subject: Re: [maker-devel] A way to compare 2 annotation runs? >> >> The Sequence Ontology provides some tools for this: >> >> SOBAcl has some pre-configured reports/graphs with some flexibility to modify their content/layout. >> https://github.com/The-Sequence-Ontology/SOBA >> >> This simple example provides a table for two GFF3 files of the count of feature types: >> >> >> SOBAcl --columns file --rows type --data type --data_type count \ >> >> data/dmel-all-r5.30_0001000.gff data/dmel-all-r5.30_0010000.gff >> >> More complex examples are available in the test file SOBA/t/sobacl_test.sh >> >> >> The GAL library is a perl library that works well with MAKER output and other valid GFF3 documents. 
It has some scripts that would provide metrics along the lines of what you're looking for, but it is primarily a programming library to make it easy to roll your own.
>> https://github.com/The-Sequence-Ontology/GAL
>>
>> If you're OK with a little bit of Perl code, then by modifying the synopsis code in the README a bit you can easily generate the splice complexity metrics described here (http://www.ncbi.nlm.nih.gov/pubmed/19236712):
>>
>> use GAL::Annotation;
>>
>> # file.gff and file.fasta are placeholder file names
>> my $annot = GAL::Annotation->new(qw(file.gff file.fasta));
>> my $features = $annot->features;
>>
>> # Print one line per gene: feature ID, tab, splice complexity
>> my $genes = $features->search( {type => 'gene'} );
>> while (my $gene = $genes->next) {
>>     print $gene->feature_id . "\t";
>>     print $gene->splice_complexity . "\n";
>> }
>>
>> Hope that helps,
>>
>> Barry
>>
>> On Apr 19, 2016, at 9:08 AM, Carson Holt wrote:
>>
>> I'm going to ask Michael Campbell to answer this. He wrote a protocols paper that will help.
>>
>> --Carson
>>
>> On Apr 19, 2016, at 6:08 AM, Florian wrote:
>>
>> Hello All,
>>
>> We ran MAKER on a newly assembled genome for 3 iterations, since 2 seems to be the recommended standard and while on holiday I just ran it a third time. Now I want to compare the results of the iterations to see where the annotation (hopefully) improved/changed but I cant really come up with a clever way to this.
>>
>> I reckon this has to be an often solved problem though I couldnt find a solution except an older entry in this mail-list but that wasnt helpful.
>>
>> So how are people assessing quality of a maker run? How do you say one run was 'better' than another?
>>
>> best regards & thanks for your input,
>> Florian
>>
>> _______________________________________________
>> maker-devel mailing list
>> maker-devel at yandell-lab.org
>> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org

From fdolze at students.uni-mainz.de Mon Apr 25 11:00:15 2016 From: fdolze at students.uni-mainz.de (Dolze, Florian) Date: Mon, 25 Apr 2016 17:00:15 +0000 Subject: [maker-devel] A way to compare 2 annotation runs? In-Reply-To: <8287C01C-93C2-4BCA-9483-4EEE0E584ACD@gmail.com> References: <57161FB2.30901@students.uni-mainz.de> <148F7C5C-38D9-4F2C-BA0C-010E133B7F34@gmail.com> <571E05B8.5080508@students.uni-mainz.de> <3F81B34E-B7FC-4E37-AFA2-514AB2A397F1@cshl.edu> <571E3256.90705@students.uni-mainz.de>, <8287C01C-93C2-4BCA-9483-4EEE0E584ACD@gmail.com> Message-ID: <1B00E9C3-490A-4C06-A188-1F9EBC02F680@students.uni-mainz.de>

This might be the case; I simply ran the script on my complete output GFF with all features in it, without manually filtering for mRNA only.

On a side note regarding keep_preds: if I wanted to call genes somewhat less stringently because I am expecting to find more, would I set this to e.g. 0.5 to increase the number of genes called?

-Florian

> On 25.04.2016 at 17:30, Carson Holt wrote:
>
> If you're running with keep_preds=0, then you either passed in models with model_gff (always kept even without evidence support), or you are parsing out the AED of non-gene reference models from the GFF3 when building your CDF graph. If that is the case, make sure you only pull AED off of features labeled as mRNA in column 3 of the GFF3 and not match features.
>
> --Carson
>
>> On Apr 25, 2016, at 9:05 AM, Florian wrote:
>>
>> Hi Mike,
>>
>> We have run MAKER with keep_preds=0. For completeness I attached the options file we used. We used a SNAP model trained on CEGMA data, GeneMark and Augustus trained with their webservice for the first run and then iterated on the results.
>>
>> We expect around 17.000-18.000 genes, but our annotation contains ~12.5k according to SOBAcl. If I remove ~40% with AED values of 1 I will be left with very few compared to the expected number.
>>
>> regards,
>> Florian

From carsonhh at gmail.com Mon Apr 25 11:03:32 2016 From: carsonhh at gmail.com (Carson Holt) Date: Mon, 25 Apr 2016 11:03:32 -0600 Subject: [maker-devel] A way to compare 2 annotation runs?
In-Reply-To: <1B00E9C3-490A-4C06-A188-1F9EBC02F680@students.uni-mainz.de> References: <57161FB2.30901@students.uni-mainz.de> <148F7C5C-38D9-4F2C-BA0C-010E133B7F34@gmail.com> <571E05B8.5080508@students.uni-mainz.de> <3F81B34E-B7FC-4E37-AFA2-514AB2A397F1@cshl.edu> <571E3256.90705@students.uni-mainz.de> <8287C01C-93C2-4BCA-9483-4EEE0E584ACD@gmail.com> <1B00E9C3-490A-4C06-A188-1F9EBC02F680@students.uni-mainz.de> Message-ID:

keep_preds can be set to 0 or 1 right now. By definition anything not kept has an AED of 1, so you really only turn it on or off. There had been discussion about doing something more complex for when multiple gene predictors are present and support each other. But for now it is an on/off parameter.

--Carson

> On Apr 25, 2016, at 11:00 AM, Dolze, Florian wrote:
>
> This might be the case, I simply used the script on my complete output gff with all features in it without manually filtering for only mRNA.
>
> On a side note regarding keep_preds, if I wanted to call genes somewhat less stringent because I am expecting to find more, would I set this to e.g. 0.5 to increase the number called of genes?
>
> -Florian

From bmoore at genetics.utah.edu Mon Apr 25 13:46:23 2016 From: bmoore at genetics.utah.edu (Barry Moore) Date: Mon, 25 Apr 2016 19:46:23 +0000 Subject: [maker-devel] A way to compare 2 annotation runs? In-Reply-To: <571E05B8.5080508@students.uni-mainz.de> References: <57161FB2.30901@students.uni-mainz.de> <148F7C5C-38D9-4F2C-BA0C-010E133B7F34@gmail.com> <571E05B8.5080508@students.uni-mainz.de> Message-ID: <349A414A-BA65-420E-9A39-5B3583993AB9@genetics.utah.edu>

Hi Florian,

Something like this should work:

SOBAcl --data +_AED t/data/refseq_short.gff3 --data_type mean

Sorry, this feature was undocumented, and I discovered a bug in it while I was looking at it just now, so you'll need to pull an update from git for it to work correctly. Basically if you add a "+"
to the value passed to --data, SOBAcl will treat the --data value (with the + removed) as a key to look up the value in the attributes from column 9, so if +_AED is given on the command line, then the value of the _AED attribute will be used for the summary statistics. Note that if the attribute is missing for a given feature, then 0 is used as the value (which is of course different from treating it as NULL). Also note that if --data_type is count, then features that have the given attribute are counted regardless of the value of the attribute.

Just FYI, grabbing those values with a GAL script would look like this (untested):

    use GAL::Annotation;

    my $annot = GAL::Annotation->new(qw(file.gff));
    my $features = $annot->features;

    my $mRNAs = $features->search( {type => 'mRNA'} );
    while (my $mRNA = $mRNAs->next) {
        print $mRNA->feature_id;
        print "\t";
        print $mRNA->attribute_value('_AED');
        print "\n";
    }

B

On Apr 25, 2016, at 5:55 AM, Florian wrote:

Hello All,

First off, thank you all for your input! I took a look at all your suggestions and have some questions:

The SOBAcl tool is nice, but I can't seem to find a way to get to the AED values MAKER produces. For example, here is a line from my GFF file:

scaffold2278_size3634 maker mRNA 124 2128 . - . ID=CRIP_012390-RA;Parent=CRIP_012390;Name=CRIP_012390-RA;Alias=maker-scaffold2278_size3634-augustus-gene-0.3-mRNA-1;_AED=0.16;_QI=0|1|0.33|1|1|1|3|0|574;_eAED=0.16;Note=Similar to Tbc1d25: TBC1 domain family member 25 (Mus musculus);

Notice the _AED entry is in the 9th "field", combined with all the other descriptive data. Is there a way to get to this?

The information about the number and mean/distribution of gene lengths, while certainly valuable, is hard for me to interpret. How would one classify improvement? More genes annotated? Fewer genes but longer averages?

For the moment I will take a look at GAL, though Perl is not my strongest language.

For the scripts Michael provided I have attached the results.
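
For anyone who would rather not go through Perl, here is a minimal sketch of reading _AED straight out of column 9 of a MAKER GFF3 file. It assumes the file is tab-separated with key=value pairs joined by ";" in the ninth column, as in the line above. The function names (parse_attributes, aed_values, fraction_below) are made up for illustration and are not part of MAKER, SOBA, or GAL. Unlike the SOBAcl behaviour described earlier, features without an _AED attribute are skipped here rather than counted as 0, and the sample attribute string is abbreviated from the line above.

```python
def parse_attributes(column9):
    """Split a GFF3 attributes field (column 9) into a dict."""
    attrs = {}
    for pair in column9.strip().strip(";").split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)  # split on the first "=" only
            attrs[key.strip()] = value.strip()
    return attrs


def aed_values(gff_lines, feature_type="mRNA"):
    """Return {feature ID: _AED} for every feature of the given type."""
    values = {}
    for number, line in enumerate(gff_lines, start=1):
        if line.startswith("#") or not line.strip():
            continue  # comment/directive or blank line
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != feature_type:
            continue
        attrs = parse_attributes(cols[8])
        if "_AED" in attrs:  # skipped, not treated as 0, when absent
            feature_id = attrs.get("ID", "line%d" % number)
            values[feature_id] = float(attrs["_AED"])
    return values


def fraction_below(values, cutoff=0.5):
    """Fraction of features with AED below the cutoff (None if no features)."""
    if not values:
        return None
    below = sum(1 for aed in values.values() if aed < cutoff)
    return below / len(values)


# Abbreviated version of the mRNA line quoted in this thread.
SAMPLE = "\t".join([
    "scaffold2278_size3634", "maker", "mRNA", "124", "2128", ".", "-", ".",
    "ID=CRIP_012390-RA;Parent=CRIP_012390;_AED=0.16;"
    "_QI=0|1|0.33|1|1|1|3|0|574;_eAED=0.16;",
])

if __name__ == "__main__":
    aeds = aed_values([SAMPLE])
    for feature_id, aed in sorted(aeds.items()):
        print(feature_id, aed, sep="\t")
    print("fraction with AED < 0.5:", fraction_below(aeds))
```

To use it on a real file, pass an open file handle instead of the one-line list, e.g. aed_values(open("run3.gff")), where run3.gff is a placeholder name. Running it once per MAKER iteration and comparing the fractions gives a rough per-run number to set against the cumulative AED plot.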
It would be great if you could send me a PDF version of the paper you mentioned. The comparison script lists SN/SP/AC with >98%, which indicates there should be no big changes between annotations, right? But the cumulative AED graph shows a LOT of entries have an AED value of 1, which would indicate the exact opposite? You said 95% with less than 0.5 AED would be pretty good, so only ~55% would mean this is a pretty bad annotation?

I am not sure if this is maybe too far off topic for the maker mailing list, but thank you for any clarification / input.

kind regards,
Florian

On 20.04.2016 15:16, Campbell, Michael wrote:

I suspect the Jaccard distance would let you see the annotation sets converging over iterations. The distance between run one and run three should be greater than the distance between run one and two or run two and three.

MAKER calculates a modified Jaccard distance between the MAKER-generated gene models and the aligned evidence, called Annotation Edit Distance (AED). Comparing the distribution of AEDs between annotations is a way to tell which annotation set matches the evidence best. As a rule of thumb, an annotation set is pretty good if more than ~95% of the annotations have an AED less than 0.5. There is an accessory script in the MAKER bin called AED_cdf_generator.pl that helps in comparing AED scores. This script is mentioned in the protocols paper Carson mentioned. The paper also describes using protein family domains and homology to manually curated proteins in SwissProt as quality metrics. Here is a link to the paper. Let me know if you need me to send you a PDF.

http://onlinelibrary.wiley.com/doi/10.1002/0471250953.bi0411s48/abstract

I also have a "use at your own risk" script on GitHub that I use to compare MAKER runs two at a time. The script is called compare_annotations_3.2.pl. This particular script has had a long evolution, so it is a little hard to follow the code, but it might be helpful.
https://github.com/mscampbell/Genome_annotation

The SOBA tool that Barry mentioned is a lot more flexible, and if you are familiar with Perl the GAL library does a lot of heavy lifting for you.

Mike

On Apr 19, 2016, at 5:44 PM, Cook, Malcolm wrote:

Just a quick thought.

The smallest summary of what you're after might be the Jaccard difference between your annotations as computed by bedtools: http://bedtools.readthedocs.org/en/latest/content/tools/jaccard.html

From: maker-devel [mailto:maker-devel-bounces at yandell-lab.org] On Behalf Of Barry Moore
Sent: Tuesday, April 19, 2016 4:37 PM
To: Florian; maker-devel
Cc: Campbell, Michael
Subject: Re: [maker-devel] A way to compare 2 annotation runs?

The Sequence Ontology provides some tools for this:

SOBAcl has some pre-configured reports/graphs with some flexibility to modify their content/layout.
https://github.com/The-Sequence-Ontology/SOBA

This simple example provides a table for two GFF3 files of the count of feature types:

    SOBAcl --columns file --rows type --data type --data_type count \
        data/dmel-all-r5.30_0001000.gff data/dmel-all-r5.30_0010000.gff

More complex examples are available in the test file SOBA/t/sobacl_test.sh

The GAL library is a Perl library that works well with MAKER output and other valid GFF3 documents. It has some scripts that would provide metrics along the lines of what you're looking for, but it is primarily a programming library to make it easy to roll your own:
https://github.com/The-Sequence-Ontology/GAL

If you're OK with a little bit of Perl code, then by modifying the synopsis code in the README a bit you can generate the splice complexity metrics described here (http://www.ncbi.nlm.nih.gov/pubmed/19236712):

    use GAL::Annotation;

    my $annot = GAL::Annotation->new(qw(file.gff file.fasta));
    my $features = $annot->features;

    my $genes = $features->search( {type => 'gene'} );
    while (my $gene = $genes->next) {
        print $gene->feature_id . "\t";
        print $gene->splice_complexity . "\n";
    }

Hope that helps,

Barry

On Apr 19, 2016, at 9:08 AM, Carson Holt wrote:

I'm going to ask Michael Campbell to answer this. He wrote a protocols paper that will help.

--Carson

On Apr 19, 2016, at 6:08 AM, Florian wrote:

Hello All,

We ran MAKER on a newly assembled genome for 3 iterations, since 2 seems to be the recommended standard and while on holiday I just ran it a third time. Now I want to compare the results of the iterations to see where the annotation (hopefully) improved/changed, but I can't really come up with a clever way to do this.

I reckon this has to be an often-solved problem, though I couldn't find a solution except an older entry in this mailing list, and that wasn't helpful.

So how are people assessing the quality of a MAKER run? How do you say one run was 'better' than another?

best regards & thanks for your input,
Florian

_______________________________________________
maker-devel mailing list
maker-devel at yandell-lab.org
http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org

From bmoore at genetics.utah.edu Mon Apr 25 21:04:23 2016
From: bmoore at genetics.utah.edu (Barry Moore)
Date: Tue, 26 Apr 2016 03:04:23 +0000
Subject: [maker-devel] BUSCO
References:
Message-ID: <6C6AD04A-CAC0-40CA-B3C3-C42E2D11945A@genetics.utah.edu>

I'm posting this message to the mailing list on behalf of Ian Misner. Ian, sorry your message and subscription request haven't gone through.
The ISP that supports all of our mailing lists, including maker, is having issues with the mailman software that they can't seem to resolve, so we currently can't approve held messages or add new subscribers. We're in the process of working out a new mailing list option. Thanks for your patience!

Begin forwarded message:

Hello,

Are there any guidelines for using BUSCO to help train MAKER? CEGMA has been discontinued, but I used to use the cegma2zff.pl steps to use those proteins as a training step. BUSCO seems to train Augustus, but I'm not sure what file to pass from BUSCO to MAKER for this to be properly utilized. I didn't see anything specific about this in the archives.

-----
Ian Misner, Ph.D.
Computational Genomics Specialist
Contractor, Medical Science and Computing, Inc.
Bioinformatics and Computational Biosciences Branch (BCBB)
NIH/NIAID/OD/OSMO/OCICB
5601 Fishers Lane, Room 4A59
Rockville, MD 20892
Office: 301-761-6208
Mobile: 301-704-0151
Email: ian.misner at nih.gov
Web: BCBB Home Page
Twitter: @NIAIDBioIT

Disclaimer: The information in this e-mail and any of its attachments is confidential and may contain sensitive information. It should not be used by anyone who is not the original intended recipient. If you have received this e-mail in error please inform the sender and delete it from your mailbox or any other storage devices. National Institute of Allergy and Infectious Diseases shall not accept liability for any statements made that are sender's own and not expressly made on behalf of the NIAID by one of its representatives.

From bmoore at genetics.utah.edu Mon Apr 25 21:12:15 2016
From: bmoore at genetics.utah.edu (Barry Moore)
Date: Tue, 26 Apr 2016 03:12:15 +0000
Subject: [maker-devel] maker-revel mailing list problems
Message-ID: <7157D2ED-8F5A-4B62-BA71-6DF43831FC60@genetics.utah.edu>

Hi all,

Just wanted to give everyone a heads up that we're experiencing problems with our mailing list server. Our mailing lists are supplied by an external ISP, and the lists and support have been great for years, but lately the admin/moderator interface won't allow us to approve any messages flagged for moderation or approve any new subscribers. This won't affect most of you receiving this, as all non-moderated traffic seems to be unaffected, but if you notice problems please let one of the moderators know directly:

Carson Holt
Michael Campbell
Barry Moore

We're in the process of finding and migrating to a new mailing list server. We'll do our best to minimize disruption and let you know as soon as we have a new system in place. Thanks for your patience.

Barry Moore

From xvazquezc at gmail.com Mon Apr 25 21:17:46 2016
From: xvazquezc at gmail.com (=?UTF-8?Q?Xabier_V=C3=A1zquez_Campos?=)
Date: Tue, 26 Apr 2016 13:17:46 +1000
Subject: [maker-devel] BUSCO
In-Reply-To: <6C6AD04A-CAC0-40CA-B3C3-C42E2D11945A@genetics.utah.edu>
References: <6C6AD04A-CAC0-40CA-B3C3-C42E2D11945A@genetics.utah.edu>
Message-ID:

With Augustus installed, BUSCO will generate the training files in the Augustus species folder. Afterwards you only need to indicate the species profile in the Maker config file as usual.

The BUSCO developers say that the long run produces a better profile and should be used if you run the program to train Augustus.

This is the command I used:

    python3 BUSCO_v1.1b1.py -f -c 8 --long -o Genus_species \
        -in /PATH/TO/ASSEMBLY/contigs.fa -l /PATH/TO/PROFILE/fungi -m genome

On 26 April 2016 at 13:04, Barry Moore wrote:

> I'm posting this message to the mailing list on behalf of Ian Misner.
> Ian, sorry your message and subscription request haven't gone through. The ISP that supports all of our mailing lists, including maker, is having issues with the mailman software that they can't seem to resolve, so we currently can't approve held messages or add new subscribers. We're in the process of working out a new mailing list option. Thanks for your patience!
>
> Begin forwarded message:
>
> Hello,
>
> Are there any guidelines for using BUSCO to help train MAKER? CEGMA has been discontinued, but I used to use the cegma2zff.pl steps to use those proteins as a training step. BUSCO seems to train Augustus, but I'm not sure what file to pass from BUSCO to MAKER for this to be properly utilized. I didn't see anything specific about this in the archives.
>
> -----
> Ian Misner, Ph.D.
> Computational Genomics Specialist
> Contractor, Medical Science and Computing, Inc.
> Bioinformatics and Computational Biosciences Branch (BCBB)
> NIH/NIAID/OD/OSMO/OCICB
> 5601 Fishers Lane, Room 4A59
> Rockville, MD 20892
> Office: 301-761-6208
> Mobile: 301-704-0151
> Email: ian.misner at nih.gov
> Web: BCBB Home Page
> Twitter: @NIAIDBioIT
>
> Disclaimer: The information in this e-mail and any of its attachments is confidential and may contain sensitive information. It should not be used by anyone who is not the original intended recipient. If you have received this e-mail in error please inform the sender and delete it from your mailbox or any other storage devices. National Institute of Allergy and Infectious Diseases shall not accept liability for any statements made that are sender's own and not expressly made on behalf of the NIAID by one of its representatives.
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at yandell-lab.org
> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org

--
Xabier Vázquez-Campos, PhD
Research Associate
Water Research Centre
School of Civil and Environmental Engineering
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

From dence at genetics.utah.edu Wed Apr 27 12:16:28 2016
From: dence at genetics.utah.edu (Daniel Ence)
Date: Wed, 27 Apr 2016 18:16:28 +0000
Subject: [maker-devel] Maker example data for 2013 GMOD summer school
In-Reply-To: <8F27CEB4-B16B-4BDC-BA11-5FCCBD05BC3C@ucr.edu>
References: <1772AAA1-C6ED-4FCA-B4C9-39F522D3D076@genetics.utah.edu> <8F27CEB4-B16B-4BDC-BA11-5FCCBD05BC3C@ucr.edu>
Message-ID: <2572DB54-6C29-483E-AAAB-7626FEE76DFC@genetics.utah.edu>

Hi Qihua,

In the maker_opts.ctl file there is an option "cpus" which lets you tell blast to use more than 1 CPU. The comment for the line says that you should not set this higher than 1 when using MPI. I believe the reason is that each MPI thread runs blast on its own, so the number of CPUs used will be the number of MPI threads x the number of cpus for blast, which can quickly get larger than the number of CPUs available.

At the same time, it's usually not advisable to use tblastx to align large datasets because of the increased amount of time it takes. Are these RNAseq datasets from another species that you're using tblastx for?

Daniel Ence
Graduate Student
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330

> On Apr 27, 2016, at 12:06 PM, Qihua Liang wrote:
>
> Hi, Daniel
>
> I have been using Maker to annotate the cowpea genome for a while, but now I am wondering if I could use multiple threads instead of a single one? It has been running tblastx for such a long time using a single thread.
> But I couldn't find such settings in the documentation to assign multiple threads to run Maker. Is there such an option?
>
> Thank you
> Qihua
>
>> On Mar 30, 2016, at 2:17 PM, Daniel Ence wrote:
>>
>> Hi Qihua,
>>
>> I believe that most of the data we used in the tutorials are available in the maker/data directory, which is included in all maker distributions. Please let me know if that isn't the case.
>>
>> ~Daniel
>>
>> Daniel Ence
>> Graduate Student
>> Eccles Institute of Human Genetics
>> University of Utah
>> 15 North 2030 East, Room 2100
>> Salt Lake City, UT 84112-5330
>>
>>> On Mar 30, 2016, at 3:10 PM, Qihua Liang wrote:
>>>
>>> Hi Michael and Daniel,
>>>
>>> I am a graduate student at UC Riverside, and recently I have been learning to use Maker for genome annotation. I was trying to find some tutorials to follow and practice on example data, and I found out that you gave a talk on Maker during the 2013 GMOD summer school, and the tutorial for that is very detailed. Nice job!
>>>
>>> But the example data under the folder you mentioned as ./maker/maker_course is not provided on the website, and I am wondering if they are available to the public or not. If yes, could you send me those materials so that I could follow your tutorial to practice using Maker?
>>>
>>> Thank you
>>> Best
>>> Qihua

From carsonhh at gmail.com Wed Apr 27 12:17:22 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 27 Apr 2016 12:17:22 -0600
Subject: [maker-devel] Maker example data for 2013 GMOD summer school
In-Reply-To: <8F27CEB4-B16B-4BDC-BA11-5FCCBD05BC3C@ucr.edu>
References: <1772AAA1-C6ED-4FCA-B4C9-39F522D3D076@genetics.utah.edu> <8F27CEB4-B16B-4BDC-BA11-5FCCBD05BC3C@ucr.edu>
Message-ID: <5ED1E884-9203-4409-8298-39F1D19C0CC0@gmail.com>

Use maker with MPI. MPI does not have to be on a cluster; it can be installed on a local computer or server (you probably already have it installed and don't realize it).
Instructions on how to set up MAKER with MPI are in the README and INSTALL files in the download.

Example command (on a single-machine 16-core server):

    mpiexec -n <N> maker
    mpiexec -n 16 maker

Run across multiple machines (ten 16-core servers):

    mpiexec -hostfile <hostfile> -n <N> maker
    mpiexec -hostfile ip_list -n 160 maker

The second option requires a network-mounted working directory accessible to all machines.

--Carson

> On Apr 27, 2016, at 12:06 PM, Qihua Liang wrote:
>
> Hi, Daniel
>
> I have been using Maker to annotate the cowpea genome for a while, but now I am wondering if I could use multiple threads instead of a single one? It has been running tblastx for such a long time using a single thread. But I couldn't find such settings in the documentation to assign multiple threads to run Maker. Is there such an option?
>
> Thank you
> Qihua
>
>> On Mar 30, 2016, at 2:17 PM, Daniel Ence wrote:
>>
>> Hi Qihua,
>>
>> I believe that most of the data we used in the tutorials are available in the maker/data directory, which is included in all maker distributions. Please let me know if that isn't the case.
>>
>> ~Daniel
>>
>> Daniel Ence
>> Graduate Student
>> Eccles Institute of Human Genetics
>> University of Utah
>> 15 North 2030 East, Room 2100
>> Salt Lake City, UT 84112-5330
>>
>>> On Mar 30, 2016, at 3:10 PM, Qihua Liang wrote:
>>>
>>> Hi Michael and Daniel,
>>>
>>> I am a graduate student at UC Riverside, and recently I have been learning to use Maker for genome annotation. I was trying to find some tutorials to follow and practice on example data, and I found out that you gave a talk on Maker during the 2013 GMOD summer school, and the tutorial for that is very detailed. Nice job!
>>>
>>> But the example data under the folder you mentioned as ./maker/maker_course is not provided on the website, and I am wondering if they are available to the public or not. If yes, could you send me those materials so that I could follow your tutorial to practice using Maker?
>>>
>>> Thank you
>>> Best
>>> Qihua

From hcma at uci.edu Wed Apr 27 19:04:29 2016
From: hcma at uci.edu (hcma)
Date: Wed, 27 Apr 2016 18:04:29 -0700
Subject: [maker-devel] Augustus training for new species
Message-ID: <4c7e0e58e9b55798bd255238f8ff9ae2@uci.edu>

Hi,

I would like to use Maker to generate a set for training Augustus for a new species. The steps for training SNAP are well documented, but I am still confused as to how to train Augustus using AugustusWeb.

I have used fathom and forge to generate 'export.ann' and 'export.dna'. So what I need to do next is run zff2augustus_gbk.pl in the directory that has the export.ann and export.dna files?

Then I feed the train.gb file to AugustusWeb?

Please advise.

Thanks
Karen

From xvazquezc at gmail.com Wed Apr 27 19:14:35 2016
From: xvazquezc at gmail.com (=?UTF-8?Q?Xabier_V=C3=A1zquez_Campos?=)
Date: Thu, 28 Apr 2016 11:14:35 +1000
Subject: [maker-devel] Augustus training for new species
In-Reply-To: <4c7e0e58e9b55798bd255238f8ff9ae2@uci.edu>
References: <4c7e0e58e9b55798bd255238f8ff9ae2@uci.edu>
Message-ID:

Is it a plant genome?

If it isn't, use BUSCO. It will do the whole training in a single step: it takes your assembly FASTA file and generates the species profile in the Augustus species folder. See the previous thread:
https://groups.google.com/forum/#!topic/maker-devel/vp8R06VVQGQ

If you have a plant genome, use "zff2augustus_gbk.pl". I have this in my files:

> This will take the export.dna generated by fathom and generate a *.gb file that will be used as the "training gene structure file" in a new training submission in WebAugustus, but remember to give it a new name in the submission, e.g.
> MYGENOME_v2, or Maker won't see the difference (same name)*:
>
>     perl PATH/TO/SCRIPT/zff2augustus_gbk.pl > MYGENOME.train.gb
>
> *this applies if you do a re-run of Augustus within Maker

On 28 April 2016 at 11:04, hcma wrote:

> Hi,
>
> I would like to use Maker to generate a set for training Augustus for a new species. The steps for training SNAP are well documented, but I am still confused as to how to train Augustus using AugustusWeb.
>
> I have used fathom and forge to generate 'export.ann' and 'export.dna'. So what I need to do next is run zff2augustus_gbk.pl in the directory that has the export.ann and export.dna files?
>
> Then I feed the train.gb file to AugustusWeb?
>
> Please advise.
>
> Thanks
> Karen
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at yandell-lab.org
> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org

--
Xabier Vázquez-Campos, PhD
Research Associate
Water Research Centre
School of Civil and Environmental Engineering
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

From xvazquezc at gmail.com Wed Apr 27 19:55:13 2016
From: xvazquezc at gmail.com (=?UTF-8?Q?Xabier_V=C3=A1zquez_Campos?=)
Date: Thu, 28 Apr 2016 11:55:13 +1000
Subject: [maker-devel] error with ipr_update_gff ?
Message-ID:

Hi,

I'm following the steps in the post-processing of annotations from the 2014 GMOD tutorial, but when using ipr_update_gff I get loads of errors such as those below:

> Use of uninitialized value $method in string eq at /share/apps/maker/2.31.6/bin/ipr_update_gff line 190, <$IN> line 228738.
> Use of uninitialized value $gene_id in hash element at /share/apps/maker/2.31.6/bin/ipr_update_gff line 203, <$IN> line 228738.

Is this normal?

Thanks,
Xabier

--
Xabier Vázquez-Campos, PhD
Research Associate
Water Research Centre
School of Civil and Environmental Engineering
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

From jacqueline.atkins at nih.gov Thu Apr 28 12:55:30 2016
From: jacqueline.atkins at nih.gov (Atkins, Jacqueline (NIH/NIAID) [C])
Date: Thu, 28 Apr 2016 18:55:30 +0000
Subject: [maker-devel] Segmenation Error
Message-ID:

Hi Everyone,

I have a user who is reporting a segmentation error. I am not really even sure where to start, and I am not sure whether this is related to config issues or to the way in which the software is being executed. Any advice would be greatly appreciated.

Here is the command:

    mpiexec -n 50 maker maker_opts_run1.ctl maker_bopts.ctl maker_exe.ctl

    --Next Contig--
    examining contents of the fasta file and run log
    examining contents of the fasta file and run log
    [ai-hpcn063:99111] *** Process received signal ***
    [ai-hpcn063:99111] Signal: Segmentation fault (11)
    [ai-hpcn063:99111] Signal code: Address not mapped (1)
    [ai-hpcn063:99111] Failing at address: (nil)
    examining contents of the fasta file and run log
    [ai-hpcn053:119610] *** Process received signal ***
    [ai-hpcn053:119610] Signal: Segmentation fault (11)
    [ai-hpcn053:119610] Signal code: Address not mapped (1)
    [ai-hpcn053:119610] Failing at address: (nil)
    [ai-hpcn053:119610] [ 0] /lib64/libc.so.6(+0x35a00)[0x2aaaab85ca00]
    [ai-hpcn053:119610] *** End of error message ***
    examining contents of the fasta file and run log
    examining contents of the fasta file and run log
    examining contents of the fasta file and run log
    examining contents of the fasta file and run log
    examining contents of the fasta file and run log
    [ai-hpcn063:99111] [ 0] /lib64/libc.so.6(+0x35a00)[0x2aaaab85ca00]
    [ai-hpcn063:99111] *** End of error message ***
    examining contents of the fasta file and run log
    examining contents of the fasta file and run log
    examining contents of the fasta file and run log
    examining contents of the fasta file and run log
    examining contents of the fasta file and run log
    examining contents of the fasta file and run log
    examining contents of the fasta file and run log
    examining contents of the fasta file and run log
    examining contents of the fasta file and run log
    examining contents of the fasta file and run log
    examining contents of the fasta file and run log
    examining contents of the fasta file and run log
    examining contents of the fasta file and run log

___________________________________________
Jacqueline Atkins, Contractor
Sr. HPC Engineer
National Institute of Allergy and Infectious Diseases
SRA International Inc., A CSRA Company
office 301-451-9644, mobile 301-767-7110
5601 Fishers Lane, 6A60, Bethesda, MD 20852

Disclaimer: The information in this e-mail and any of its attachments is confidential and may contain sensitive information. It should not be used by anyone who is not the original intended recipient. If you have received this e-mail in error please inform the sender and delete it from your mailbox or any other storage devices. National Institute of Allergy and Infectious Diseases shall not accept liability for any statements made that are sender's own and not expressly made on behalf of the NIAID by one of its representatives.

From maker-devel at yandell-lab.org Fri Apr 29 12:54:07 2016
From: maker-devel at yandell-lab.org (maker-devel)
Date: Sat, 30 Apr 2016 00:24:07 +0530
Subject: [maker-devel] hi prnt
Message-ID:

A non-text attachment was scrubbed...
Name: not available
Type: multipart/alternative
Size: 1 bytes
Desc: not available

From simon.blanchoud at otago.ac.nz Tue Apr 5 18:15:14 2016
From: simon.blanchoud at otago.ac.nz (Simon Blanchoud)
Date: Wed, 06 Apr 2016 00:15:14 -0000
Subject: [maker-devel] ncRNA predictions
Message-ID: <5704550C.8010602@otago.ac.nz>

Hi all,

I have been annotating ab initio my de novo assembly of the Botrylloides leachi genome with MAKER 2.31.8 for some time now (3rd round running as I write). For this last round, I also wanted to get some predictions for non-coding RNAs, as mentioned in maker_opts.ctl. Now that this (seems to) work properly, I thought I should share with you a few issues I faced.

First of all, both tRNAscan-SE and snoscan have really, really limited documentation (which I know is none of your business), which makes things a bit trickier.

Second, snoscan requires an rRNA file to work (not very obvious from maker_opts.ctl), and it turns out that there is a hard-coded limit in snoscan of 100 sequences for that rRNA file (not that the error message is helpful either). Overall, this was not exactly practical, as I'm assembling a de novo genome and thus do not have these rRNA sequences. What I did (and it seems to work okay) was to pull out the closest sequences I could find from the Rfam database. By combining the information from their website on the RF families, the taxonomy.txt file and the corresponding FASTA files (all from their FTP site), I extracted (for a eukaryotic organism, that is) one complete sequence for each subunit, i.e. RF00001, RF00002, RF01960 and RF02543. It turns out that pooling more than just one makes it extremely slow to run. You might know a better approach for getting such an rRNA file, but it does look like a pretty sound approach to me, and might deserve a comment in maker_opts.ctl.

Third, once snoscan was running, I ran into the same issue as https://groups.google.com/d/topic/maker-devel/E6BKjXx2ra0/discussion, i.e. the parsing of the snoscan output crashed.
After (quite) some debugging, I found out that there is an issue in the creation of the hash table containing the hits. As I am not sure how you wanted to organize them originally, I made a wild guess and rewrote this section of the Widget. So it might not group the hits as you wanted, but at least it now runs properly (and the output appears quite correct to me). I've attached the Widget.

Otherwise, thanks heaps for all the hard work, it's an amazing tool and it does work great!

Cheers,
Simon
-------------- next part --------------
A non-text attachment was scrubbed...
Name: snoscan.pm
Type: text/x-perl-script
Size: 8128 bytes
Desc: not available
URL:

From wangyugui.wei at gmail.com Sat Apr 9 09:35:22 2016
From: wangyugui.wei at gmail.com (Yugui Wang)
Date: Sat, 09 Apr 2016 15:35:22 -0000
Subject: [maker-devel] Segmentation fault of MKAER with openmpi on CentOS 7.2
Message-ID:

Hi.

MAKER segfaults with openmpi on CentOS 7.2. Both MAKER 2.31.8 and 3.00.0 beta have the same error.

$ mpirun -mca btl ^openib -n 4 maker
STATUS: Parsing control files...
STATUS: Processing and indexing input FASTA files...
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 39507 on node T620 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

$ file core.39505
core.39505: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/bin/perl /bio/hpc-bio/maker-3.00.0/bin/make

$ gdb /usr/bin/perl core.39505
(gdb) where
#0  0x00007f0e4a7d2060 in ?? ()
#1
#2  0x00007f0e4a7d2060 in ?? ()
#3
#4  0x00007f0e4bdfba50 in mca_btl_vader_component_progress () from /usr/lib64/openmpi/lib/openmpi/mca_btl_vader.so
#5  0x00007f0e63ec8eda in opal_progress () from /usr/lib64/openmpi/lib/libopen-pal.so.13
#6  0x00007f0e4a191ac5 in mca_pml_ob1_probe () from /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so
#7  0x00007f0e65b0dc06 in PMPI_Probe () from /usr/lib64/openmpi/lib/libmpi.so
#8  0x00007f0e59007020 in C_MPI_Recv (buf=buf@entry=0x4146b30, source=source@entry=-1, tag=tag@entry=1111) at MPI.xs:56
#9  0x00007f0e590071e3 in XS_Parallel__Application__MPI_C_MPI_Recv (my_perl=, cv=) at MPI.c:391
#10 0x00007f0e657ce39f in Perl_pp_entersub () from /usr/lib64/perl5/CORE/libperl.so
#11 0x00007f0e657c6b16 in Perl_runops_standard () from /usr/lib64/perl5/CORE/libperl.so
#12 0x00007f0e65763925 in perl_run () from /usr/lib64/perl5/CORE/libperl.so
#13 0x0000000000400d99 in main ()

$ echo $LD_PRELOAD
/usr/lib64/openmpi/lib/libmpi.so:

$ echo $OMPI_MCA_mpi_warn_on_fork
0

$ rpm -qa openmpi
openmpi-1.10.0-10.el7.x86_64

$ uname -a
Linux T620 3.10.0-327.13.1.el7.x86_64 #1 SMP Thu Mar 31 16:04:38 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

$ ulimit -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 1029973
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 102400
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4096
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

$ mpiexec --version
mpiexec (OpenRTE) 1.10.0

Report bugs to http://www.open-mpi.org/community/help/
$

From h.lee12 at uq.edu.au Tue Apr 12 21:05:12 2016
From: h.lee12 at uq.edu.au (Jenny Lee)
Date: Wed, 13 Apr 2016 03:05:12 -0000
Subject: [maker-devel] Reformat maker gff3
Message-ID: <1460516670248.1644@uq.edu.au>

Hi all,

I would like to
update my maker gff3 file to contain only the genes I've decided to keep: all maker genes, plus a subset of ab initio genes (those which have InterProScan hits). I would also like to exclude the repeats information and retain only the CDS, gene, exon and mRNA features, like the format we usually see in published data. I've been trying to do this manually and it gets messy. Any ideas?

Thanks a lot.

Regards,
Jenny Lee

From mcampbel at cshl.edu Wed Apr 20 07:16:43 2016
From: mcampbel at cshl.edu (Campbell, Michael)
Date: Wed, 20 Apr 2016 13:16:43 -0000
Subject: [maker-devel] A way to compare 2 annotation runs?
In-Reply-To:
References: <57161FB2.30901@students.uni-mainz.de> <148F7C5C-38D9-4F2C-BA0C-010E133B7F34@gmail.com>
Message-ID:

I suspect the Jaccard distance would let you see the annotation sets converging over iterations. The distance between run one and run three should be greater than the distance between run one and two or run two and three.

MAKER calculates a modified Jaccard distance between the MAKER-generated gene models and the aligned evidence, called Annotation Edit Distance (AED). Comparing the distribution of AEDs between annotations is a way to tell which annotation set matches the evidence best. As a rule of thumb, an annotation set is pretty good if more than ~95% of the annotations have an AED less than 0.5. There is an accessory script in the MAKER bin called AED_cdf_generator.pl that helps in comparing AED scores. This script is mentioned in the protocols paper Carson mentioned. The paper also describes using protein family domains and homology to manually curated proteins in SwissProt as quality metrics. Here is a link to the paper. Let me know if you need me to send you a PDF.

http://onlinelibrary.wiley.com/doi/10.1002/0471250953.bi0411s48/abstract

I also have a "use at your own risk" script on GitHub that I use to compare MAKER runs two at a time.
The script is called compare_annotations_3.2.pl. This particular script has had a long evolution, so the code is a little hard to follow, but it might be helpful.
https://github.com/mscampbell/Genome_annotation

The SOBA tool that Barry mentioned is a lot more flexible, and if you are familiar with Perl the GAL library does a lot of the heavy lifting for you.

Mike

On Apr 19, 2016, at 5:44 PM, Cook, Malcolm wrote:

Just a quick thought.

The smallest summary of what you're after might be the Jaccard difference between your annotations as computed by bedtools:
http://bedtools.readthedocs.org/en/latest/content/tools/jaccard.html

From: maker-devel [mailto:maker-devel-bounces at yandell-lab.org] On Behalf Of Barry Moore
Sent: Tuesday, April 19, 2016 4:37 PM
To: Florian; maker-devel
Cc: Campbell, Michael
Subject: Re: [maker-devel] A way to compare 2 annotation runs?

The Sequence Ontology provides some tools for this:

SOBAcl has some pre-configured reports/graphs with some flexibility to modify their content/layout.
https://github.com/The-Sequence-Ontology/SOBA

This simple example provides a table of feature-type counts for two GFF3 files:

    SOBAcl --columns file --rows type --data type --data_type count \
        data/dmel-all-r5.30_0001000.gff data/dmel-all-r5.30_0010000.gff

More complex examples are available in the test file SOBA/t/sobacl_test.sh

The GAL library is a Perl library that works well with MAKER output and other valid GFF3 documents.
It has some scripts that would provide metrics along the lines of what you're looking for, but it is primarily a programming library that makes it easy to roll your own.
https://github.com/The-Sequence-Ontology/GAL

If you're OK with a little bit of Perl code, then by modifying the synopsis code in the README a bit, the splice complexity metrics described here (http://www.ncbi.nlm.nih.gov/pubmed/19236712) are easy to produce:

    use strict;
    use warnings;
    use GAL::Annotation;

    # Load the GFF3 annotation together with its genome FASTA.
    my $annot    = GAL::Annotation->new(qw(file.gff file.fasta));
    my $features = $annot->features;

    # Print the ID and splice complexity of every gene feature.
    my $genes = $features->search( {type => 'gene'} );
    while (my $gene = $genes->next) {
        print $gene->feature_id . "\t";
        print $gene->splice_complexity . "\n";
    }

Hope that helps,

Barry

On Apr 19, 2016, at 9:08 AM, Carson Holt wrote:

I'm going to ask Michael Campbell to answer this. He wrote a protocols paper that will help.

--Carson

On Apr 19, 2016, at 6:08 AM, Florian wrote:

Hello All,

We ran MAKER on a newly assembled genome for 3 iterations: 2 seems to be the recommended standard, and while on holiday I just ran it a third time. Now I want to compare the results of the iterations to see where the annotation (hopefully) improved/changed, but I can't really come up with a clever way to do this.

I reckon this has to be an often-solved problem, though I couldn't find a solution except an older entry in this mailing list, and that wasn't helpful.

So how are people assessing the quality of a MAKER run? How do you say one run was 'better' than another?
best regards & thanks for your input,
Florian

_______________________________________________
maker-devel mailing list
maker-devel at yandell-lab.org
http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org

From mcampbel at cshl.edu Mon Apr 25 08:16:42 2016
From: mcampbel at cshl.edu (Campbell, Michael)
Date: Mon, 25 Apr 2016 14:16:42 -0000
Subject: [maker-devel] A way to compare 2 annotation runs?
In-Reply-To: <571E05B8.5080508@students.uni-mainz.de>
References: <57161FB2.30901@students.uni-mainz.de> <148F7C5C-38D9-4F2C-BA0C-010E133B7F34@gmail.com> <571E05B8.5080508@students.uni-mainz.de>
Message-ID: <3F81B34E-B7FC-4E37-AFA2-514AB2A397F1@cshl.edu>

Hi Florian,

You're not off topic here. I've attached the paper.

Looking at the plot you sent, I'm guessing that there is a red dot right underneath the turquoise dot at (1,1); that would be consistent with the compare annotation script output. Do you have keep_preds=1 set in the maker_opts.ctl file? If so, that would explain the abundance of AED=1 annotations. When keep_preds is set to 1, all of the gene predictions are reported as gene models; when keep_preds is set to 0, only the models with evidence support are reported. Also, how many genes are you expecting and how many are you getting?

The paper I attached goes over different approaches to building final gene sets. The plot attached suggests to me that you have a bunch of unsupported gene models that need to be cleaned out. I will commonly filter out any gene model with an AED of 1 unless it has a protein family domain.
This will almost certainly bring the fraction of annotated gene models with an AED <0.5 up to around 90% or more.

As annotations improve you do usually see fewer total genes, but they are longer.

One of the best ways to get a feel for annotation quality is to load the annotations into a browser like Apollo or JBrowse and look at a few of your favorite genes.

Thanks,
Mike

On Apr 25, 2016, at 7:55 AM, Florian wrote:

Hello All,

First off, thank you all for your input! I took a look at all your suggestions and have some questions:

The SOBAcl tool is nice, but I can't seem to find a way to get at the AED values MAKER produces. For example, here is a line from my GFF file:

    scaffold2278_size3634 maker mRNA 124 2128 . - . ID=CRIP_012390-RA;Parent=CRIP_012390;Name=CRIP_012390-RA;Alias=maker-scaffold2278_size3634-augustus-gene-0.3-mRNA-1;_AED=0.16;_QI=0|1|0.33|1|1|1|3|0|574;_eAED=0.16;Note=Similar to Tbc1d25: TBC1 domain family member 25 (Mus musculus);

Notice the _AED entry is in the 9th "field", combined with all the other descriptive data. Is there a way to get to this? The information about the number and mean/distribution of gene lengths, while certainly valuable, is hard for me to interpret. How would one classify improvement? More genes annotated? Fewer genes but longer averages?

For the moment I will take a look at GAL, though Perl is not my strongest language.

For the scripts Michael provided I have attached the results. It would be great if you could send me a PDF version of the paper you mentioned.

The comparison script lists SN/SP/AC with >98%, which indicates there should be no big changes between annotations, right? But the cumulative AED graph shows that a LOT of entries have an AED value of 1, which would indicate the exact opposite?

You said 95% with less than 0.5 AED would be pretty good, so only ~55% would mean this is a pretty bad annotation?

I am not sure if this is maybe too far off topic for the MAKER mailing list, but thank you for any clarification / input.
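A minimal sketch of one way to get at the _AED value, assuming the GFF3 file is tab-delimited and that column 9 holds semicolon-separated key=value pairs as in the example line above. It is plain Python rather than GAL or SOBA, and the function names (parse_attributes, mrna_aed) are made up for illustration; they are not part of MAKER or any of the tools mentioned in this thread.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute string (column 9) into a dict of key -> value."""
    attrs = {}
    for pair in column9.strip().strip(";").split(";"):
        if "=" in pair:
            # Split on the first "=" only, so values may contain "=".
            key, value = pair.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def mrna_aed(gff_line):
    """Return the _AED of an mRNA line as a float, or None for any other line."""
    fields = gff_line.rstrip("\n").split("\t")
    # Column 3 (index 2) is the feature type; column 9 (index 8) the attributes.
    if len(fields) != 9 or fields[2] != "mRNA":
        return None
    aed = parse_attributes(fields[8]).get("_AED")
    return float(aed) if aed is not None else None


# The example line from the message, rebuilt with tabs (attributes shortened).
example = "\t".join([
    "scaffold2278_size3634", "maker", "mRNA", "124", "2128", ".", "-", ".",
    "ID=CRIP_012390-RA;Parent=CRIP_012390;Name=CRIP_012390-RA;"
    "_AED=0.16;_QI=0|1|0.33|1|1|1|3|0|574;_eAED=0.16;",
])
print(mrna_aed(example))
```

Looking up the exact key _AED keeps the value separate from _eAED, which sits in the same column.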
kind regards,
Florian

-------------- next part --------------
A non-text attachment was scrubbed...
Name: bi0411 (1).pdf
Type: application/pdf
Size: 484329 bytes
Desc: bi0411 (1).pdf
URL:

From ian.misner at nih.gov Mon Apr 25 10:20:44 2016
From: ian.misner at nih.gov (Misner, Ian (NIH/NIAID) [C])
Date: Mon, 25 Apr 2016 16:20:44 -0000
Subject: [maker-devel] BUSCO
Message-ID:

Hello,

Are there any guidelines for using BUSCO to help train MAKER? CEGMA has been discontinued but I used to use the cegma2zff.pl steps to use those proteins as a training step. BUSCO seems to train Augustus but I'm not sure what file to pass from BUSCO to MAKER for this to be properly utilized. I didn't see anything specific about this in the archives.

-----
Ian Misner, Ph.D.
Computational Genomics Specialist
Contractor, Medical Science and Computing, Inc.
Bioinformatics and Computational Biosciences Branch (BCBB)
NIH/NIAID/OD/OSMO/OCICB
5601 Fishers Lane, Room 4A59
Rockville, MD 20892
Office: 301-761-6208
Mobile: 301-704-0151
Email: ian.misner at nih.gov
Web: BCBB Home Page
Twitter: @NIAIDBioIT

From mcampbel at cshl.edu Mon Apr 25 11:29:46 2016
From: mcampbel at cshl.edu (Campbell, Michael)
Date: Mon, 25 Apr 2016 17:29:46 -0000
Subject: [maker-devel] A way to compare 2 annotation runs?
In-Reply-To: <1B00E9C3-490A-4C06-A188-1F9EBC02F680@students.uni-mainz.de>
References: <57161FB2.30901@students.uni-mainz.de> <148F7C5C-38D9-4F2C-BA0C-010E133B7F34@gmail.com> <571E05B8.5080508@students.uni-mainz.de> <3F81B34E-B7FC-4E37-AFA2-514AB2A397F1@cshl.edu> <571E3256.90705@students.uni-mainz.de> <8287C01C-93C2-4BCA-9483-4EEE0E584ACD@gmail.com> <1B00E9C3-490A-4C06-A188-1F9EBC02F680@students.uni-mainz.de>
Message-ID:

Hi Florian,

I just looked at the code for the AED_cdf_generator.pl script, and it is probably grabbing the AEDs off of the raw gene predictions. I'll modify the code so it only grabs the mRNA lines. In the meantime you can use gff3_merge with the -g flag and it will output the MAKER genes only.
The command would look like this:

    gff3_merge -g all.gff -o genes_only.gff

Mike

> On Apr 25, 2016, at 1:00 PM, Dolze, Florian wrote:
>
> This might be the case; I simply used the script on my complete output GFF with all features in it, without manually filtering for only mRNA.
>
> On a side note regarding keep_preds: if I wanted to call genes somewhat less stringently because I am expecting to find more, would I set this to e.g. 0.5 to increase the number of genes called?
>
> -Florian
>
>> On 25.04.2016 at 17:30, Carson Holt wrote:
>>
>> If you're running with keep_preds=0, then you either passed in models with model_gff (always kept even without evidence support), or you are parsing out the AED of non-gene reference models from the GFF3 when building your CDF graph. If that is the case, make sure you only pull AED off of features labeled as mRNA in column 3 of the GFF3 and not match features.
>>
>> --Carson
>>
>>> On Apr 25, 2016, at 9:05 AM, Florian wrote:
>>>
>>> Hi Mike,
>>>
>>> We have run MAKER with keep_preds=0. For completeness I attached the options file we used. We used a SNAP model trained on CEGMA data, plus GeneMark and Augustus trained with their web service, for the first run and then iterated on the results.
>>>
>>> We expect around 17,000-18,000 genes, but our annotation contains ~12.5k according to SOBAcl. If I remove the ~40% with AED values of 1, I will be left with very few compared to the expected number.
>>>
>>> type X file type (count)
>>> =========================================================================================================
>>> |                        |../v2_second_round_functional_blast.gff|../v2_third_round_functional_blast.gff|
>>> =========================================================================================================
>>> |CDS                     |                                  63953|                                 65160|
>>> +------------------------+---------------------------------------+--------------------------------------+
>>> |contig                  |                                   5292|                                  5292|
>>> +------------------------+---------------------------------------+--------------------------------------+
>>> |exon                    |                                  60381|                                 61233|
>>> +------------------------+---------------------------------------+--------------------------------------+
>>> |expressed_sequence_match|                                 275160|                                275160|
>>> +------------------------+---------------------------------------+--------------------------------------+
>>> |five_prime_UTR          |                                   9424|                                  8764|
>>> +------------------------+---------------------------------------+--------------------------------------+
>>> |gene                    |                                  12654|                                 12235|
>>> +------------------------+---------------------------------------+--------------------------------------+
>>> |mRNA                    |                                  13698|                                 13137|
>>> +------------------------+---------------------------------------+--------------------------------------+
>>> |match                   |                                 146111|                                136852|
>>> +------------------------+---------------------------------------+--------------------------------------+
>>> |match_part              |                                1704978|                               1697601|
>>> +------------------------+---------------------------------------+--------------------------------------+
>>> |protein_match           |                                 421814|                                421814|
>>> +------------------------+---------------------------------------+--------------------------------------+
>>> |three_prime_UTR         |                                   6894|                                  6325|
>>> ---------------------------------------------------------------------------------------------------------
>>>
>>> regards,
>>> Florian

From mcampbel at cshl.edu Mon Apr 25 11:43:50 2016
From: mcampbel at cshl.edu (Campbell, Michael)
Date: Mon, 25 Apr 2016 17:43:50 -0000
Subject: [maker-devel] A way to compare 2 annotation runs?
In-Reply-To:
References: <57161FB2.30901@students.uni-mainz.de> <148F7C5C-38D9-4F2C-BA0C-010E133B7F34@gmail.com> <571E05B8.5080508@students.uni-mainz.de> <3F81B34E-B7FC-4E37-AFA2-514AB2A397F1@cshl.edu> <571E3256.90705@students.uni-mainz.de> <8287C01C-93C2-4BCA-9483-4EEE0E584ACD@gmail.com> <1B00E9C3-490A-4C06-A188-1F9EBC02F680@students.uni-mainz.de>
Message-ID: <23F95F61-E0DD-4F55-B3F0-499FC725627D@cshl.edu>

I updated the AED_cdf_generator.pl script on github so it only looks at mRNA lines. The only time that it would get AEDs from the gene predictions is if pred_stats was set to 1. Was pred_stats=1 set in the maker_opts.ctl file?
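As a rough illustration of what restricting the cumulative AED plot to mRNA lines means, here is a minimal Python sketch. It is not the code of AED_cdf_generator.pl: the function names, the default 0.1 step, the use of "AED <= threshold" and the demo lines are all assumptions made for the example.

```python
import re

# Matches the _AED attribute only; _eAED does not contain the text "_AED=".
AED_RE = re.compile(r"(?:^|;)_AED=([0-9.]+)")


def mrna_aeds(gff_lines):
    """Collect _AED values from mRNA features only, skipping everything else."""
    aeds = []
    for line in gff_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 3 (index 2) is the feature type in GFF3.
        if len(fields) == 9 and fields[2] == "mRNA":
            found = AED_RE.search(fields[8])
            if found:
                aeds.append(float(found.group(1)))
    return aeds


def aed_cdf(aeds, step=0.1):
    """Return (threshold, fraction of AEDs <= threshold) pairs from 0 to 1."""
    if not aeds:
        return []
    cdf = []
    for i in range(round(1 / step) + 1):
        threshold = round(i * step, 10)
        fraction = sum(1 for a in aeds if a <= threshold) / len(aeds)
        cdf.append((threshold, fraction))
    return cdf


# Invented demo lines: two mRNA features and one raw prediction (match).
demo = [
    "##gff-version 3\n",
    "s1\tmaker\tmRNA\t1\t100\t.\t+\t.\tID=m1;Parent=g1;_AED=0.16;_eAED=0.20;\n",
    "s1\taugustus_masked\tmatch\t1\t100\t.\t+\t.\tID=p1;_AED=1.00;\n",
    "s1\tmaker\tmRNA\t200\t300\t.\t-\t.\tID=m2;Parent=g2;_AED=1.00;\n",
]
for threshold, fraction in aed_cdf(mrna_aeds(demo), step=0.5):
    print(threshold, fraction)
```

In this toy input the match line carries an AED of 1 but is ignored, so only the two mRNA features contribute to the fractions.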
Thanks,
Mike
>>>>> https://github.com/The-Sequence-Ontology/SOBA
>>>>>
>>>>> This simple example provides a table for two GFF3 files of the count of feature types:
>>>>>
>>>>> SOBAcl --columns file --rows type --data type --data_type count \
>>>>>   data/dmel-all-r5.30_0001000.gff data/dmel-all-r5.30_0010000.gff
>>>>>
>>>>> More complex examples are available in the test file SOBA/t/sobacl_test.sh
>>>>>
>>>>> The GAL library is a perl library that works well with MAKER output and other valid GFF3 documents. It has some scripts that would provide metrics along the lines of what you're looking for, but is primarily a programming library to make it easy to roll your own.
>>>>> https://github.com/The-Sequence-Ontology/GAL
>>>>>
>>>>> If you're OK with a little bit of perl code, then by modifying the synopsis code in the README a bit, the splice complexity metrics described here (http://www.ncbi.nlm.nih.gov/pubmed/19236712) are easy to produce:
>>>>>
>>>>> use GAL::Annotation;
>>>>>
>>>>> # Load the annotation (GFF3) together with its genome sequence (FASTA)
>>>>> my $annot = GAL::Annotation->new(qw(file.gff file.fasta));
>>>>> my $features = $annot->features;
>>>>>
>>>>> # Print the ID and splice complexity of every gene
>>>>> my $genes = $features->search( {type => 'gene'} );
>>>>> while (my $gene = $genes->next) {
>>>>>     print $gene->feature_id . "\t";
>>>>>     print $gene->splice_complexity . "\n";
>>>>> }
>>>>>
>>>>> Hope that helps,
>>>>>
>>>>> Barry
>>>>>
>>>>>
>>>>> On Apr 19, 2016, at 9:08 AM, Carson Holt wrote:
>>>>>
>>>>> I'm going to ask Michael Campbell to answer this. He wrote a protocols paper that will help.
>>>>>
>>>>> --Carson
>>>>>
>>>>>
>>>>> On Apr 19, 2016, at 6:08 AM, Florian wrote:
>>>>>
>>>>> Hello All,
>>>>>
>>>>> We ran MAKER on a newly assembled genome for 3 iterations, since 2 seems to be the recommended standard and while on holiday I just ran it a third time.
Now I want to compare the results of the iterations to see where the annotation (hopefully) improved/changed but I cant really come up with a clever way to this. >>>>> >>>>> I reckon this has to be an often solved problem though I couldnt find a solution except an older entry in this mail-list but that wasnt helpful. >>>>> >>>>> >>>>> So how are people assessing quality of a maker run? How do you say one run was 'better' than another? >>>>> >>>>> >>>>> best regards & thanks for your input, >>>>> Florian >>>>> >>>>> _______________________________________________ >>>>> maker-devel mailing list >>>>> maker-devel at yandell-lab.org >>>>> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org >>>>> >>>>> >>>>> _______________________________________________ >>>>> maker-devel mailing list >>>>> maker-devel at yandell-lab.org >>>>> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org >>>>> >>>>> _______________________________________________ >>>>> maker-devel mailing list >>>>> maker-devel at yandell-lab.org >>>>> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org >>>>> >>>>> >>>>> >>>>> >>>> >>>> _______________________________________________ >>>> maker-devel mailing list >>>> maker-devel at yandell-lab.org >>>> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org >>> > From mcampbel at cshl.edu Tue Apr 26 08:48:10 2016 From: mcampbel at cshl.edu (Campbell, Michael) Date: Tue, 26 Apr 2016 14:48:10 -0000 Subject: [maker-devel] A way to compare 2 annotation runs? In-Reply-To: <571F4E29.9080103@students.uni-mainz.de> References: <57161FB2.30901@students.uni-mainz.de> <148F7C5C-38D9-4F2C-BA0C-010E133B7F34@gmail.com> <571E05B8.5080508@students.uni-mainz.de> <349A414A-BA65-420E-9A39-5B3583993AB9@genetics.utah.edu> <571F4E29.9080103@students.uni-mainz.de> Message-ID: <7D300E49-AEF6-424B-912D-78F9551A14B8@cshl.edu> Glad to hear it. 
Good luck,
Mike

On Apr 26, 2016, at 7:16 AM, Florian wrote:

Hello all,

With the updated scripts things look much better. I get 95% of the mRNA features with <= 0.5 AED now and SOBAcl gave me a mean AED value of 0.17 / 0.16 for run 2/3. I think that's an OK result for a newly assembled genome?

Thank you all for the great help,
Florian

On 25.04.2016 21:46, Barry Moore wrote:

Hi Florian,

Something like this should work:

SOBAcl --data +_AED t/data/refseq_short.gff3 --data_type mean

Sorry, this feature was undocumented and I discovered a bug in it while I was looking at it just now, so you'll need to pull an update from git for it to work correctly. Basically, if you add a "+" to the value passed to --data, SOBAcl will treat the --data value (with the + removed) as a key to look up the value in the attributes from column 9, so if +_AED is given on the command line then the value of the _AED attribute will be used for the summary statistics. Note that if the attribute is missing for a given feature then 0 is used as the value (which is of course different than treating it as NULL). Also note that if --data_type is count then features that have the given attribute are counted regardless of the value of the attribute.

Just FYI, grabbing those values with a GAL script would look like this (untested):

use GAL::Annotation;

# Load the annotation (GFF3)
my $annot = GAL::Annotation->new(qw(file.gff));
my $features = $annot->features;

# Print the ID and _AED attribute of every mRNA
my $mRNAs = $features->search( {type => 'mRNA'} );
while (my $mRNA = $mRNAs->next) {
    print $mRNA->feature_id;
    print "\t";
    print $mRNA->attribute_value('_AED');
    print "\n";
}

B

On Apr 25, 2016, at 5:55 AM, Florian <fdolze at students.uni-mainz.de> wrote:

Hello All,

First off, thank you all for your input! I took a look at all your suggestions and have some questions:

The SOBAcl tool is nice but I can't seem to find a way to get to the AED values MAKER produces. For example here is a line from my GFF file:

scaffold2278_size3634 maker mRNA 124 2128 . - .
ID=CRIP_012390-RA;Parent=CRIP_012390;Name=CRIP_012390-RA;Alias=maker-scaffold2278_size3634-augustus-gene-0.3-mRNA-1;_AED=0.16;_QI=0|1|0.33|1|1|1|3|0|574;_eAED=0.16;Note=Similar to Tbc1d25: TBC1 domain family member 25 (Mus musculus); Notice the _AED entry is in the 9th "field" combined with all the other descriptive data. Is there a way to get to this? The information about number and mean/distribution of length of genes, while certainly valuable, is hard to interpret for me. How would one classify improvement? More genes annotated? Less genes but longer averages? For the moment I will take a look at GAL, though perl is not my strongest language. For the scripts Michael provided I have attached the results. It would be great if you could send me a pdf version of the paper you mentioned. The comparison script lists SN/SP/AC with >98% which indicates there should be no big changes between annotations right? But the cumulative AED graph shows a LOT entries have an AED value of 1 which would indicate the exact opposite? You said 95% with less than 0.5 AED would be pretty good, soo only ~55% would mean this is a pretty bad annotation? I am not sure if this is maybe to far off topic for the maker mailing list, but thank you for any clarification / input. kind regards, Florian On 20.04.2016 15:16, Campbell, Michael wrote: I suspect the Jaccard distance would let you see the annotation sets converging over iterations. The distance between run one and run three should be greater than the distance between run one and two or run two and three. MAKER calculates a modified Jaccard distance between the MAKER generated gene models and the aligned evidence called Annotation Edit Distance or AED. Comparing the distribution of AEDs between annotations is a way to tell which annotation set matches the evidence the best. As a rule of thumb an annotation set is pretty good if greater than ~95% of the annotations have an AED less than 0.5. 
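A minimal sketch of that rule of thumb in Python, assuming a MAKER GFF3 file in which each mRNA carries an `_AED` attribute in column 9 as in the line quoted above. The helper names (`parse_attributes`, `aed_values`, `fraction_below`) and the second example record are hypothetical, and this is not a replacement for AED_cdf_generator.pl or SOBAcl.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict of key -> value."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def aed_values(gff_lines, feature_type="mRNA"):
    """Collect the _AED attribute of every feature of the given type."""
    values = []
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != feature_type:
            continue  # skip other feature types and embedded FASTA lines
        attrs = parse_attributes(cols[8])
        if "_AED" in attrs:
            values.append(float(attrs["_AED"]))
    return values


def fraction_below(values, cutoff=0.5):
    """Fraction of AED values strictly below the cutoff (0.0 if empty)."""
    if not values:
        return 0.0
    return sum(v < cutoff for v in values) / len(values)


# The mRNA line quoted in this thread (attributes shortened), plus one
# made-up unsupported model with AED=1 for illustration.
example = [
    "scaffold2278_size3634\tmaker\tmRNA\t124\t2128\t.\t-\t.\t"
    "ID=CRIP_012390-RA;Parent=CRIP_012390;_AED=0.16;_eAED=0.16",
    "scaffold1\tmaker\tmRNA\t10\t500\t.\t+\t.\tID=hypothetical-RA;_AED=1.00",
]
aeds = aed_values(example)
print(aeds, fraction_below(aeds))
```

For a real run, the same functions could be fed an open file handle, for example `aed_values(open("run2.gff"))`, where the file name is a placeholder.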
There is an accessory script in the MAKER bin called AED_cdf_generator.pl that helps in comparing AED scores. This script is mentioned in the protocols paper Carson mentioned. This paper also describes using protein family domains and homology to manually curated proteins in swissprot as quality metrics. Here is a link to the paper. Let me know if you need me to send you a pdf. http://onlinelibrary.wiley.com/doi/10.1002/0471250953.bi0411s48/abstract I also have a "use at your own risk" script on github that I use to compare MAKER runs two at a time. the script is called compare_annotations_3.2.pl. This particular script has had a long evolution, so it is a little hard to follow the code, but it might be helpful. https://github.com/mscampbell/Genome_annotation The SOBA tool that Barry mentioned is a lot more flexible and if you are familiar with perl the GAL library does a lot of heavy lifting for you. Mike On Apr 19, 2016, at 5:44 PM, Cook, Malcolm > wrote: Just a quick thought The smallest summary of what you?re after might be the jaccard difference between you annotation as computed by bedtoolshttp://bedtools.readthedocs.org/en/latest/content/tools/jaccard.html ?? From: maker-devel [mailto:maker-devel-bounces at yandell-lab.org] On Behalf Of Barry Moore Sent: Tuesday, April 19, 2016 4:37 PM To: Florian >; maker-devel > Cc: Campbell, Michael > Subject: Re: [maker-devel] A way to compare 2 annotation runs? The Sequence Ontology provides some tools for this: SOBAcl has some pre-configured reports/graphs with some flexibility to modify their content/layout. 
https://github.com/The-Sequence-Ontology/SOBA

This simple example provides a table for two GFF3 files of the count of feature types:

SOBAcl --columns file --rows type --data type --data_type count \
  data/dmel-all-r5.30_0001000.gff data/dmel-all-r5.30_0010000.gff

More complex examples are available in the test file SOBA/t/sobacl_test.sh

The GAL library is a perl library that works well with MAKER output and other valid GFF3 documents. It has some scripts that would provide metrics along the lines of what you're looking for, but is primarily a programming library to make it easy to roll your own.
https://github.com/The-Sequence-Ontology/GAL

If you're OK with a little bit of perl code, then by modifying the synopsis code in the README a bit, the splice complexity metrics described here (http://www.ncbi.nlm.nih.gov/pubmed/19236712) are easy to produce:

use GAL::Annotation;

# Load the annotation (GFF3) together with its genome sequence (FASTA)
my $annot = GAL::Annotation->new(qw(file.gff file.fasta));
my $features = $annot->features;

# Print the ID and splice complexity of every gene
my $genes = $features->search( {type => 'gene'} );
while (my $gene = $genes->next) {
    print $gene->feature_id . "\t";
    print $gene->splice_complexity . "\n";
}

Hope that helps,

Barry

On Apr 19, 2016, at 9:08 AM, Carson Holt wrote:

I'm going to ask Michael Campbell to answer this. He wrote a protocols paper that will help.

--Carson

On Apr 19, 2016, at 6:08 AM, Florian wrote:

Hello All,

We ran MAKER on a newly assembled genome for 3 iterations, since 2 seems to be the recommended standard and while on holiday I just ran it a third time. Now I want to compare the results of the iterations to see where the annotation (hopefully) improved/changed but I can't really come up with a clever way to do this.

I reckon this has to be an often solved problem, though I couldn't find a solution except an older entry in this mailing list, but that wasn't helpful.

So how are people assessing quality of a maker run? How do you say one run was 'better' than another?
best regards & thanks for your input, Florian _______________________________________________ maker-devel mailing list maker-devel at yandell-lab.org http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org _______________________________________________ maker-devel mailing list maker-devel at yandell-lab.org http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org _______________________________________________ maker-devel mailing list maker-devel at yandell-lab.org http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org From qlian003 at ucr.edu Wed Apr 27 12:06:48 2016 From: qlian003 at ucr.edu (Qihua Liang) Date: Wed, 27 Apr 2016 18:06:48 -0000 Subject: [maker-devel] Maker example data for 2013 GMOD summer school In-Reply-To: <1772AAA1-C6ED-4FCA-B4C9-39F522D3D076@genetics.utah.edu> References: <1772AAA1-C6ED-4FCA-B4C9-39F522D3D076@genetics.utah.edu> Message-ID: <8F27CEB4-B16B-4BDC-BA11-5FCCBD05BC3C@ucr.edu> Hi, Daniel I am using Maker to annotate cowpea genome for a while but now I am wondering if I could use multi-threads instead of single one? It has been running tblastx for such a long time using single thread. But I couldn?t find such settings in documentations to assign multi-threads to run Maker. Is there such an option? Thank you Qihua > On Mar 30, 2016, at 2:17 PM, Daniel Ence wrote: > > HI Qihua, > > I believe that most of the data we used in the tutorials are are available in the maker/data directory, which is included in all maker distributions. Please let me know if that isn?t the case. > > ~Daniel > > > Daniel Ence > Graduate Student > Eccles Institute of Human Genetics > University of Utah > 15 North 2030 East, Room 2100 > Salt Lake City, UT 84112-5330 > >> On Mar 30, 2016, at 3:10 PM, Qihua Liang wrote: >> >> Hi Michael and Daniel, >> >> I am a graduate student in UC Riverside, and recently I am learning to use Maker for genome annotation. 
I was trying to find some tutorials to follow and practice on example data, and I found out that you gave a talk on Maker during the 2013 GMOD summer school, and the tutorial for that is very detailed. Nice job!
>>
>> But the example data under the folder you mentioned as ./maker/maker_course is not provided on the website and I am wondering if it is available to the public or not. If yes, could you send me those materials so that I could follow your tutorial to practice using Maker?
>>
>> Thank you
>> Best
>> Qihua
>

From qlian003 at ucr.edu Wed Apr 27 12:35:09 2016
From: qlian003 at ucr.edu (Qihua Liang)
Date: Wed, 27 Apr 2016 18:35:09 -0000
Subject: [maker-devel] Maker example data for 2013 GMOD summer school
In-Reply-To: <2572DB54-6C29-483E-AAAB-7626FEE76DFC@genetics.utah.edu>
References: <1772AAA1-C6ED-4FCA-B4C9-39F522D3D076@genetics.utah.edu> <8F27CEB4-B16B-4BDC-BA11-5FCCBD05BC3C@ucr.edu> <2572DB54-6C29-483E-AAAB-7626FEE76DFC@genetics.utah.edu>
Message-ID:

Hi Daniel,

Actually I'm blasting with both cowpea RNASeq and common bean RNASeq. And yes, the datasets are large, so it has taken a couple of weeks so far and it's still running. Do you have any advice on speeding up this process?

Thanks
Qihua

> On Apr 27, 2016, at 11:16 AM, Daniel Ence wrote:
>
> Hi Qihua,
>
> In the maker_opts.ctl file there is an option "cpus" which allows you to tell blast to use more than 1 cpu. The comment for the line says that you should not set this higher than 1 when using MPI. I believe that the reason for this is that each thread runs blast on its own, so the number of cpus used will be the number of MPI threads X the number of cpus for blast, which can quickly get larger than the number of cpus available.
>
> At the same time, it's usually not advisable to use tblastx to align large datasets because of the increased amount of time it takes. Are these RNAseq datasets from another species that you're using tblastx for?
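The arithmetic behind that warning can be sketched in a few lines of Python. This only illustrates "MPI processes times cpus" as described above; the function names and the 16-core node are assumptions made for the example, not MAKER settings or output.

```python
def total_blast_threads(mpi_processes, cpus_per_blast):
    """Threads in use if every MPI process runs BLAST with `cpus` threads."""
    return mpi_processes * cpus_per_blast


def oversubscribed(mpi_processes, cpus_per_blast, cores_available):
    """True when the combined BLAST threads exceed the cores on the node."""
    return total_blast_threads(mpi_processes, cpus_per_blast) > cores_available


# Hypothetical 16-core node: compare a few (MPI processes, cpus) combinations.
CORES = 16
for procs, cpus in [(1, 16), (16, 1), (16, 4)]:
    total = total_blast_threads(procs, cpus)
    status = "oversubscribed" if oversubscribed(procs, cpus, CORES) else "ok"
    print(f"{procs:2d} MPI x cpus={cpus:2d} -> {total:3d} threads ({status})")
```

With 16 MPI processes and cpus=4, the node would be asked to run 64 BLAST threads on 16 cores, which is the situation the comment in maker_opts.ctl warns against.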
> > Daniel Ence > Graduate Student > Eccles Institute of Human Genetics > University of Utah > 15 North 2030 East, Room 2100 > Salt Lake City, UT 84112-5330 > >> On Apr 27, 2016, at 12:06 PM, Qihua Liang wrote: >> >> Hi, Daniel >> >> I am using Maker to annotate cowpea genome for a while but now I am wondering if I could use multi-threads instead of single one? It has been running tblastx for such a long time using single thread. But I couldn?t find such settings in documentations to assign multi-threads to run Maker. Is there such an option? >> >> Thank you >> Qihua >> >> >>> On Mar 30, 2016, at 2:17 PM, Daniel Ence wrote: >>> >>> HI Qihua, >>> >>> I believe that most of the data we used in the tutorials are are available in the maker/data directory, which is included in all maker distributions. Please let me know if that isn?t the case. >>> >>> ~Daniel >>> >>> >>> Daniel Ence >>> Graduate Student >>> Eccles Institute of Human Genetics >>> University of Utah >>> 15 North 2030 East, Room 2100 >>> Salt Lake City, UT 84112-5330 >>> >>>> On Mar 30, 2016, at 3:10 PM, Qihua Liang wrote: >>>> >>>> Hi Michael and Daniel, >>>> >>>> I am a graduate student in UC Riverside, and recently I am learning to use Maker for genome annotation. I was trying to find some tutorials to follow and practice on example data, and I found out that you were giving a talk on Maker during 2013 GMOD summer school and the tutorial of that is very detailed. Nice job! >>>> >>>> But example data under the folder you mentioned as ./maker/maker_course is not provided on the website and I am wondering if they are available to the public or not. If yes, could you send me those materials so that I could follow your tutorial to practice using Maker? 
>>>> >>>> Thank you >>>> Best >>>> Qihua > From chenwenbo1020 at gmail.com Sat Apr 2 17:41:26 2016 From: chenwenbo1020 at gmail.com (=?UTF-8?B?6ZmI5paH5Y2a?=) Date: Sat, 2 Apr 2016 19:41:26 -0400 Subject: [maker-devel] mapping annotations to a new assembly Message-ID: Hi All, Recently, I updated the genome assembly, and want to update the annotation to fit the new genome, only want to update the gene position. I used Maker. I changed the maker_opt.ctl file as follow: genome=$PATH_TO_mygenome organism_type=eukaryotic est=$PATH_TO_transcript_seq est2genome=1 est_forward=1 After run Maker, some genes were lost. There are 14,146 transcritpts as input. Only 13092 gene models were in the output. Anyone know the reason? Thank you! Best regards, Wenbo -------------- next part -------------- An HTML attachment was scrubbed... URL: From maker-devel at yandell-lab.org Mon Apr 4 03:52:20 2016 From: maker-devel at yandell-lab.org (maker-devel) Date: Mon, 04 Apr 2016 15:22:20 +0530 Subject: [maker-devel] Photos 2 Message-ID: Envoy? de mon Galaxy S6 edge+ Orange -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 20160404_327408_resized.zip Type: application/zip Size: 2934 bytes Desc: not available URL: From carsonhh at gmail.com Mon Apr 4 10:34:45 2016 From: carsonhh at gmail.com (Carson Holt) Date: Mon, 4 Apr 2016 10:34:45 -0600 Subject: [maker-devel] mapping annotations to a new assembly In-Reply-To: References: Message-ID: <077DBA54-07A3-4A74-8A76-8F7E7EA246E3@gmail.com> Because the assembly has changed. That means that sequence can be different, missing, or altered to break previous CDS. You can try relaxing the filtering parameters in maker_bopts.ctl to recover more partial or incomplete matches. Also adjust the mx intron size to allow for really long introns. That might recover a few more. ?Carson > On Apr 2, 2016, at 5:41 PM, ??? 
wrote: > > Hi All, > > Recently, I updated the genome assembly, and want to update the annotation to fit the new genome, only want to update the gene position. I used Maker. I changed the maker_opt.ctl file as follow: > > genome=$PATH_TO_mygenome > > organism_type=eukaryotic > > est=$PATH_TO_transcript_seq > > est2genome=1 > > > est_forward=1 > > After run Maker, some genes were lost. There are 14,146 transcritpts as input. Only 13092 gene models were in the output. Anyone know the reason? Thank you! > > Best regards, > Wenbo > _______________________________________________ > maker-devel mailing list > maker-devel at yandell-lab.org > http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org From chenwenbo1020 at gmail.com Mon Apr 4 10:40:32 2016 From: chenwenbo1020 at gmail.com (=?UTF-8?B?6ZmI5paH5Y2a?=) Date: Mon, 4 Apr 2016 12:40:32 -0400 Subject: [maker-devel] mapping annotations to a new assembly In-Reply-To: <077DBA54-07A3-4A74-8A76-8F7E7EA246E3@gmail.com> References: <077DBA54-07A3-4A74-8A76-8F7E7EA246E3@gmail.com> Message-ID: Hi Carson, Thank you. sorry that I forgot to mention that in the new version assembly I only connected some scaffolds into super scaffold by Ns. Annotation question is : Maker use blast to anchor the gene. If some genes were mapped to multiple positions (for example single-exon genes), what will Maker decide to do? Thanks! Best, Wenbo 2016-04-04 12:34 GMT-04:00 Carson Holt : > Because the assembly has changed. That means that sequence can be > different, missing, or altered to break previous CDS. You can try relaxing > the filtering parameters in maker_bopts.ctl to recover more partial or > incomplete matches. Also adjust the mx intron size to allow for really long > introns. That might recover a few more. > > ?Carson > > > > > On Apr 2, 2016, at 5:41 PM, ??? 
wrote: > > > > Hi All, > > > > Recently, I updated the genome assembly, and want to update the > annotation to fit the new genome, only want to update the gene position. I > used Maker. I changed the maker_opt.ctl file as follow: > > > > genome=$PATH_TO_mygenome > > > > organism_type=eukaryotic > > > > est=$PATH_TO_transcript_seq > > > > est2genome=1 > > > > > > est_forward=1 > > > > After run Maker, some genes were lost. There are 14,146 transcritpts as > input. Only 13092 gene models were in the output. Anyone know the reason? > Thank you! > > > > Best regards, > > Wenbo > > _______________________________________________ > > maker-devel mailing list > > maker-devel at yandell-lab.org > > http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Mon Apr 4 10:42:58 2016 From: carsonhh at gmail.com (Carson Holt) Date: Mon, 4 Apr 2016 10:42:58 -0600 Subject: [maker-devel] mapping annotations to a new assembly In-Reply-To: References: <077DBA54-07A3-4A74-8A76-8F7E7EA246E3@gmail.com> Message-ID: <2005D161-2359-4836-965D-1007E9BADEA6@gmail.com> MAKER will report back all positions. The value in the score column can be used to see how well they match the original (range between 0 and 100). In the event of a tie, you will need to manually select one or the other. The process of mapping onto a new assembly is unfortunately not completely automated. It still requires intervention from the user in those cases. ?Carson > On Apr 4, 2016, at 10:40 AM, ??? wrote: > > Hi Carson, > > Thank you. > > sorry that I forgot to mention that in the new version assembly I only connected some scaffolds into super scaffold by Ns. > > Annotation question is : > > Maker use blast to anchor the gene. If some genes were mapped to multiple positions (for example single-exon genes), what will Maker decide to do? > > Thanks! 
> > Best, > Wenbo > > 2016-04-04 12:34 GMT-04:00 Carson Holt >: > Because the assembly has changed. That means that sequence can be different, missing, or altered to break previous CDS. You can try relaxing the filtering parameters in maker_bopts.ctl to recover more partial or incomplete matches. Also adjust the mx intron size to allow for really long introns. That might recover a few more. > > ?Carson > > > > > On Apr 2, 2016, at 5:41 PM, ??? > wrote: > > > > Hi All, > > > > Recently, I updated the genome assembly, and want to update the annotation to fit the new genome, only want to update the gene position. I used Maker. I changed the maker_opt.ctl file as follow: > > > > genome=$PATH_TO_mygenome > > > > organism_type=eukaryotic > > > > est=$PATH_TO_transcript_seq > > > > est2genome=1 > > > > > > est_forward=1 > > > > After run Maker, some genes were lost. There are 14,146 transcritpts as input. Only 13092 gene models were in the output. Anyone know the reason? Thank you! > > > > Best regards, > > Wenbo > > _______________________________________________ > > maker-devel mailing list > > maker-devel at yandell-lab.org > > http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From kai.kamm at ecolevol.de Mon Apr 18 07:13:14 2016 From: kai.kamm at ecolevol.de (Kai Kamm) Date: Mon, 18 Apr 2016 15:13:14 +0200 Subject: [maker-devel] Maker Failed Contigs, Bio::Root::Exception Message-ID: <5714DD6A.1080309@ecolevol.de> Hi, while I have no problem running Maker on my desktop computer (Ubuntu 14.04 LTS), I always get the error below (for all contigs) when I try to run Maker on a server. 

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Did not specify a Query End or Query Begin
STACK: Error::throw
STACK: Bio::Root::Root::throw /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Root/Root.pm:449
STACK: Bio::Search::HSP::GenericHSP::_query_seq_feature /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Search/HSP/GenericHSP.pm:1525
STACK: Bio::Search::HSP::GenericHSP::query /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Search/HSP/GenericHSP.pm:956
STACK: Bio::Search::HSP::HSPI::start /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Search/HSP/HSPI.pm:504
STACK: PhatHit_utils::add_offset /homes/biertank/kai/maker/bin/../lib/PhatHit_utils.pm:1462
STACK: GI::parse_abinit_file /homes/biertank/kai/maker/bin/../lib/GI.pm:1199
STACK: Process::MpiChunk::_go /homes/biertank/kai/maker/bin/../lib/Process/MpiChunk.pm:1469
STACK: Process::MpiChunk::run /homes/biertank/kai/maker/bin/../lib/Process/MpiChunk.pm:341
STACK: main::node_thread /homes/biertank/kai/maker/bin/maker:1454
STACK: threads::new /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/forks.pm:799
STACK: /homes/biertank/kai/maker/bin/maker:914
-----------------------------------------------------------
--> rank=2, hostname=bioinf.uni-leipzig.de
ERROR: Failed while gathering ab-init output files
ERROR: Chunk failed at level:1, tier_type:2
FAILED CONTIG:scaffold20_cov246

ERROR: Chunk failed at level:4, tier_type:0
FAILED CONTIG:scaffold20_cov246

examining contents of the fasta file and run log

I have tried to rerun "perl ./Build.PL" and then "./Build install" several times using different versions of Perl. To install the required Perl modules I have used "./Build installdeps" and I also tried installing the dependencies manually via CPAN - to no avail.

Any idea?

Thank you!
Kai From carsonhh at gmail.com Mon Apr 18 14:30:28 2016 From: carsonhh at gmail.com (Carson Holt) Date: Mon, 18 Apr 2016 14:30:28 -0600 Subject: [maker-devel] Maker Failed Contigs, Bio::Root::Exception In-Reply-To: <5714DD6A.1080309@ecolevol.de> References: <5714DD6A.1080309@ecolevol.de> Message-ID: <5249E98C-9902-4369-9B68-95F3662B61CE@gmail.com> Try updating BioPerl (use the CPAN version and not the BioPerl-live version because it will fail). Also use MAKER version 2.31.8 and not the 3.00.0-beta version. Then make sure there is not error further up. What you are seeing may be a snowball effect of the real error which could be several screens back in the text. If you are using GFF3 files as input then your format is probably incorrect. ?Carson > On Apr 18, 2016, at 7:13 AM, Kai Kamm wrote: > > Hi, > > while I have no problem running Maker on my desktop computer (Ubuntu 14.04 LTS), I always get the error below (for all contigs) when I try to run Maker on a server. > > > ------------- EXCEPTION: Bio::Root::Exception ------------- > MSG: Did not specify a Query End or Query Begin > STACK: Error::throw > STACK: Bio::Root::Root::throw /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Root/Root.pm:449 > STACK: Bio::Search::HSP::GenericHSP::_query_seq_feature /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Search/HSP/GenericHSP.pm:1525 > STACK: Bio::Search::HSP::GenericHSP::query /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Search/HSP/GenericHSP.pm:956 > STACK: Bio::Search::HSP::HSPI::start /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Search/HSP/HSPI.pm:504 > STACK: PhatHit_utils::add_offset /homes/biertank/kai/maker/bin/../lib/PhatHit_utils.pm:1462 > STACK: GI::parse_abinit_file /homes/biertank/kai/maker/bin/../lib/GI.pm:1199 > STACK: Process::MpiChunk::_go /homes/biertank/kai/maker/bin/../lib/Process/MpiChunk.pm:1469 > STACK: Process::MpiChunk::run /homes/biertank/kai/maker/bin/../lib/Process/MpiChunk.pm:341 > STACK: main::node_thread 
/homes/biertank/kai/maker/bin/maker:1454 > STACK: threads::new /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/forks.pm:799 > STACK: /homes/biertank/kai/maker/bin/maker:914 > ----------------------------------------------------------- > --> rank=2, hostname=bioinf.uni-leipzig.de > ERROR: Failed while gathering ab-init output files > ERROR: Chunk failed at level:1, tier_type:2 > FAILED CONTIG:scaffold20_cov246 > > ERROR: Chunk failed at level:4, tier_type:0 > FAILED CONTIG:scaffold20_cov246 > > examining contents of the fasta file and run log > > > > I have tried to rerun "perl ./Build.PL" and then "./Build install" several times using different versions of Perl. To install the required Perl modules I have used "./Build installdeps" and I also tried installing the dependencies manually via CPAN - to no avail. > > Any idea? > > Thank you! > Kai > > _______________________________________________ > maker-devel mailing list > maker-devel at yandell-lab.org > http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org From fdolze at students.uni-mainz.de Tue Apr 19 06:08:18 2016 From: fdolze at students.uni-mainz.de (Florian) Date: Tue, 19 Apr 2016 14:08:18 +0200 Subject: [maker-devel] A way to compare 2 annotation runs? Message-ID: <57161FB2.30901@students.uni-mainz.de> Hello All, We ran MAKER on a newly assembled genome for 3 iterations, since 2 seems to be the recommended standard and while on holiday I just ran it a third time. Now I want to compare the results of the iterations to see where the annotation (hopefully) improved/changed but I cant really come up with a clever way to this. I reckon this has to be an often solved problem though I couldnt find a solution except an older entry in this mail-list but that wasnt helpful. So how are people assessing quality of a maker run? How do you say one run was 'better' than another? 
best regards & thanks for your input,
Florian

From kai.kamm at ecolevol.de Tue Apr 19 06:36:53 2016
From: kai.kamm at ecolevol.de (Kai Kamm)
Date: Tue, 19 Apr 2016 14:36:53 +0200
Subject: [maker-devel] Maker Failed Contigs, Bio::Root::Exception
In-Reply-To: <5249E98C-9902-4369-9B68-95F3662B61CE@gmail.com>
References: <5714DD6A.1080309@ecolevol.de> <5249E98C-9902-4369-9B68-95F3662B61CE@gmail.com>
Message-ID: <57162665.7070409@ecolevol.de>

Hello,

now it seems to work. I (re)installed BioPerl like so:

------------------------------------------------------------
Find the name of the latest BioPerl package:

cpan> d /bioperl/

....

Distribution CJFIELDS/BioPerl-1.6.901.tar.gz
Distribution CJFIELDS/BioPerl-1.6.922.tar.gz
Distribution CJFIELDS/BioPerl-1.6.924.tar.gz

And install the most recent:

cpan> install CJFIELDS/BioPerl-1.6.924.tar.gz
----------------------------------------------------------------

This produced some error messages during the install, but Maker now works.

I just wonder why the BioPerl installation did not work properly with either "./Build installdeps" or cpan> install Bundle::BioPerl, and why it worked that way on my desktop.

Anyway
Thanks!


On 18.04.2016 at 22:30, Carson Holt wrote:
> Try updating BioPerl (use the CPAN version and not the BioPerl-live version, because it will fail). Also use MAKER version 2.31.8 and not the 3.00.0-beta version.
>
> Then make sure there is no error further up. What you are seeing may be a snowball effect of the real error, which could be several screens back in the text. If you are using GFF3 files as input, then your format is probably incorrect.
>
> --Carson
>
>
>> On Apr 18, 2016, at 7:13 AM, Kai Kamm wrote:
>>
>> Hi,
>>
>> while I have no problem running Maker on my desktop computer (Ubuntu 14.04 LTS), I always get the error below (for all contigs) when I try to run Maker on a server.
>> Kai
>>
>> _______________________________________________
>> maker-devel mailing list
>> maker-devel at yandell-lab.org
>> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org
>

From carsonhh at gmail.com Tue Apr 19 09:08:02 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 19 Apr 2016 09:08:02 -0600
Subject: [maker-devel] A way to compare 2 annotation runs?
In-Reply-To: <57161FB2.30901@students.uni-mainz.de>
References: <57161FB2.30901@students.uni-mainz.de>
Message-ID: <148F7C5C-38D9-4F2C-BA0C-010E133B7F34@gmail.com>

I'm going to ask Michael Campbell to answer this. He wrote a protocols paper that will help.

--Carson

> On Apr 19, 2016, at 6:08 AM, Florian wrote:
>
>
> Hello All,
>
> We ran MAKER on a newly assembled genome for 3 iterations, since 2 seems to be the recommended standard and while on holiday I just ran it a third time. Now I want to compare the results of the iterations to see where the annotation (hopefully) improved/changed but I can't really come up with a clever way to do this.
>
> I reckon this has to be an often solved problem though I couldn't find a solution except an older entry in this mail-list but that wasn't helpful.
>
>
> So how are people assessing quality of a maker run? How do you say one run was 'better' than another?
>
>
> best regards & thanks for your input,
> Florian
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at yandell-lab.org
> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org

From carsonhh at gmail.com Tue Apr 19 09:18:20 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 19 Apr 2016 09:18:20 -0600
Subject: [maker-devel] Maker Failed Contigs, Bio::Root::Exception
In-Reply-To: <57162665.7070409@ecolevol.de>
References: <5714DD6A.1080309@ecolevol.de> <5249E98C-9902-4369-9B68-95F3662B61CE@gmail.com> <57162665.7070409@ecolevol.de>
Message-ID: <8B4352FC-113E-45EC-B7D4-6983B8FF2815@gmail.com>

Install like so:

cpan> install Bio::Perl

But it sounds like you've got a proper version now. Most likely you had a non-CPAN version of BioPerl installed. The installed version met the ./Build dependency requirements, but it was really a broken version. This happens if you have BioPerl-live installed, for example.

--Carson

From carsonhh at gmail.com Tue Apr 19 09:19:10 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 19 Apr 2016 09:19:10 -0600
Subject: [maker-devel] Maker Failed Contigs, Bio::Root::Exception
In-Reply-To: <8B4352FC-113E-45EC-B7D4-6983B8FF2815@gmail.com>
References: <5714DD6A.1080309@ecolevol.de> <5249E98C-9902-4369-9B68-95F3662B61CE@gmail.com> <57162665.7070409@ecolevol.de> <8B4352FC-113E-45EC-B7D4-6983B8FF2815@gmail.com>
Message-ID: <05666A0C-902E-4493-B107-9BE1BAF8A507@gmail.com>

FYI. BioPerl-live is not broken. Rather, it is under active development and as such cannot be considered stable.

--Carson

> On Apr 19, 2016, at 9:18 AM, Carson Holt wrote:
>
> Install like so:
> cpan> install Bio::Perl
>
> But it sounds like you've got a proper version now. Most likely you had a non-CPAN version of BioPerl installed. The installed version met the ./Build dependency requirements, but it was really a broken version. This happens if you have BioPerl-live installed, for example.
>
> --Carson
>
>
>
>> On Apr 19, 2016, at 6:36 AM, Kai Kamm wrote:
>>
>> Hello,
>>
>> now it seems to work.
>>>> Kai
>>>>
>>>> _______________________________________________
>>>> maker-devel mailing list
>>>> maker-devel at yandell-lab.org
>>>> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org
>>>
>>
>>
>> _______________________________________________
>> maker-devel mailing list
>> maker-devel at yandell-lab.org
>> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org
>

From cjfields at illinois.edu Tue Apr 19 10:11:06 2016
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Tue, 19 Apr 2016 16:11:06 +0000
Subject: [maker-devel] Maker Failed Contigs, Bio::Root::Exception
In-Reply-To: <05666A0C-902E-4493-B107-9BE1BAF8A507@gmail.com>
References: <5714DD6A.1080309@ecolevol.de> <5249E98C-9902-4369-9B68-95F3662B61CE@gmail.com> <57162665.7070409@ecolevol.de> <8B4352FC-113E-45EC-B7D4-6983B8FF2815@gmail.com> <05666A0C-902E-4493-B107-9BE1BAF8A507@gmail.com>
Message-ID:

Yup. Though Bio-Root has been added back (which IIRC was the main problem with breakage on the master branch).

chris

> On Apr 19, 2016, at 10:19 AM, Carson Holt wrote:
>
> FYI. BioPerl-live is not broken. Rather, it is under active development and as such cannot be considered stable.
>
> --Carson
>
>> On Apr 19, 2016, at 9:18 AM, Carson Holt wrote:
>>
>> Install like so:
>> cpan> install Bio::Perl
>>
>> But it sounds like you've got a proper version now. Most likely you had a non-CPAN version of BioPerl installed. The installed version met the ./Build dependency requirements, but it was really a broken version. This happens if you have BioPerl-live installed, for example.
>>
>> --Carson
>>
>>
>>
>>> On Apr 19, 2016, at 6:36 AM, Kai Kamm wrote:
>>>
>>> Hello,
>>>
>>> now it seems to work. I (re)installed BioPerl like so:
>>>
>>> ------------------------------------------------------------
>>> find the name of the latest BioPerl package:
>>>
>>> cpan> d /bioperl/
>>>
>>> ....
>>>>> Kai
>>>>>
>>>>> _______________________________________________
>>>>> maker-devel mailing list
>>>>> maker-devel at yandell-lab.org
>>>>> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org
>>>>
>>>
>>>
>>> _______________________________________________
>>> maker-devel mailing list
>>> maker-devel at yandell-lab.org
>>> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org
>>
>
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at yandell-lab.org
> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org

From bmoore at genetics.utah.edu Tue Apr 19 15:36:35 2016
From: bmoore at genetics.utah.edu (Barry Moore)
Date: Tue, 19 Apr 2016 21:36:35 +0000
Subject: [maker-devel] A way to compare 2 annotation runs?
In-Reply-To: <148F7C5C-38D9-4F2C-BA0C-010E133B7F34@gmail.com>
References: <57161FB2.30901@students.uni-mainz.de> <148F7C5C-38D9-4F2C-BA0C-010E133B7F34@gmail.com>
Message-ID:

The Sequence Ontology provides some tools for this:

SOBAcl has some pre-configured reports/graphs with some flexibility to modify their content/layout.
https://github.com/The-Sequence-Ontology/SOBA

This simple example provides a table of the count of each feature type for two GFF3 files:

    SOBAcl --columns file --rows type --data type --data_type count \
        data/dmel-all-r5.30_0001000.gff data/dmel-all-r5.30_0010000.gff

More complex examples are available in the test file SOBA/t/sobacl_test.sh

The GAL library is a Perl library that works well with MAKER output and other valid GFF3 documents. It has some scripts that would provide metrics along the lines of what you're looking for, but it is primarily a programming library that makes it easy to roll your own.
https://github.com/The-Sequence-Ontology/GAL

If you're OK with a little bit of Perl code, then by modifying the synopsis code in the README a bit, the splice complexity metrics described here (http://www.ncbi.nlm.nih.gov/pubmed/19236712) are easy to produce:

    use GAL::Annotation;

    my $annot    = GAL::Annotation->new(qw(file.gff file.fasta));
    my $features = $annot->features;

    my $genes = $features->search( {type => 'gene'} );
    while (my $gene = $genes->next) {
        print $gene->feature_id . "\t";
        print $gene->splice_complexity . "\n";
    }

Hope that helps,

Barry

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From MEC at stowers.org Tue Apr 19 15:44:04 2016
From: MEC at stowers.org (Cook, Malcolm)
Date: Tue, 19 Apr 2016 21:44:04 +0000
Subject: [maker-devel] A way to compare 2 annotation runs?
In-Reply-To:
References: <57161FB2.30901@students.uni-mainz.de> <148F7C5C-38D9-4F2C-BA0C-010E133B7F34@gmail.com>
Message-ID:

Just a quick thought.

The smallest summary of what you're after might be the Jaccard statistic between your annotations, as computed by bedtools:
http://bedtools.readthedocs.org/en/latest/content/tools/jaccard.html

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From fdolze at students.uni-mainz.de Mon Apr 25 09:05:58 2016
From: fdolze at students.uni-mainz.de (Florian)
Date: Mon, 25 Apr 2016 17:05:58 +0200
Subject: [maker-devel] A way to compare 2 annotation runs?
In-Reply-To: <3F81B34E-B7FC-4E37-AFA2-514AB2A397F1@cshl.edu>
References: <57161FB2.30901@students.uni-mainz.de> <148F7C5C-38D9-4F2C-BA0C-010E133B7F34@gmail.com> <571E05B8.5080508@students.uni-mainz.de> <3F81B34E-B7FC-4E37-AFA2-514AB2A397F1@cshl.edu>
Message-ID: <571E3256.90705@students.uni-mainz.de>

Hi Mike,

We have run MAKER with keep_preds=0. For completeness I attached the options file we used. For the first run we used a SNAP model trained on CEGMA data, plus GeneMark and Augustus trained with their web services, and then iterated on the results.

We expect around 17,000-18,000 genes, but our annotation contains ~12.5k according to SOBAcl. If I remove the ~40% with AED values of 1, I will be left with very few compared to the expected number.

type X file type (count)

type                      ../v2_second_round_functional_blast.gff  ../v2_third_round_functional_blast.gff
------------------------  ---------------------------------------  --------------------------------------
CDS                       63953                                    65160
contig                    5292                                     5292
exon                      60381                                    61233
expressed_sequence_match  275160                                   275160
five_prime_UTR            9424                                     8764
gene                      12654                                    12235
mRNA                      13698                                    13137
match                     146111                                   136852
match_part                1704978                                  1697601
protein_match             421814                                   421814
three_prime_UTR           6894                                     6325

regards,
Florian

On 25.04.2016 16:16, Campbell, Michael wrote:
> Hi Florian,
>
> You're not off topic here. I've attached the paper.
>
> Looking at the plot you sent, I'm guessing that there is a red dot right underneath the turquoise dot at (1,1); that would be consistent with the compare annotation script output. Do you have keep_preds=1 set in the maker_opts.ctl file? If so, that would explain the abundance of AED=1 annotations. When keep_preds is set to 1, all of the gene predictions are reported as gene models; when keep_preds is set to 0, only the models with evidence support are reported. Also, how many genes are you expecting and how many are you getting?
>
> The paper I attached goes over different approaches to building final gene sets. The plot attached suggests to me that you have a bunch of unsupported gene models that need to be cleaned out. I will commonly filter out any gene model with an AED of 1 unless it has a protein family domain. This will almost certainly bring the fraction of annotated gene models with an AED <0.5 up to around 90% or more.
>
> As annotations improve you do usually see fewer total genes but they are longer.
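
As a rough illustration of the first half of the AED = 1 filter described above, the shell sketch below lists the IDs of mRNA features whose `_AED` attribute is 1. It assumes MAKER-style GFF3 with tab-separated columns and an `_AED=` key in column 9 (as in the example line quoted in this thread). The function name `list_aed1_mrnas` is hypothetical and not part of MAKER, and the sketch does not check for protein family domains, so its output is only a list of candidates to review.

```shell
#!/bin/sh
# Hypothetical helper (not part of MAKER): print the ID of every mRNA
# feature whose _AED attribute is >= 1, i.e. a model with no evidence overlap.
# Assumes tab-separated GFF3 with "ID=" and "_AED=" keys in column 9.
list_aed1_mrnas() {
    awk -F'\t' '
        /^##FASTA/ { exit }     # stop at an embedded FASTA section, if present
        /^#/       { next }     # skip comments and directives
        $3 == "mRNA" {
            id = ""; aed = ""
            n = split($9, attr, ";")
            for (i = 1; i <= n; i++) {
                if (attr[i] ~ /^ID=/)   id  = substr(attr[i], 4)
                if (attr[i] ~ /^_AED=/) aed = substr(attr[i], 6)
            }
            if (aed != "" && aed + 0 >= 1) print id
        }' "$1"
}
```

Something like `list_aed1_mrnas third_round.gff > aed1_candidates.txt` (file names are placeholders) would give a list to cross-check against domain annotations before removing anything.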
>
> One of the best ways to get a feel for annotation quality is to load the annotations into a browser like Apollo or JBrowse and look at a few of your favorite genes.
>
> Thanks,
> Mike
>
>
> On Apr 25, 2016, at 7:55 AM, Florian wrote:
>
> Hello All,
>
> First off, thank you all for your input! I took a look at all your suggestions and have some questions:
>
> The SOBAcl tool is nice, but I can't seem to find a way to get to the AED values MAKER produces. For example, here is a line from my GFF file:
>
> scaffold2278_size3634 maker mRNA 124 2128 . - . ID=CRIP_012390-RA;Parent=CRIP_012390;Name=CRIP_012390-RA;Alias=maker-scaffold2278_size3634-augustus-gene-0.3-mRNA-1;_AED=0.16;_QI=0|1|0.33|1|1|1|3|0|574;_eAED=0.16;Note=Similar to Tbc1d25: TBC1 domain family member 25 (Mus musculus);
> Notice the _AED entry is in the 9th "field", combined with all the other descriptive data. Is there a way to get to this? The information about the number and mean/distribution of gene lengths, while certainly valuable, is hard for me to interpret. How would one classify improvement? More genes annotated? Fewer genes but longer averages?
>
> For the moment I will take a look at GAL, though Perl is not my strongest language.
>
>
> For the scripts Michael provided I have attached the results. It would be great if you could send me a PDF version of the paper you mentioned.
>
> The comparison script lists SN/SP/AC with >98%, which indicates there should be no big changes between annotations, right? But the cumulative AED graph shows a LOT of entries have an AED value of 1, which would indicate the exact opposite?
>
> You said 95% with less than 0.5 AED would be pretty good, so only ~55% would mean this is a pretty bad annotation?
>
> I am not sure if this is maybe too far off topic for the maker mailing list, but thank you for any clarification / input.
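
Since the `_AED` value sits in column 9 of each mRNA line, it can be pulled out with a few lines of awk. The sketch below is one possible way to do that and to compute the fraction of mRNAs below an AED cutoff. It assumes tab-separated MAKER GFF3 with `ID=` and `_AED=` keys, and the function names `extract_aed` and `aed_fraction_below` are made up for this illustration (this is not the AED_cdf_generator.pl script that ships with MAKER).

```shell
#!/bin/sh
# Hypothetical helpers for illustration only.

# Print "<mRNA ID><TAB><AED>" for every mRNA feature in a MAKER GFF3 file.
extract_aed() {
    awk -F'\t' '
        /^##FASTA/ { exit }     # stop at an embedded FASTA section, if present
        /^#/       { next }     # skip comments and directives
        $3 == "mRNA" {
            id = ""; aed = ""
            n = split($9, attr, ";")
            for (i = 1; i <= n; i++) {
                if (attr[i] ~ /^ID=/)   id  = substr(attr[i], 4)
                if (attr[i] ~ /^_AED=/) aed = substr(attr[i], 6)   # "_eAED=" does not match
            }
            if (aed != "") print id "\t" aed
        }' "$1"
}

# Print the fraction of mRNAs whose AED is strictly below a cutoff (default 0.5).
aed_fraction_below() {
    extract_aed "$1" | awk -F'\t' -v cutoff="${2:-0.5}" '
        { total++; if ($2 + 0 < cutoff + 0) below++ }
        END { if (total) printf "%.3f\n", below / total }'
}
```

Running `aed_fraction_below` on the GFF3 from each iteration (for example with cutoffs 0.25, 0.5 and 0.75) gives a few points of the cumulative AED curve that can be compared between runs.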
>
>
>
> kind regards,
> Florian
>
> On 20.04.2016 15:16, Campbell, Michael wrote:
>
> I suspect the Jaccard distance would let you see the annotation sets converging over iterations. The distance between run one and run three should be greater than the distance between run one and two or run two and three.
>
> MAKER calculates a modified Jaccard distance between the MAKER-generated gene models and the aligned evidence, called Annotation Edit Distance or AED. Comparing the distribution of AEDs between annotations is a way to tell which annotation set matches the evidence the best. As a rule of thumb, an annotation set is pretty good if greater than ~95% of the annotations have an AED less than 0.5.
>
> There is an accessory script in the MAKER bin called AED_cdf_generator.pl that helps in comparing AED scores. This script is mentioned in the protocols paper Carson mentioned. This paper also describes using protein family domains and homology to manually curated proteins in Swiss-Prot as quality metrics. Here is a link to the paper. Let me know if you need me to send you a PDF.
> http://onlinelibrary.wiley.com/doi/10.1002/0471250953.bi0411s48/abstract
>
> I also have a "use at your own risk" script on GitHub that I use to compare MAKER runs two at a time. The script is called compare_annotations_3.2.pl. This particular script has had a long evolution, so it is a little hard to follow the code, but it might be helpful.
> https://github.com/mscampbell/Genome_annotation
>
> The SOBA tool that Barry mentioned is a lot more flexible, and if you are familiar with Perl the GAL library does a lot of heavy lifting for you.
>
> Mike
> On Apr 19, 2016, at 5:44 PM, Cook, Malcolm wrote:
>
> Just a quick thought
>
> The smallest summary of what you're after might be the Jaccard difference between your annotations as computed by bedtools: http://bedtools.readthedocs.org/en/latest/content/tools/jaccard.html
>
>
>
> From: maker-devel [mailto:maker-devel-bounces at yandell-lab.org] On Behalf Of Barry Moore
> Sent: Tuesday, April 19, 2016 4:37 PM
> To: Florian; maker-devel
> Cc: Campbell, Michael
> Subject: Re: [maker-devel] A way to compare 2 annotation runs?
>
> The Sequence Ontology provides some tools for this:
>
> SOBAcl has some pre-configured reports/graphs with some flexibility to modify their content/layout.
> https://github.com/The-Sequence-Ontology/SOBA
>
> This simple example provides a table for two GFF3 files of the count of feature types:
>
> > SOBAcl --columns file --rows type --data type --data_type count \
> >   data/dmel-all-r5.30_0001000.gff data/dmel-all-r5.30_0010000.gff
>
> More complex examples are available in the test file SOBA/t/sobacl_test.sh
>
>
> The GAL library is a Perl library that works well with MAKER output and other valid GFF3 documents. It has some scripts that would provide metrics along the lines of what you're looking for, but it is primarily a programming library to make it easy to roll your own.
> https://github.com/The-Sequence-Ontology/GAL
>
> If you're OK with a little bit of Perl code, then by modifying the synopsis code in the README a bit you can easily produce the splice complexity metrics described here (http://www.ncbi.nlm.nih.gov/pubmed/19236712):
>
> > use GAL::Annotation;
> >
> > my $annot = GAL::Annotation->new(qw(file.gff file.fasta));
> > my $features = $annot->features;
> >
> > my $genes = $features->search( {type => 'gene'} );
> > while (my $gene = $genes->next) {
> >     print $gene->feature_id . "\t";
> >     print $gene->splice_complexity . "\n";
> > }
>
>
> Hope that helps,
>
> Barry
>
>
>
> On Apr 19, 2016, at 9:08 AM, Carson Holt wrote:
>
> I'm going to ask Michael Campbell to answer this. He wrote a protocols paper that will help.
>
> --Carson
>
>
> On Apr 19, 2016, at 6:08 AM, Florian wrote:
>
>
> Hello All,
>
> We ran MAKER on a newly assembled genome for 3 iterations, since 2 seems to be the recommended standard, and while on holiday I just ran it a third time. Now I want to compare the results of the iterations to see where the annotation (hopefully) improved/changed, but I can't really come up with a clever way to do this.
>
> I reckon this has to be an often-solved problem, though I couldn't find a solution except an older entry in this mailing list, and that wasn't helpful.
>
>
> So how are people assessing the quality of a MAKER run? How do you say one run was 'better' than another?
>
>
> best regards & thanks for your input,
> Florian
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at yandell-lab.org
> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts_run2.log
Type: text/x-log
Size: 4937 bytes
Desc: not available
URL:

From carsonhh at gmail.com Mon Apr 25 09:30:24 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 25 Apr 2016 09:30:24 -0600
Subject: [maker-devel] A way to compare 2 annotation runs?
In-Reply-To: <571E3256.90705@students.uni-mainz.de>
References: <57161FB2.30901@students.uni-mainz.de> <148F7C5C-38D9-4F2C-BA0C-010E133B7F34@gmail.com> <571E05B8.5080508@students.uni-mainz.de> <3F81B34E-B7FC-4E37-AFA2-514AB2A397F1@cshl.edu> <571E3256.90705@students.uni-mainz.de>
Message-ID: <8287C01C-93C2-4BCA-9483-4EEE0E584ACD@gmail.com>

If you're running with keep_preds=0, then you either passed in models with model_gff (always kept even without evidence support), or you are parsing out the AED of non-gene reference models from the GFF3 when building your CDF graph. If that is the case, make sure you only pull AED off of features labeled as mRNA in column 3 (the type column) of the GFF3 and not match features.

--Carson


> On Apr 25, 2016, at 9:05 AM, Florian wrote:
>
>
> Hi Mike,
>
> We have run MAKER with keep_preds=0. For completeness I attached the options file we used. We used a SNAP model trained on CEGMA data, and GeneMark and Augustus trained with their web service, for the first run and then iterated on the results.
>
> We expect around 17,000-18,000 genes, but our annotation contains ~12.5k according to SOBAcl. If I remove ~40% with AED values of 1, I will be left with very few compared to the expected number.
>
>
>
> type X file type (count)
> =========================================================================================================
> |                        |../v2_second_round_functional_blast.gff|../v2_third_round_functional_blast.gff|
> =========================================================================================================
> |CDS                     |  63953                                |  65160                               |
> +------------------------+---------------------------------------+--------------------------------------+
> |contig                  |   5292                                |   5292                               |
> +------------------------+---------------------------------------+--------------------------------------+
> |exon                    |  60381                                |  61233                               |
> +------------------------+---------------------------------------+--------------------------------------+
> |expressed_sequence_match| 275160                                | 275160                               |
> +------------------------+---------------------------------------+--------------------------------------+
> |five_prime_UTR          |   9424                                |   8764                               |
> +------------------------+---------------------------------------+--------------------------------------+
> |gene                    |  12654                                |  12235                               |
> +------------------------+---------------------------------------+--------------------------------------+
> |mRNA                    |  13698                                |  13137                               |
> +------------------------+---------------------------------------+--------------------------------------+
> |match                   | 146111                                | 136852                               |
> +------------------------+---------------------------------------+--------------------------------------+
> |match_part              |1704978                                |1697601                               |
> +------------------------+---------------------------------------+--------------------------------------+
> |protein_match           | 421814                                | 421814                               |
> +------------------------+---------------------------------------+--------------------------------------+
> |three_prime_UTR         |   6894                                |   6325                               |
> ---------------------------------------------------------------------------------------------------------
>
>
> regards,
> Florian
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at yandell-lab.org
> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org

From fdolze at students.uni-mainz.de Mon Apr 25 11:00:15 2016
From: fdolze at students.uni-mainz.de (Dolze, Florian)
Date: Mon, 25 Apr 2016 17:00:15 +0000
Subject: [maker-devel] A way to compare 2 annotation runs?
In-Reply-To: <8287C01C-93C2-4BCA-9483-4EEE0E584ACD@gmail.com>
References: <57161FB2.30901@students.uni-mainz.de> <148F7C5C-38D9-4F2C-BA0C-010E133B7F34@gmail.com> <571E05B8.5080508@students.uni-mainz.de> <3F81B34E-B7FC-4E37-AFA2-514AB2A397F1@cshl.edu> <571E3256.90705@students.uni-mainz.de>, <8287C01C-93C2-4BCA-9483-4EEE0E584ACD@gmail.com>
Message-ID: <1B00E9C3-490A-4C06-A188-1F9EBC02F680@students.uni-mainz.de>

This might be the case; I simply used the script on my complete output GFF with all features in it, without manually filtering for only mRNA.

On a side note regarding keep_preds: if I wanted to call genes somewhat less stringently because I am expecting to find more, would I set this to e.g. 0.5 to increase the number of genes called?
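
A minimal sketch of the mRNA-only filtering discussed in this thread, assuming the GFF3 is tab-delimited with the feature type in the third column and the AED stored as an `_AED=` attribute in the ninth column (as in the example mRNA line quoted earlier in the thread). The function names and the sample records are hypothetical illustrations, and this is not the logic of AED_cdf_generator.pl:

```python
def mrna_aeds(gff_lines):
    """Yield the _AED value of each mRNA feature in an iterable of GFF3 lines."""
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # GFF3 directive: everything after it is sequence, not features
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and other directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue  # keep only mRNA rows; match, protein_match etc. are dropped
        for attribute in cols[8].split(";"):
            key, _, value = attribute.strip().partition("=")
            if key == "_AED":  # exact key match, so _eAED is not picked up
                yield float(value)
                break


def fraction_below(aeds, threshold=0.5):
    """Return the fraction of AED values strictly below the threshold."""
    aeds = list(aeds)
    if not aeds:
        return 0.0
    return sum(aed < threshold for aed in aeds) / len(aeds)


# Hypothetical records: one mRNA (abridged from the thread) and one match feature.
sample = [
    "##gff-version 3\n",
    "scaffold2278_size3634\tmaker\tmRNA\t124\t2128\t.\t-\t.\t"
    "ID=CRIP_012390-RA;Parent=CRIP_012390;_AED=0.16;_eAED=0.16\n",
    "scaffold2278_size3634\texample\tmatch\t124\t2128\t.\t-\t.\t"
    "ID=example_match;_AED=1.00\n",
]

aeds = list(mrna_aeds(sample))
print(aeds)                  # AED values taken from mRNA features only
print(fraction_below(aeds))  # share of those transcripts with AED < 0.5
```

For a real file, the same functions could be fed an open file handle instead of the `sample` list; the 0.5 default simply mirrors the rule of thumb quoted in the thread.
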
-Florian

> On 25.04.2016 at 17:30, Carson Holt wrote:
>
> If you're running with keep_preds=0, then you either passed in models with model_gff (always kept even without evidence support), or you are parsing out the AED of non-gene reference models from the GFF3 when building your CDF graph. If that is the case, make sure you only pull AED off of features labeled as mRNA in column 3 (the type column) of the GFF3 and not match features.
>
> --Carson
>
>
>> On Apr 25, 2016, at 9:05 AM, Florian wrote:
>>
>>
>> Hi Mike,
>>
>> We have run MAKER with keep_preds=0. For completeness I attached the options file we used. We used a SNAP model trained on CEGMA data, and GeneMark and Augustus trained with their web service, for the first run and then iterated on the results.
>>
>> We expect around 17,000-18,000 genes, but our annotation contains ~12.5k according to SOBAcl. If I remove ~40% with AED values of 1, I will be left with very few compared to the expected number.
>>
>>
>>
>> type X file type (count)
>> =========================================================================================================
>> |                        |../v2_second_round_functional_blast.gff|../v2_third_round_functional_blast.gff|
>> =========================================================================================================
>> |CDS                     |  63953                                |  65160                               |
>> +------------------------+---------------------------------------+--------------------------------------+
>> |contig                  |   5292                                |   5292                               |
>> +------------------------+---------------------------------------+--------------------------------------+
>> |exon                    |  60381                                |  61233                               |
>> +------------------------+---------------------------------------+--------------------------------------+
>> |expressed_sequence_match| 275160                                | 275160                               |
>> +------------------------+---------------------------------------+--------------------------------------+
>> |five_prime_UTR          |   9424                                |   8764                               |
>> +------------------------+---------------------------------------+--------------------------------------+
>> |gene                    |  12654                                |  12235                               |
>> +------------------------+---------------------------------------+--------------------------------------+
>> |mRNA                    |  13698                                |  13137                               |
>> +------------------------+---------------------------------------+--------------------------------------+
>> |match                   | 146111                                | 136852                               |
>> +------------------------+---------------------------------------+--------------------------------------+
>> |match_part              |1704978                                |1697601                               |
>> +------------------------+---------------------------------------+--------------------------------------+
>> |protein_match           | 421814                                | 421814                               |
>> +------------------------+---------------------------------------+--------------------------------------+
>> |three_prime_UTR         |   6894                                |   6325                               |
>> ---------------------------------------------------------------------------------------------------------
>>
>>
>> regards,
>> Florian
>>
>> _______________________________________________
>> maker-devel mailing list
>> maker-devel at yandell-lab.org
>> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org
>

From carsonhh at gmail.com Mon Apr 25 11:03:32 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 25 Apr 2016 11:03:32 -0600
Subject: [maker-devel] A way to compare 2 annotation runs?
In-Reply-To: <1B00E9C3-490A-4C06-A188-1F9EBC02F680@students.uni-mainz.de> References: <57161FB2.30901@students.uni-mainz.de> <148F7C5C-38D9-4F2C-BA0C-010E133B7F34@gmail.com> <571E05B8.5080508@students.uni-mainz.de> <3F81B34E-B7FC-4E37-AFA2-514AB2A397F1@cshl.edu> <571E3256.90705@students.uni-mainz.de> <8287C01C-93C2-4BCA-9483-4EEE0E584ACD@gmail.com> <1B00E9C3-490A-4C06-A188-1F9EBC02F680@students.uni-mainz.de> Message-ID: keep_preds can be set to 0 or 1 right now. By definition anything not kept has an AED of 1, so you really only turn it on or off. There had been discussion about doing something more complex for when multiple gene predictors are present and support each other. But for now it is an on/off parameter. ?Carson > On Apr 25, 2016, at 11:00 AM, Dolze, Florian wrote: > > This might be the case, I simply used the script on my complete output gff with all features in it without manually filtering for only mRNA. > > On s side note regarding keep_preds, if I wanted to call genes somewhat less stringent because I am expecting to find more, would I set this to e.g. 0.5 to increase the number called of genes? > > -Florian > >> Am 25.04.2016 um 17:30 schrieb Carson Holt : >> >> If you?re running with keep_preds=0, then you either passed in models with model_gff (always kept even without evidence support), or you are parsing out the AED of non-gene reference models from the GFF3 when building your CDF graph. If that is the case, make sure you only pull AED off of features labeled as mRNA in column 2 of the GFF3 and not match features. >> >> ?Carson >> >> >>> On Apr 25, 2016, at 9:05 AM, Florian wrote: >>> >>> >>> Hi Mike, >>> >>> We have run MAKER with keep_preds=0. For completeness I attached the options file we used. We used a SNAP model trained on CEGMA data, GeneMark and Augustus trained with their webservice for the first run and then iterated on the results. >>> >>> We expect around 17.000-18.000 genes, but our annotation contains ~12.5k according to SOBAcl. 
>>> If I remove ~40% with AED values of 1 I will be left with very few compared to the expected number.
>>>
>>> type X file type (count)
>>> =========================================================================================================
>>> |                        |../v2_second_round_functional_blast.gff|../v2_third_round_functional_blast.gff|
>>> =========================================================================================================
>>> |CDS                     | 63953                                 | 65160                                |
>>> |contig                  | 5292                                  | 5292                                 |
>>> |exon                    | 60381                                 | 61233                                |
>>> |expressed_sequence_match| 275160                                | 275160                               |
>>> |five_prime_UTR          | 9424                                  | 8764                                 |
>>> |gene                    | 12654                                 | 12235                                |
>>> |mRNA                    | 13698                                 | 13137                                |
>>> |match                   | 146111                                | 136852                               |
>>> |match_part              | 1704978                               | 1697601                              |
>>> |protein_match           | 421814                                | 421814                               |
>>> |three_prime_UTR         | 6894                                  | 6325                                 |
>>> ---------------------------------------------------------------------------------------------------------
>>>
>>> regards,
>>> Florian
>>>
>>>> On 25.04.2016 16:16, Campbell, Michael wrote:
>>>> Hi Florian,
>>>>
>>>> You're not off topic here. I've attached the paper.
>>>>
>>>> Looking at the plot you sent, I'm guessing that there is a red dot right underneath the turquoise dot at (1,1); that would be consistent with the compare annotation script output. Do you have keep_preds=1 set in the maker_opts.ctl file? If so, that would explain the abundance of AED=1 annotations. When keep_preds is set to 1 all of the gene predictions are reported as gene models; when keep_preds is set to 0 only the models with evidence support are reported. Also, how many genes are you expecting and how many are you getting?
>>>>
>>>> The paper I attached goes over different approaches to building final gene sets. The plot attached suggests to me that you have a bunch of unsupported gene models that need to be cleaned out. I will commonly filter out any gene model with an AED of 1 unless it has a protein family domain. This will almost certainly bring the fraction of annotated gene models with an AED <0.5 up to around 90% or more.
>>>>
>>>> As annotations improve you do usually see fewer total genes, but they are longer.
>>>>
>>>> One of the best ways to get a feel for annotation quality is to load the annotations into a browser like Apollo or JBrowse and look at a few of your favorite genes.
>>>>
>>>> Thanks,
>>>> Mike
>>>>
>>>> On Apr 25, 2016, at 7:55 AM, Florian wrote:
>>>>
>>>> Hello All,
>>>>
>>>> First off, thank you all for your input! I took a look at all your suggestions and have some questions:
>>>>
>>>> The SOBAcl tool is nice but I can't seem to find a way to get to the AED values MAKER produces. For example here is a line from my GFF file:
>>>>
>>>> scaffold2278_size3634 maker mRNA 124 2128 . - .
>>>> ID=CRIP_012390-RA;Parent=CRIP_012390;Name=CRIP_012390-RA;Alias=maker-scaffold2278_size3634-augustus-gene-0.3-mRNA-1;_AED=0.16;_QI=0|1|0.33|1|1|1|3|0|574;_eAED=0.16;Note=Similar to Tbc1d25: TBC1 domain family member 25 (Mus musculus);
>>>>
>>>> Notice the _AED entry is in the 9th "field" combined with all the other descriptive data. Is there a way to get to this? The information about number and mean/distribution of length of genes, while certainly valuable, is hard for me to interpret. How would one classify improvement? More genes annotated? Fewer genes but longer averages?
>>>>
>>>> For the moment I will take a look at GAL, though Perl is not my strongest language.
>>>>
>>>> For the scripts Michael provided I have attached the results. It would be great if you could send me a PDF version of the paper you mentioned.
>>>>
>>>> The comparison script lists SN/SP/AC with >98%, which indicates there should be no big changes between annotations, right? But the cumulative AED graph shows a LOT of entries have an AED value of 1, which would indicate the exact opposite?
>>>>
>>>> You said 95% with less than 0.5 AED would be pretty good, so only ~55% would mean this is a pretty bad annotation?
>>>>
>>>> I am not sure if this is maybe too far off topic for the maker mailing list, but thank you for any clarification / input.
>>>>
>>>> kind regards,
>>>> Florian
>>>>
>>>> On 20.04.2016 15:16, Campbell, Michael wrote:
>>>>
>>>> I suspect the Jaccard distance would let you see the annotation sets converging over iterations. The distance between run one and run three should be greater than the distance between run one and two or run two and three.
>>>>
>>>> MAKER calculates a modified Jaccard distance between the MAKER-generated gene models and the aligned evidence, called Annotation Edit Distance or AED. Comparing the distribution of AEDs between annotations is a way to tell which annotation set matches the evidence best.
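
The AED comparison described above could be sketched roughly as follows. This is only a minimal Python sketch, not MAKER's AED_cdf_generator.pl or SOBAcl: it assumes a tab-separated GFF3 with the feature type in column 3 and an `_AED=` key in column 9, as in the mRNA line quoted earlier in this thread. The function names and the `match` example line are made up for illustration.

```python
def parse_attributes(column9):
    """Turn a GFF3 attribute string (column 9) into a {key: value} dict."""
    attributes = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


def mrna_aeds(gff_lines):
    """Collect the _AED value of every mRNA feature, ignoring everything else."""
    aeds = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        columns = line.rstrip("\n").split("\t")
        # Column 3 (index 2) is the feature type, column 9 (index 8) the attributes.
        # Lines without exactly nine columns (e.g. a trailing FASTA section) are skipped.
        if len(columns) != 9 or columns[2] != "mRNA":
            continue
        attributes = parse_attributes(columns[8])
        if "_AED" in attributes:
            aeds.append(float(attributes["_AED"]))
    return aeds


def fraction_below(aeds, threshold=0.5):
    """Fraction of AED values strictly below the threshold (0.0 for an empty list)."""
    if not aeds:
        return 0.0
    return sum(1 for aed in aeds if aed < threshold) / len(aeds)


# The mRNA line is the one quoted in this thread (attributes shortened);
# the match line is invented to show that non-mRNA features are ignored.
EXAMPLE = [
    "##gff-version 3",
    "scaffold2278_size3634\tmaker\tmRNA\t124\t2128\t.\t-\t.\t"
    "ID=CRIP_012390-RA;Parent=CRIP_012390;_AED=0.16;_eAED=0.16;",
    "scaffold2278_size3634\tblastn\tmatch\t124\t2128\t.\t-\t.\tID=hypothetical_match;_AED=1.00",
]

values = mrna_aeds(EXAMPLE)
print(len(values), "mRNA feature(s);", fraction_below(values), "of them with AED < 0.5")
```

For a real run one would pass an open file handle instead of `EXAMPLE`, once per MAKER iteration, and compare the resulting fractions.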
>>>> As a rule of thumb an annotation set is pretty good if greater than ~95% of the annotations have an AED less than 0.5.
>>>>
>>>> There is an accessory script in the MAKER bin called AED_cdf_generator.pl that helps in comparing AED scores. This script is mentioned in the protocols paper Carson mentioned. This paper also describes using protein family domains and homology to manually curated proteins in SwissProt as quality metrics. Here is a link to the paper. Let me know if you need me to send you a PDF.
>>>> http://onlinelibrary.wiley.com/doi/10.1002/0471250953.bi0411s48/abstract
>>>>
>>>> I also have a "use at your own risk" script on GitHub that I use to compare MAKER runs two at a time. The script is called compare_annotations_3.2.pl. This particular script has had a long evolution, so it is a little hard to follow the code, but it might be helpful.
>>>> https://github.com/mscampbell/Genome_annotation
>>>>
>>>> The SOBA tool that Barry mentioned is a lot more flexible, and if you are familiar with Perl the GAL library does a lot of heavy lifting for you.
>>>>
>>>> Mike
>>>>
>>>> On Apr 19, 2016, at 5:44 PM, Cook, Malcolm wrote:
>>>>
>>>> Just a quick thought
>>>>
>>>> The smallest summary of what you're after might be the Jaccard difference between your annotations as computed by bedtools:
>>>> http://bedtools.readthedocs.org/en/latest/content/tools/jaccard.html
>>>>
>>>> From: maker-devel [mailto:maker-devel-bounces at yandell-lab.org] On Behalf Of Barry Moore
>>>> Sent: Tuesday, April 19, 2016 4:37 PM
>>>> To: Florian; maker-devel
>>>> Cc: Campbell, Michael
>>>> Subject: Re: [maker-devel] A way to compare 2 annotation runs?
>>>>
>>>> The Sequence Ontology provides some tools for this:
>>>>
>>>> SOBAcl has some pre-configured reports/graphs with some flexibility to modify their content/layout.
>>>> https://github.com/The-Sequence-Ontology/SOBA
>>>>
>>>> This simple example provides a table for two GFF3 files of the count of feature types:
>>>>
>>>> SOBAcl --columns file --rows type --data type --data_type count \
>>>>     data/dmel-all-r5.30_0001000.gff data/dmel-all-r5.30_0010000.gff
>>>>
>>>> More complex examples are available in the test file SOBA/t/sobacl_test.sh
>>>>
>>>> The GAL library is a Perl library that works well with MAKER output and other valid GFF3 documents. It has some scripts that would provide metrics along the lines of what you're looking for, but it is primarily a programming library to make it easy to roll your own.
>>>> https://github.com/The-Sequence-Ontology/GAL
>>>>
>>>> If you're OK with a little bit of Perl code, by modifying the synopsis code in the README a bit you can generate the splice complexity metrics described here (http://www.ncbi.nlm.nih.gov/pubmed/19236712):
>>>>
>>>> use GAL::Annotation;
>>>> my $annot = GAL::Annotation->new(qw(file.gff file.fasta));
>>>> my $features = $annot->features;
>>>>
>>>> my $genes = $features->search( {type => 'gene'} );
>>>> while (my $gene = $genes->next) {
>>>>     print $gene->feature_id . "\t";
>>>>     print $gene->splice_complexity . "\n";
>>>> }
>>>>
>>>> Hope that helps,
>>>>
>>>> Barry
>>>>
>>>> On Apr 19, 2016, at 9:08 AM, Carson Holt wrote:
>>>>
>>>> I'm going to ask Michael Campbell to answer this. He wrote a protocols paper that will help.
>>>>
>>>> --Carson
>>>>
>>>> On Apr 19, 2016, at 6:08 AM, Florian wrote:
>>>>
>>>> Hello All,
>>>>
>>>> We ran MAKER on a newly assembled genome for 3 iterations, since 2 seems to be the recommended standard and while on holiday I just ran it a third time. Now I want to compare the results of the iterations to see where the annotation (hopefully) improved/changed, but I can't really come up with a clever way to do this.
>>>>
>>>> I reckon this has to be an often-solved problem, though I couldn't find a solution except an older entry in this mailing list, and that wasn't helpful.
>>>>
>>>> So how are people assessing the quality of a maker run? How do you say one run was 'better' than another?
>>>>
>>>> best regards & thanks for your input,
>>>> Florian
>>>>
>>>> _______________________________________________
>>>> maker-devel mailing list
>>>> maker-devel at yandell-lab.org
>>>> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org

From bmoore at genetics.utah.edu Mon Apr 25 13:46:23 2016
From: bmoore at genetics.utah.edu (Barry Moore)
Date: Mon, 25 Apr 2016 19:46:23 +0000
Subject: [maker-devel] A way to compare 2 annotation runs?
In-Reply-To: <571E05B8.5080508@students.uni-mainz.de>
References: <57161FB2.30901@students.uni-mainz.de> <148F7C5C-38D9-4F2C-BA0C-010E133B7F34@gmail.com> <571E05B8.5080508@students.uni-mainz.de>
Message-ID: <349A414A-BA65-420E-9A39-5B3583993AB9@genetics.utah.edu>

Hi Florian,

Something like this should work:

SOBAcl --data +_AED t/data/refseq_short.gff3 --data_type mean

Sorry, this feature was undocumented and I discovered a bug in it while I was looking at it just now, so you'll need to pull an update from git for it to work correctly. Basically if you add a '+'
to the value passed to --data, SOBAcl will treat the --data value (with the + removed) as a key to look up the value in the attributes from column 9, so if +_AED is given on the command line then the value of the _AED attribute will be used for the summary statistics. Note that if the attribute is missing for a given feature then 0 is used as the value (which is of course different than treating it as NULL). Also note that if --data_type is count then features that have the given attribute are counted regardless of the value of the attribute.

Just FYI, grabbing those values with a GAL script would look like this (untested):

use GAL::Annotation;
my $annot = GAL::Annotation->new(qw(file.gff));
my $features = $annot->features;

my $mRNAs = $features->search( {type => 'mRNA'} );
while (my $mRNA = $mRNAs->next) {
    print $mRNA->feature_id;
    print "\t";
    print $mRNA->attribute_value('_AED');
    print "\n";
}

B

> On Apr 25, 2016, at 5:55 AM, Florian wrote:
>
> Hello All,
>
> First off, thank you all for your input! I took a look at all your suggestions and have some questions:
>
> The SOBAcl tool is nice but I can't seem to find a way to get to the AED values MAKER produces. For example here is a line from my GFF file:
>
> scaffold2278_size3634 maker mRNA 124 2128 . - . ID=CRIP_012390-RA;Parent=CRIP_012390;Name=CRIP_012390-RA;Alias=maker-scaffold2278_size3634-augustus-gene-0.3-mRNA-1;_AED=0.16;_QI=0|1|0.33|1|1|1|3|0|574;_eAED=0.16;Note=Similar to Tbc1d25: TBC1 domain family member 25 (Mus musculus);
>
> Notice the _AED entry is in the 9th "field" combined with all the other descriptive data. Is there a way to get to this? The information about number and mean/distribution of length of genes, while certainly valuable, is hard for me to interpret. How would one classify improvement? More genes annotated? Fewer genes but longer averages?
>
> For the moment I will take a look at GAL, though Perl is not my strongest language.
>
> For the scripts Michael provided I have attached the results. It would be great if you could send me a PDF version of the paper you mentioned.
>
> The comparison script lists SN/SP/AC with >98%, which indicates there should be no big changes between annotations, right? But the cumulative AED graph shows a LOT of entries have an AED value of 1, which would indicate the exact opposite?
>
> You said 95% with less than 0.5 AED would be pretty good, so only ~55% would mean this is a pretty bad annotation?
>
> I am not sure if this is maybe too far off topic for the maker mailing list, but thank you for any clarification / input.
>
> kind regards,
> Florian

From bmoore at genetics.utah.edu Mon Apr 25 21:04:23 2016
From: bmoore at genetics.utah.edu (Barry Moore)
Date: Tue, 26 Apr 2016 03:04:23 +0000
Subject: [maker-devel] BUSCO
References:
Message-ID: <6C6AD04A-CAC0-40CA-B3C3-C42E2D11945A@genetics.utah.edu>

I'm posting this message to the mailing list on behalf of Ian Misner.

Ian, sorry your message and subscription request haven't gone through.
The ISP that supports all of our mailing lists including maker is having issues with the mailman software that they can't seem to resolve, so we currently can't approve held messages or add new subscribers. We're in the process of working out a new mailing list option. Thanks for your patience!

Begin forwarded message:

Hello,

Are there any guidelines for using BUSCO to help train MAKER? CEGMA has been discontinued, but I used to use the cegma2zff.pl steps to use those proteins as a training step. BUSCO seems to train Augustus, but I'm not sure what file to pass from BUSCO to MAKER for this to be properly utilized. I didn't see anything specific about this in the archives.

-----
Ian Misner, Ph.D.
Computational Genomics Specialist
Contractor, Medical Science and Computing, Inc.
Bioinformatics and Computational Biosciences Branch (BCBB)
NIH/NIAID/OD/OSMO/OCICB
5601 Fishers Lane, Room 4A59
Rockville, MD 20892
Office: 301-761-6208
Mobile: 301-704-0151
Email: ian.misner at nih.gov
Web: BCBB Home Page
Twitter: @NIAIDBioIT

Disclaimer: The information in this e-mail and any of its attachments is confidential and may contain sensitive information. It should not be used by anyone who is not the original intended recipient. If you have received this e-mail in error please inform the sender and delete it from your mailbox or any other storage devices. National Institute of Allergy and Infectious Diseases shall not accept liability for any statements made that are sender's own and not expressly made on behalf of the NIAID by one of its representatives.

From bmoore at genetics.utah.edu Mon Apr 25 21:12:15 2016
From: bmoore at genetics.utah.edu (Barry Moore)
Date: Tue, 26 Apr 2016 03:12:15 +0000
Subject: [maker-devel] maker-revel mailing list problems
Message-ID: <7157D2ED-8F5A-4B62-BA71-6DF43831FC60@genetics.utah.edu>

Hi all,

Just wanted to give everyone a heads up that we're experiencing problems with our mailing list server. Our mailing lists are supplied by an external ISP and the lists and support have been great for years, but lately the admin/moderator interface won't allow us to approve any messages flagged for moderation or approve any new subscribers. This won't affect most of you receiving this, as all non-moderated traffic seems to be unaffected, but if you notice problems please let one of the moderators know directly:

Carson Holt
Michael Campbell
Barry Moore

We're in the process of finding and migrating to a new mailing list server. We'll do our best to minimize disruption and let you know as soon as we have a new system in place. Thanks for your patience.

Barry Moore

From xvazquezc at gmail.com Mon Apr 25 21:17:46 2016
From: xvazquezc at gmail.com (Xabier Vázquez Campos)
Date: Tue, 26 Apr 2016 13:17:46 +1000
Subject: [maker-devel] BUSCO
In-Reply-To: <6C6AD04A-CAC0-40CA-B3C3-C42E2D11945A@genetics.utah.edu>
References: <6C6AD04A-CAC0-40CA-B3C3-C42E2D11945A@genetics.utah.edu>
Message-ID:

With Augustus installed, BUSCO will generate the training files in the Augustus species folder. Afterwards you only need to indicate the species profile in the Maker config file as usual.

The BUSCO developers say that the long run produces a better profile and should be used if you run the program to train Augustus.

This is the command I used:

python3 BUSCO_v1.1b1.py -f -c 8 --long -o Genus_species -in /PATH/TO/ASSEMBLY/contigs.fa -l /PATH/TO/PROFILE/fungi -m genome

On 26 April 2016 at 13:04, Barry Moore wrote:

> I'm posting this message to the mailing list on behalf of Ian Misner.
>
> Ian, sorry your message and subscription request haven't gone through. The ISP that supports all of our mailing lists including maker is having issues with the mailman software that they can't seem to resolve, so we currently can't approve held messages or add new subscribers. We're in the process of working out a new mailing list option. Thanks for your patience!
>
> Begin forwarded message:
>
> Hello,
>
> Are there any guidelines for using BUSCO to help train MAKER? CEGMA has been discontinued, but I used to use the cegma2zff.pl steps to use those proteins as a training step. BUSCO seems to train Augustus, but I'm not sure what file to pass from BUSCO to MAKER for this to be properly utilized. I didn't see anything specific about this in the archives.

--
Xabier Vázquez-Campos, *PhD*
*Research Associate*
Water Research Centre
School of Civil and Environmental Engineering
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

From dence at genetics.utah.edu Wed Apr 27 12:16:28 2016
From: dence at genetics.utah.edu (Daniel Ence)
Date: Wed, 27 Apr 2016 18:16:28 +0000
Subject: [maker-devel] Maker example data for 2013 GMOD summer school
In-Reply-To: <8F27CEB4-B16B-4BDC-BA11-5FCCBD05BC3C@ucr.edu>
References: <1772AAA1-C6ED-4FCA-B4C9-39F522D3D076@genetics.utah.edu> <8F27CEB4-B16B-4BDC-BA11-5FCCBD05BC3C@ucr.edu>
Message-ID: <2572DB54-6C29-483E-AAAB-7626FEE76DFC@genetics.utah.edu>

Hi Qihua,

In the maker_opts.ctl file there is an option 'cpus' which allows you to tell blast to use more than 1 cpu. The comment for the line says that you should not set this higher than 1 when using MPI. I believe that the reason for this is that each thread runs blast on its own, so the number of cpus used will be the number of MPI threads x the number of cpus for blast, which can quickly get larger than the number of cpus available.

At the same time, it's usually not advisable to use tblastx to align large datasets because of the increased amount of time it takes. Are these RNAseq datasets from another species that you're using tblastx for?

Daniel Ence
Graduate Student
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330

> On Apr 27, 2016, at 12:06 PM, Qihua Liang wrote:
>
> Hi, Daniel
>
> I have been using Maker to annotate the cowpea genome for a while, but now I am wondering if I could use multiple threads instead of a single one. It has been running tblastx for such a long time using a single thread.
> But I couldn't find such a setting in the documentation to assign multiple threads to run Maker. Is there such an option?
>
> Thank you
> Qihua
>
>> On Mar 30, 2016, at 2:17 PM, Daniel Ence wrote:
>>
>> Hi Qihua,
>>
>> I believe that most of the data we used in the tutorials are available in the maker/data directory, which is included in all maker distributions. Please let me know if that isn't the case.
>>
>> ~Daniel
>>
>> Daniel Ence
>> Graduate Student
>> Eccles Institute of Human Genetics
>> University of Utah
>> 15 North 2030 East, Room 2100
>> Salt Lake City, UT 84112-5330
>>
>>> On Mar 30, 2016, at 3:10 PM, Qihua Liang wrote:
>>>
>>> Hi Michael and Daniel,
>>>
>>> I am a graduate student at UC Riverside, and recently I have been learning to use Maker for genome annotation. I was trying to find some tutorials to follow and practice on example data, and I found out that you gave a talk on Maker during the 2013 GMOD summer school, and the tutorial for that is very detailed. Nice job!
>>>
>>> But the example data under the folder you mentioned as ./maker/maker_course is not provided on the website, and I am wondering if it is available to the public or not. If yes, could you send me those materials so that I could follow your tutorial to practice using Maker?
>>>
>>> Thank you
>>> Best
>>> Qihua

From carsonhh at gmail.com Wed Apr 27 12:17:22 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 27 Apr 2016 12:17:22 -0600
Subject: [maker-devel] Maker example data for 2013 GMOD summer school
In-Reply-To: <8F27CEB4-B16B-4BDC-BA11-5FCCBD05BC3C@ucr.edu>
References: <1772AAA1-C6ED-4FCA-B4C9-39F522D3D076@genetics.utah.edu> <8F27CEB4-B16B-4BDC-BA11-5FCCBD05BC3C@ucr.edu>
Message-ID: <5ED1E884-9203-4409-8298-39F1D19C0CC0@gmail.com>

Use maker with MPI. MPI does not just have to be on a cluster; it can be installed on a local computer or server (you probably already have it installed and don't realize it).
Instructions on how to set up MAKER with MPI are in the README and INSTALL files in the download.

Example command (on a single machine 16 core server):

mpiexec -n maker
mpiexec -n 16 maker

Run across multiple machines (ten 16 core servers):

mpiexec -hostfile -n maker
mpiexec -hostfile ip_list -n 160 maker

The second option requires a network-mounted working directory accessible to all machines.

--Carson

> On Apr 27, 2016, at 12:06 PM, Qihua Liang wrote:
>
> Hi, Daniel
>
> I have been using Maker to annotate the cowpea genome for a while, but now I am wondering if I could use multiple threads instead of a single one. It has been running tblastx for such a long time using a single thread. But I couldn't find such a setting in the documentation to assign multiple threads to run Maker. Is there such an option?
>
> Thank you
> Qihua
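
The sizing rule behind the mpiexec examples in this thread (MPI processes times the 'cpus' setting should not exceed the cores available) could be sketched as below. This is a rough Python illustration only, not part of MAKER; the function names are hypothetical, and it assumes the only flags needed are the `-n` and `-hostfile` options shown in the commands above.

```python
def mpi_process_count(total_cores, cpus_per_process=1):
    """Largest process count n such that n * cpus_per_process <= total_cores."""
    if total_cores < 1 or cpus_per_process < 1:
        raise ValueError("core counts must be positive integers")
    return max(1, total_cores // cpus_per_process)


def mpiexec_command(total_cores, cpus_per_process=1, hostfile=None):
    """Build an mpiexec command line of the same shape as the examples above."""
    command = ["mpiexec"]
    if hostfile is not None:
        # Multi-machine runs list their hosts in a file, e.g. ip_list.
        command += ["-hostfile", hostfile]
    processes = mpi_process_count(total_cores, cpus_per_process)
    command += ["-n", str(processes), "maker"]
    return " ".join(command)


# One 16-core server, cpus=1 in maker_opts.ctl.
print(mpiexec_command(16))
# Ten 16-core servers sharing a network-mounted working directory.
print(mpiexec_command(160, hostfile="ip_list"))
# If cpus were set to 4, only a quarter as many MPI processes fit.
print(mpiexec_command(16, cpus_per_process=4))
```

Keeping cpus at 1 and letting `-n` absorb all the cores matches the advice given in this thread; the third call only shows how quickly the product grows otherwise.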

From hcma at uci.edu Wed Apr 27 19:04:29 2016
From: hcma at uci.edu (hcma)
Date: Wed, 27 Apr 2016 18:04:29 -0700
Subject: [maker-devel] Augustus training for new species
Message-ID: <4c7e0e58e9b55798bd255238f8ff9ae2@uci.edu>

Hi,

I would like to use Maker to generate a set for training Augustus for a new species. The steps for training SNAP are well documented, but I am still confused as to how to train Augustus using the AugustusWeb.

I have used fathom and forge to generate 'export.ann' and 'export.dna'. So what I need to do next is to run zff2augustus_gbk.pl in the directory that has the export.ann and export.dna files?

Then I feed the train.gb file to AugustusWeb?

Please advise.

Thanks
Karen

From xvazquezc at gmail.com Wed Apr 27 19:14:35 2016
From: xvazquezc at gmail.com (Xabier Vázquez Campos)
Date: Thu, 28 Apr 2016 11:14:35 +1000
Subject: [maker-devel] Augustus training for new species
In-Reply-To: <4c7e0e58e9b55798bd255238f8ff9ae2@uci.edu>
References: <4c7e0e58e9b55798bd255238f8ff9ae2@uci.edu>
Message-ID:

Is it a plant genome?

If it isn't, use BUSCO. It will do the whole training in a single step. It will take your assembly fasta file and generate the species profile in the Augustus species folder. See previous thread:
https://groups.google.com/forum/#!topic/maker-devel/vp8R06VVQGQ

If you have a plant genome, use "zff2augustus_gbk.pl". I have this in my files:

> This will take the export.dna generated by fathom and generate a *.gb file
> that will be used as "training gene structure file" in a new training
> submission in WebAugustus, but remember to give it a new name in the
> submission, e.g.
> MYGENOME_v2, or Maker won't see the difference (same name)*:
>
> perl PATH/TO/SCRIPT/zff2augustus_gbk.pl > MYGENOME.train.gb
>
> *this applies if you do a re-run of Augustus within Maker

On 28 April 2016 at 11:04, hcma wrote:

> Hi,
>
> I would like to use Maker to generate a set for training Augustus for a new species. The steps for training SNAP are well documented, but I am still confused as to how to train Augustus using the AugustusWeb.
>
> I have used fathom and forge to generate 'export.ann' and 'export.dna'. So what I need to do next is to run zff2augustus_gbk.pl in the directory that has the export.ann and export.dna files?
>
> Then I feed the train.gb file to AugustusWeb?
>
> Please advise.
>
> Thanks
> Karen

From xvazquezc at gmail.com Wed Apr 27 19:55:13 2016
From: xvazquezc at gmail.com (Xabier Vázquez Campos)
Date: Thu, 28 Apr 2016 11:55:13 +1000
Subject: [maker-devel] error with ipr_update_gff ?
Message-ID:

Hi,

I'm following the steps in the post-processing of annotations from the 2014 GMOD tutorial, but when using ipr_update_gff I get loads of errors such as those below:

Use of uninitialized value $method in string eq at /share/apps/maker/2.31.6/bin/ipr_update_gff line 190, <$IN> line 228738.
Use of uninitialized value $gene_id in hash element at /share/apps/maker/2.31.6/bin/ipr_update_gff line 203, <$IN> line 228738.

Is this normal?

Thanks,
Xabier

--
Xabier Vázquez-Campos, *PhD*
*Research Associate*
Water Research Centre
School of Civil and Environmental Engineering
The University of New South Wales
Sydney NSW 2052 AUSTRALIA
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From jacqueline.atkins at nih.gov Thu Apr 28 12:55:30 2016
From: jacqueline.atkins at nih.gov (Atkins, Jacqueline (NIH/NIAID) [C])
Date: Thu, 28 Apr 2016 18:55:30 +0000
Subject: [maker-devel] Segmenation Error
Message-ID:

Hi Everyone,

I have a user who is reporting a segmentation error.. I am not really even sure where to start. Not sure if this is related to config issues or the way in which the software is being executed. Any advice would be greatly appreciated.

Here is the command

mpiexec -n 50 maker maker_opts_run1.ctl maker_bopts.ctl maker_exe.ctl

--Next Contig--
examining contents of the fasta file and run log
examining contents of the fasta file and run log
[ai-hpcn063:99111] *** Process received signal ***
[ai-hpcn063:99111] Signal: Segmentation fault (11)
[ai-hpcn063:99111] Signal code: Address not mapped (1)
[ai-hpcn063:99111] Failing at address: (nil)
examining contents of the fasta file and run log
[ai-hpcn053:119610] *** Process received signal ***
[ai-hpcn053:119610] Signal: Segmentation fault (11)
[ai-hpcn053:119610] Signal code: Address not mapped (1)
[ai-hpcn053:119610] Failing at address: (nil)
[ai-hpcn053:119610] [ 0] /lib64/libc.so.6(+0x35a00)[0x2aaaab85ca00]
[ai-hpcn053:119610] *** End of error message ***
examining contents of the fasta file and run log
examining contents of the fasta file and run log
examining contents of the fasta file and run log
examining contents of the fasta file and run log
examining contents of the fasta file and run log
[ai-hpcn063:99111] [ 0] /lib64/libc.so.6(+0x35a00)[0x2aaaab85ca00]
[ai-hpcn063:99111] *** End of error message ***
examining contents of the fasta file and run log
examining contents of the fasta file and run log
examining contents of the fasta file and run log
examining contents of the fasta file and run log
examining contents of the fasta file and run log
examining contents of the fasta file and run log
examining contents of the fasta file and run log
examining contents of the fasta file and run log
examining contents of the fasta file and run log
examining contents of the fasta file and run log
examining contents of the fasta file and run log

___________________________________________
Jacqueline Atkins, Contractor
Sr. HPC Engineer
National Institute of Allergy and Infectious Diseases
SRA International Inc., A CSRA Company
office 301-451-9644, mobile 301-767-7110
5601 Fishers Lane, 6A60, Bethesda, MD 20852

Disclaimer: The information in this e-mail and any of its attachments is confidential and may contain sensitive information. It should not be used by anyone who is not the original intended recipient. If you have received this e-mail in error please inform the sender and delete it from your mailbox or any other storage devices. National Institute of Allergy and Infectious Diseases shall not accept liability for any statements made that are sender's own and not expressly made on behalf of the NIAID by one of its representatives.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From maker-devel at yandell-lab.org Fri Apr 29 12:54:07 2016
From: maker-devel at yandell-lab.org (maker-devel)
Date: Sat, 30 Apr 2016 00:24:07 +0530
Subject: [maker-devel] hi prnt
Message-ID:

A non-text attachment was scrubbed...
Name: not available
Type: multipart/alternative
Size: 1 bytes
Desc: not available
URL:
-------------- next part --------------
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From simon.blanchoud at otago.ac.nz Tue Apr 5 18:15:14 2016
From: simon.blanchoud at otago.ac.nz (Simon Blanchoud)
Date: Wed, 06 Apr 2016 00:15:14 -0000
Subject: [maker-devel] ncRNA predictions
Message-ID: <5704550C.8010602@otago.ac.nz>

Hi all,

I have been annotating ab initio my de novo assembly of the Botrylloides leachi genome with MAKER 2.31.8 for some time now (3rd round running as I write). For this last round, I also wanted to get some predictions for non-coding RNAs, as mentioned in maker_opts.ctl. Now that this (seems to) work properly, I thought I should share with you a few issues I faced.

First of all, both tRNAscan-SE and snoscan have really, really limited documentation (which I know is none of your business), which makes things a bit trickier.

Second, snoscan requires an rRNA file to work (not very obvious from maker_opts.ctl), and it turns out that there is a hard-coded limit in snoscan of 100 sequences for that rRNA file (not that the error message is helpful either). Overall, this was not exactly practical as I'm assembling a de novo genome, and thus do not have these rRNA sequences. What I did (and it seems to work okay) was to pull out the closest sequences I could find from the Rfam database. By combining the information from their website on the RF families, the taxonomy.txt file and the corresponding fasta files (all from their FTP site), I extracted (for a eukaryotic organism, that is) one complete sequence for each subunit, i.e. RF00001, RF00002, RF01960 and RF02543. It turns out that pooling more than one sequence per subunit makes it extremely slow to run. You might know a better approach for getting such an rRNA file, but it does look like a pretty sound approach to me, and might deserve a comment in maker_opts.ctl.

Third, once snoscan was running, I ran into the same issue as https://groups.google.com/d/topic/maker-devel/E6BKjXx2ra0/discussion i.e. the parsing of the snoscan output crashed. After (quite) some debugging, I found out that there is an issue in the creation of the hash table containing the hits. As I am not sure how you wanted to organize them originally, I made a wild guess and re-wrote this section of the Widget. So it might not group the hits as you wanted, but at least it now runs properly (and the output appears quite correct to me). I've attached the Widget.

Otherwise, thanks heaps for all the hard work, it's an amazing tool and it does work great!

Cheers,
Simon
-------------- next part --------------
A non-text attachment was scrubbed...
Name: snoscan.pm
Type: text/x-perl-script
Size: 8128 bytes
Desc: not available
URL:

From wangyugui.wei at gmail.com Sat Apr 9 09:35:22 2016
From: wangyugui.wei at gmail.com (Yugui Wang)
Date: Sat, 09 Apr 2016 15:35:22 -0000
Subject: [maker-devel] Segmentation fault of MKAER with openmpi on CentOS 7.2
Message-ID:

Hi.

MAKER segfaults with openmpi on CentOS 7.2. Both MAKER 2.31.8 and 3.00.0 beta have the same error.

$ mpirun -mca btl ^openib -n 4 maker
STATUS: Parsing control files...
STATUS: Processing and indexing input FASTA files...
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 39507 on node T620 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

$ file core.39505
core.39505: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/bin/perl /bio/hpc-bio/maker-3.00.0/bin/make

$ gdb /usr/bin/perl core.39505
(gdb) where
#0 0x00007f0e4a7d2060 in ?? ()
#1
#2 0x00007f0e4a7d2060 in ??
()
#3
#4 0x00007f0e4bdfba50 in mca_btl_vader_component_progress () from /usr/lib64/openmpi/lib/openmpi/mca_btl_vader.so
#5 0x00007f0e63ec8eda in opal_progress () from /usr/lib64/openmpi/lib/libopen-pal.so.13
#6 0x00007f0e4a191ac5 in mca_pml_ob1_probe () from /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so
#7 0x00007f0e65b0dc06 in PMPI_Probe () from /usr/lib64/openmpi/lib/libmpi.so
#8 0x00007f0e59007020 in C_MPI_Recv (buf=buf@entry=0x4146b30, source=source@entry=-1, tag=tag@entry=1111) at MPI.xs:56
#9 0x00007f0e590071e3 in XS_Parallel__Application__MPI_C_MPI_Recv (my_perl=, cv=) at MPI.c:391
#10 0x00007f0e657ce39f in Perl_pp_entersub () from /usr/lib64/perl5/CORE/libperl.so
#11 0x00007f0e657c6b16 in Perl_runops_standard () from /usr/lib64/perl5/CORE/libperl.so
#12 0x00007f0e65763925 in perl_run () from /usr/lib64/perl5/CORE/libperl.so
#13 0x0000000000400d99 in main ()

$ echo $LD_PRELOAD
/usr/lib64/openmpi/lib/libmpi.so:

$ echo $OMPI_MCA_mpi_warn_on_fork
0

$ rpm -qa openmpi
openmpi-1.10.0-10.el7.x86_64

$ uname -a
Linux T620 3.10.0-327.13.1.el7.x86_64 #1 SMP Thu Mar 31 16:04:38 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

$ ulimit -a
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 1029973
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 102400
cpu time (seconds, -t) unlimited
max user processes (-u) 4096
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

$ mpiexec --version
mpiexec (OpenRTE) 1.10.0

Report bugs to http://www.open-mpi.org/community/help/
$

From h.lee12 at uq.edu.au Tue Apr 12 21:05:12 2016
From: h.lee12 at uq.edu.au (Jenny Lee)
Date: Wed, 13 Apr 2016 03:05:12 -0000
Subject: [maker-devel] Reformat maker gff3
Message-ID: <1460516670248.1644@uq.edu.au>

Hi all,

I would like to
update my maker gff3 file to only contain the genes I've decided to keep: all maker genes plus a subset of ab initio genes (those with InterProScan hits). I would also like to exclude the repeat information and retain only the CDS, gene, exon and mRNA features, like the format we usually see in published data. I've been trying to do this manually and it gets messy. Any ideas?

Thanks a lot.

Regards,
Jenny Lee
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From mcampbel at cshl.edu Wed Apr 20 07:16:43 2016
From: mcampbel at cshl.edu (Campbell, Michael)
Date: Wed, 20 Apr 2016 13:16:43 -0000
Subject: [maker-devel] A way to compare 2 annotation runs?
In-Reply-To:
References: <57161FB2.30901@students.uni-mainz.de> <148F7C5C-38D9-4F2C-BA0C-010E133B7F34@gmail.com>
Message-ID:

I suspect the Jaccard distance would let you see the annotation sets converging over iterations. The distance between run one and run three should be greater than the distance between run one and two or run two and three.

MAKER calculates a modified Jaccard distance, called Annotation Edit Distance (AED), between the MAKER-generated gene models and the aligned evidence. Comparing the distribution of AEDs between annotations is a way to tell which annotation set matches the evidence best. As a rule of thumb, an annotation set is pretty good if more than ~95% of the annotations have an AED less than 0.5.

There is an accessory script in the MAKER bin called AED_cdf_generator.pl that helps in comparing AED scores. This script is mentioned in the protocols paper Carson mentioned. That paper also describes using protein family domains and homology to manually curated proteins in Swiss-Prot as quality metrics. Here is a link to the paper. Let me know if you need me to send you a pdf.
http://onlinelibrary.wiley.com/doi/10.1002/0471250953.bi0411s48/abstract

I also have a "use at your own risk" script on github that I use to compare MAKER runs two at a time. The script is called compare_annotations_3.2.pl. This particular script has had a long evolution, so the code is a little hard to follow, but it might be helpful.
https://github.com/mscampbell/Genome_annotation

The SOBA tool that Barry mentioned is a lot more flexible, and if you are familiar with perl the GAL library does a lot of heavy lifting for you.

Mike

On Apr 19, 2016, at 5:44 PM, Cook, Malcolm wrote:

Just a quick thought

The smallest summary of what you're after might be the Jaccard difference between your annotations as computed by bedtools:
http://bedtools.readthedocs.org/en/latest/content/tools/jaccard.html

From: maker-devel [mailto:maker-devel-bounces at yandell-lab.org] On Behalf Of Barry Moore
Sent: Tuesday, April 19, 2016 4:37 PM
To: Florian; maker-devel
Cc: Campbell, Michael
Subject: Re: [maker-devel] A way to compare 2 annotation runs?

The Sequence Ontology provides some tools for this:

SOBAcl has some pre-configured reports/graphs with some flexibility to modify their content/layout.
https://github.com/The-Sequence-Ontology/SOBA

This simple example provides a table for two GFF3 files of the count of feature types:

    SOBAcl --columns file --rows type --data type --data_type count \
        data/dmel-all-r5.30_0001000.gff data/dmel-all-r5.30_0010000.gff

More complex examples are available in the test file SOBA/t/sobacl_test.sh

The GAL library is a perl library that works well with MAKER output and other valid GFF3 documents. It has some scripts that would provide metrics along the lines of what you're looking for, but it is primarily a programming library to make it easy to roll your own.
https://github.com/The-Sequence-Ontology/GAL

If you're OK with a little bit of perl code, modifying the synopsis code in the README a bit makes the splice complexity metrics described here (http://www.ncbi.nlm.nih.gov/pubmed/19236712) easy to produce:

    use GAL::Annotation;

    # Load the annotation (GFF3) together with its genome sequence (FASTA)
    my $annot = GAL::Annotation->new(qw(file.gff file.fasta));
    my $features = $annot->features;

    # Iterate over all gene features and print ID and splice complexity
    my $genes = $features->search( {type => 'gene'} );
    while (my $gene = $genes->next) {
        print $gene->feature_id . "\t";
        print $gene->splice_complexity . "\n";
    }

Hope that helps,

Barry

On Apr 19, 2016, at 9:08 AM, Carson Holt wrote:

I'm going to ask Michael Campbell to answer this. He wrote a protocols paper that will help.

--Carson

On Apr 19, 2016, at 6:08 AM, Florian wrote:

Hello All,

We ran MAKER on a newly assembled genome for 3 iterations, since 2 seems to be the recommended standard and while on holiday I just ran it a third time. Now I want to compare the results of the iterations to see where the annotation (hopefully) improved/changed, but I can't really come up with a clever way to do this. I reckon this has to be an often-solved problem, though I couldn't find a solution except an older entry in this mailing list, and that wasn't helpful.

So how are people assessing the quality of a maker run? How do you say one run was 'better' than another?
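
To make the Jaccard comparison suggested earlier in this thread concrete, here is a minimal Python sketch that scores the base-pair overlap of gene features between two MAKER GFF3 files. It is only an illustration under stated assumptions: the function names are hypothetical, it is not part of MAKER, bedtools or GAL, and it assumes nine-column, tab-delimited GFF3 with `maker` in the source column and `gene` in the type column.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent (start, end) intervals (1-based, inclusive)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def gff3_intervals(lines, feature_type="gene", source="maker"):
    """Collect merged intervals per sequence for one feature type.

    Assumption: `lines` is an iterable of GFF3 lines (e.g. an open file).
    """
    by_seq = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[1] != source or cols[2] != feature_type:
            continue
        by_seq.setdefault(cols[0], []).append((int(cols[3]), int(cols[4])))
    return {seq: merge_intervals(iv) for seq, iv in by_seq.items()}


def covered(intervals):
    """Number of bases covered by a merged interval list."""
    return sum(end - start + 1 for start, end in intervals)


def intersection_length(a, b):
    """Bases shared by two sorted, merged interval lists."""
    i = j = total = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo <= hi:
            total += hi - lo + 1
        # Advance whichever interval ends first
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def jaccard(run_a, run_b):
    """Jaccard index (shared bases / bases in the union) of two runs."""
    inter = union = 0
    for seq in set(run_a) | set(run_b):
        a = run_a.get(seq, [])
        b = run_b.get(seq, [])
        shared = intersection_length(a, b)
        inter += shared
        union += covered(a) + covered(b) - shared
    return inter / union if union else 0.0
```

To use it, open the GFF3 files of two rounds, pass each file handle to `gff3_intervals`, and call `jaccard` on the two results. A value near 1 means the gene footprints of the two rounds barely differ; the Jaccard distance is 1 minus that value. This only measures positional agreement between runs and says nothing about which run matches the evidence better.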

best regards & thanks for your input,
Florian

_______________________________________________
maker-devel mailing list
maker-devel at yandell-lab.org
http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org

From mcampbel at cshl.edu Mon Apr 25 08:16:42 2016
From: mcampbel at cshl.edu (Campbell, Michael)
Date: Mon, 25 Apr 2016 14:16:42 -0000
Subject: [maker-devel] A way to compare 2 annotation runs?
In-Reply-To: <571E05B8.5080508@students.uni-mainz.de>
References: <57161FB2.30901@students.uni-mainz.de> <148F7C5C-38D9-4F2C-BA0C-010E133B7F34@gmail.com> <571E05B8.5080508@students.uni-mainz.de>
Message-ID: <3F81B34E-B7FC-4E37-AFA2-514AB2A397F1@cshl.edu>

Hi Florian,

You're not off topic here. I've attached the paper.

Looking at the plot you sent, I'm guessing that there is a red dot right underneath the turquoise dot at (1,1); that would be consistent with the compare annotation script output. Do you have keep_preds=1 set in the maker_opts.ctl file? If so, that would explain the abundance of AED=1 annotations. When keep_preds is set to 1, all of the gene predictions are reported as gene models; when keep_preds is set to 0, only the models with evidence support are reported. Also, how many genes are you expecting and how many are you getting?

The paper I attached goes over different approaches to building final gene sets. The plot attached suggests to me that you have a bunch of unsupported gene models that need to be cleaned out. I will commonly filter out any gene model with an AED of 1 unless it has a protein family domain. This will almost certainly bring the fraction of annotated gene models with an AED <0.5 up to around 90% or more.

As annotations improve you do usually see fewer total genes, but they are longer.

One of the best ways to get a feel for annotation quality is to load the annotations into a browser like Apollo or JBrowse and look at a few of your favorite genes.

Thanks,
Mike

On Apr 25, 2016, at 7:55 AM, Florian wrote:

Hello All,

First off, thank you all for your input! I took a look at all your suggestions and have some questions:

The SOBAcl tool is nice, but I can't seem to find a way to get to the AED values MAKER produces. For example, here is a line from my GFF file:

scaffold2278_size3634 maker mRNA 124 2128 . - . ID=CRIP_012390-RA;Parent=CRIP_012390;Name=CRIP_012390-RA;Alias=maker-scaffold2278_size3634-augustus-gene-0.3-mRNA-1;_AED=0.16;_QI=0|1|0.33|1|1|1|3|0|574;_eAED=0.16;Note=Similar to Tbc1d25: TBC1 domain family member 25 (Mus musculus);

Notice the _AED entry is in the 9th "field", combined with all the other descriptive data. Is there a way to get to this? The information about the number and mean/distribution of gene lengths, while certainly valuable, is hard for me to interpret. How would one classify improvement? More genes annotated? Fewer genes but longer averages?

For the moment I will take a look at GAL, though perl is not my strongest language.

For the scripts Michael provided I have attached the results. It would be great if you could send me a pdf version of the paper you mentioned.

The comparison script lists SN/SP/AC with >98%, which indicates there should be no big changes between annotations, right? But the cumulative AED graph shows that a LOT of entries have an AED value of 1, which would indicate the exact opposite?

You said 95% with less than 0.5 AED would be pretty good, so only ~55% would mean this is a pretty bad annotation?

I am not sure if this is maybe too far off topic for the maker mailing list, but thank you for any clarification / input.
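
On the question of getting at the _AED value in column 9: a minimal Python sketch, assuming nine-column, tab-delimited MAKER GFF3 in which each mRNA line carries an `_AED=` attribute as in the example line above. The function names are hypothetical and this is not MAKER's AED_cdf_generator.pl; it only illustrates reading the attribute from `maker` mRNA features and checking the "fraction with AED below 0.5" rule of thumb.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column (key=value;key=value;...) into a dict."""
    attrs = {}
    for field in column9.strip().strip(";").split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def mrna_aeds(lines):
    """Return the _AED values of MAKER mRNA features only.

    Assumption: source column is 'maker' and type column is 'mRNA';
    match/prediction features are skipped so they do not skew the result.
    """
    aeds = []
    for line in lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[1] != "maker" or cols[2] != "mRNA":
            continue
        attrs = parse_attributes(cols[8])
        if "_AED" in attrs:
            aeds.append(float(attrs["_AED"]))
    return aeds


def fraction_below(aeds, threshold=0.5):
    """Fraction of AED values strictly below the threshold (0.0 if empty)."""
    if not aeds:
        return 0.0
    return sum(1 for aed in aeds if aed < threshold) / len(aeds)
```

Calling `mrna_aeds` on an open GFF3 file and then `fraction_below` on the result gives one number per run that can be compared across rounds. Restricting to mRNA lines matters because other feature types in the same file may carry their own AED values.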

kind regards,
Florian

-------------- next part --------------
A non-text attachment was scrubbed...
Name: bi0411 (1).pdf
Type: application/pdf
Size: 484329 bytes
Desc: bi0411 (1).pdf
URL:

From ian.misner at nih.gov Mon Apr 25 10:20:44 2016
From: ian.misner at nih.gov (Misner, Ian (NIH/NIAID) [C])
Date: Mon, 25 Apr 2016 16:20:44 -0000
Subject: [maker-devel] BUSCO
Message-ID:

Hello,

Are there any guidelines for using BUSCO to help train MAKER? CEGMA has been discontinued but I used to use the cegma2zff.pl steps to use those proteins as a training step. BUSCO seems to train Augustus but I'm not sure what file to pass from BUSCO to MAKER for this to be properly utilized. I didn't see anything specific about this in the archives.

-----
Ian Misner, Ph.D.
Computational Genomics Specialist
Contractor, Medical Science and Computing, Inc.
Bioinformatics and Computational Biosciences Branch (BCBB)
NIH/NIAID/OD/OSMO/OCICB
5601 Fishers Lane, Room 4A59
Rockville, MD 20892
Office: 301-761-6208
Mobile: 301-704-0151
Email: ian.misner at nih.gov
Web: BCBB Home Page
Twitter: @NIAIDBioIT

Disclaimer: The information in this e-mail and any of its attachments is confidential and may contain sensitive information. It should not be used by anyone who is not the original intended recipient. If you have received this e-mail in error please inform the sender and delete it from your mailbox or any other storage devices. National Institute of Allergy and Infectious Diseases shall not accept liability for any statements made that are sender's own and not expressly made on behalf of the NIAID by one of its representatives.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From mcampbel at cshl.edu Mon Apr 25 11:29:46 2016
From: mcampbel at cshl.edu (Campbell, Michael)
Date: Mon, 25 Apr 2016 17:29:46 -0000
Subject: [maker-devel] A way to compare 2 annotation runs?
In-Reply-To: <1B00E9C3-490A-4C06-A188-1F9EBC02F680@students.uni-mainz.de>
References: <57161FB2.30901@students.uni-mainz.de> <148F7C5C-38D9-4F2C-BA0C-010E133B7F34@gmail.com> <571E05B8.5080508@students.uni-mainz.de> <3F81B34E-B7FC-4E37-AFA2-514AB2A397F1@cshl.edu> <571E3256.90705@students.uni-mainz.de> <8287C01C-93C2-4BCA-9483-4EEE0E584ACD@gmail.com> <1B00E9C3-490A-4C06-A188-1F9EBC02F680@students.uni-mainz.de>
Message-ID:

Hi Florian,

I just looked at the code for the AED_cdf_generator.pl script and it is probably grabbing the AEDs off of the raw gene predictions. I'll modify the code so it only grabs the mRNA lines. In the meantime you can use gff3_merge with the -g flag and it will output the MAKER genes only. The command would look like this:

gff3_merge -g all.gff -o genes_only.gff

Mike

> On Apr 25, 2016, at 1:00 PM, Dolze, Florian wrote:
>
> This might be the case, I simply used the script on my complete output gff with all features in it without manually filtering for only mRNA.
>
> On a side note regarding keep_preds, if I wanted to call genes somewhat less stringently because I am expecting to find more, would I set this to e.g. 0.5 to increase the number of genes called?
>
> -Florian
>
>> On 25.04.2016 at 17:30, Carson Holt wrote:
>>
>> If you're running with keep_preds=0, then you either passed in models with model_gff (always kept even without evidence support), or you are parsing out the AED of non-gene reference models from the GFF3 when building your CDF graph. If that is the case, make sure you only pull AED off of features labeled as mRNA in column 2 of the GFF3 and not match features.
>>
>> --Carson
>>
>>
>>> On Apr 25, 2016, at 9:05 AM, Florian wrote:
>>>
>>>
>>> Hi Mike,
>>>
>>> We have run MAKER with keep_preds=0. For completeness I attached the options file we used. We used a SNAP model trained on CEGMA data, GeneMark and Augustus trained with their webservice for the first run and then iterated on the results.
>>>
>>> We expect around 17,000-18,000 genes, but our annotation contains ~12.5k according to SOBAcl. If I remove ~40% with AED values of 1 I will be left with very few compared to the expected number.
>>>
>>> type X file type (count)
>>> =========================================================================================================
>>> |                        |../v2_second_round_functional_blast.gff|../v2_third_round_functional_blast.gff|
>>> =========================================================================================================
>>> |CDS                     |                                  63953|                                 65160|
>>> |contig                  |                                   5292|                                  5292|
>>> |exon                    |                                  60381|                                 61233|
>>> |expressed_sequence_match|                                 275160|                                275160|
>>> |five_prime_UTR          |                                   9424|                                  8764|
>>> |gene                    |                                  12654|                                 12235|
>>> |mRNA                    |                                  13698|                                 13137|
>>> |match                   |                                 146111|                                136852|
>>> |match_part              |                                1704978|                               1697601|
>>> |protein_match           |                                 421814|                                421814|
>>> |three_prime_UTR         |                                   6894|                                  6325|
>>> ---------------------------------------------------------------------------------------------------------
>>>
>>>
>>> regards,
>>> Florian
>>>
>>>
>>>> On 25.04.2016 16:16,
Campbell, Michael wrote: >>>> Hi Florian, >>>> >>>> Your not off topic here. I?ve attached the paper. >>>> >>>> Looking at the plot you sent I?m guessing that there is red dot right underneath the turquoise do t at at (1,1), that would be consistent with the compare annotation script output. Do you have keep_preds=1 set in the maker_opts.ctl file? If so that would explain the abundance of AED=1 annotations. When keep_preds is set to 1 all of the gene predictions are reported as gene models, when keep_preds is set to 0 only the models with evidence support are reported. Also, how many genes are you expecting and how many are you getting? >>>> >>>> The paper I attached goes over different approaches to building final gene sets. The plot attached suggests to me that you have a bunch of unsupported gene models that need to be cleaned out. I will commonly filter out any gene model with an AED of 1 unless it has a protein family domain. This will almost certainly bring the fraction of annotated gene models with an AED <0.5 up to around 90% or more. >>>> >>>> As annotations improve you do usually see fewer total genes but they are longer. >>>> >>>> One of the best ways to get a feel for annotation quality is to load the annotations in to a browser like apollo or jbrowse and look at a few of your favorite genes >>>> >>>> Thanks, >>>> Mike >>>> >>>> >>>> On Apr 25, 2016, at 7:55 AM, Florian > wrote: >>>> >>>> Hello All, >>>> >>>> First off, thank you all for your input! I took a look at all your suggestions and have some questions: >>>> >>>> The SOBAcl tool is nice but I cant seem to find a way to get to the AED values MAKER produces. For example here is a line from my GFF file: >>>> >>>> scaffold2278_size3634 maker mRNA 124 2128 . - . 
ID=CRIP_012390-RA;Parent=CRIP_012390;Name=CRIP_012390-RA;Alias=maker-scaffold2278_size3634-augustus-gene-0.3-mRNA-1;_AED=0.16;_QI=0|1|0.33|1|1|1|3|0|574;_eAED=0.16;Note=Similar to Tbc1d25: TBC1 domain family member 25 (Mus musculus); >>>> Notice the _AED entry is in the 9th "field" combined with all the other descriptive data. Is there a way to get to this? The information about number and mean/distribution of length of genes, while certainly valuable, is hard to interpret for me. How would one classify improvement? More genes annotated? Less genes but longer averages? >>>> >>>> For the moment I will take a look at GAL, though perl is not my strongest language. >>>> >>>> >>>> For the scripts Michael provided I have attached the results. It would be great if you could send me a pdf version of the paper you mentioned. >>>> >>>> The comparison script lists SN/SP/AC with >98% which indicates there should be no big changes between annotations right? But the cumulative AED graph shows a LOT entries have an AED value of 1 which would indicate the exact opposite? >>>> >>>> You said 95% with less than 0.5 AED would be pretty good, soo only ~55% would mean this is a pretty bad annotation? >>>> >>>> I am not sure if this is maybe to far off topic for the maker mailing list, but thank you for any clarification / input. >>>> >>>> >>>> >>>> kind regards, >>>> Florian >>>> >>>> On 20.04.2016 15:16, Campbell, Michael wrote: >>>> >>>> I suspect the Jaccard distance would let you see the annotation sets converging over iterations. The distance between run one and run three should be greater than the distance between run one and two or run two and three. >>>> >>>> MAKER calculates a modified Jaccard distance between the MAKER generated gene models and the aligned evidence called Annotation Edit Distance or AED. Comparing the distribution of AEDs between annotations is a way to tell which annotation set matches the evidence the best. 
>>>> As a rule of thumb, an annotation set is pretty good if greater than ~95% of the annotations have an AED less than 0.5.
>>>>
>>>> There is an accessory script in the MAKER bin called AED_cdf_generator.pl that helps in comparing AED scores. This script is mentioned in the protocols paper Carson mentioned. This paper also describes using protein family domains and homology to manually curated proteins in swissprot as quality metrics. Here is a link to the paper. Let me know if you need me to send you a pdf.
>>>> http://onlinelibrary.wiley.com/doi/10.1002/0471250953.bi0411s48/abstract
>>>>
>>>> I also have a "use at your own risk" script on github that I use to compare MAKER runs two at a time. The script is called compare_annotations_3.2.pl. This particular script has had a long evolution, so it is a little hard to follow the code, but it might be helpful.
>>>> https://github.com/mscampbell/Genome_annotation
>>>>
>>>> The SOBA tool that Barry mentioned is a lot more flexible, and if you are familiar with perl the GAL library does a lot of heavy lifting for you.
>>>>
>>>> Mike
>>>> On Apr 19, 2016, at 5:44 PM, Cook, Malcolm wrote:
>>>>
>>>> Just a quick thought
>>>>
>>>> The smallest summary of what you're after might be the jaccard difference between your annotations as computed by bedtools: http://bedtools.readthedocs.org/en/latest/content/tools/jaccard.html
>>>>
>>>>
>>>> From: maker-devel [mailto:maker-devel-bounces at yandell-lab.org] On Behalf Of Barry Moore
>>>> Sent: Tuesday, April 19, 2016 4:37 PM
>>>> To: Florian; maker-devel
>>>> Cc: Campbell, Michael
>>>> Subject: Re: [maker-devel] A way to compare 2 annotation runs?
>>>>
>>>> The Sequence Ontology provides some tools for this:
>>>>
>>>> SOBAcl has some pre-configured reports/graphs with some flexibility to modify their content/layout.
>>>> https://github.com/The-Sequence-Ontology/SOBA
>>>>
>>>> This simple example provides a table for two GFF3 files of the count of feature types:
>>>>
>>>>     SOBAcl --columns file --rows type --data type --data_type count \
>>>>         data/dmel-all-r5.30_0001000.gff data/dmel-all-r5.30_0010000.gff
>>>>
>>>> More complex examples are available in the test file SOBA/t/sobacl_test.sh
>>>>
>>>>
>>>> The GAL library is a perl library that works well with MAKER output and other valid GFF3 documents. It has some scripts that would provide metrics along the lines of what you're looking for, but is primarily a programming library to make it easy to roll your own.
>>>> https://github.com/The-Sequence-Ontology/GAL
>>>>
>>>> If you're OK with a little bit of perl code, then by modifying the synopsis code in the README a bit, the splice complexity metrics described here (http://www.ncbi.nlm.nih.gov/pubmed/19236712) are easy to produce:
>>>>
>>>>     use GAL::Annotation;
>>>>
>>>>     my $annot = GAL::Annotation->new(qw(file.gff file.fasta));
>>>>     my $features = $annot->features;
>>>>
>>>>     my $genes = $features->search( {type => 'gene'} );
>>>>     while (my $gene = $genes->next) {
>>>>         print $gene->feature_id . "\t";
>>>>         print $gene->splice_complexity . "\n";
>>>>     }
>>>>
>>>>
>>>> Hope that helps,
>>>>
>>>> Barry
>>>>
>>>>
>>>>
>>>> On Apr 19, 2016, at 9:08 AM, Carson Holt wrote:
>>>>
>>>> I'm going to ask Michael Campbell to answer this. He wrote a protocols paper that will help.
>>>>
>>>> --Carson
>>>>
>>>>
>>>>
>>>>
>>>> On Apr 19, 2016, at 6:08 AM, Florian wrote:
>>>>
>>>>
>>>> Hello All,
>>>>
>>>> We ran MAKER on a newly assembled genome for 3 iterations, since 2 seems to be the recommended standard and while on holiday I just ran it a third time. Now I want to compare the results of the iterations to see where the annotation (hopefully) improved/changed, but I can't really come up with a clever way to do this.
>>>>
>>>> I reckon this has to be an often-solved problem, though I couldn't find a solution except an older entry in this mailing list, and that wasn't helpful.
>>>>
>>>>
>>>> So how are people assessing quality of a maker run? How do you say one run was 'better' than another?
>>>>
>>>>
>>>> best regards & thanks for your input,
>>>> Florian
>>>>
>>>> _______________________________________________
>>>> maker-devel mailing list
>>>> maker-devel at yandell-lab.org
>>>> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org
>>

From mcampbel at cshl.edu Mon Apr 25 11:43:50 2016
From: mcampbel at cshl.edu (Campbell, Michael)
Date: Mon, 25 Apr 2016 17:43:50 -0000
Subject: [maker-devel] A way to compare 2 annotation runs?
In-Reply-To:
References: <57161FB2.30901@students.uni-mainz.de> <148F7C5C-38D9-4F2C-BA0C-010E133B7F34@gmail.com> <571E05B8.5080508@students.uni-mainz.de> <3F81B34E-B7FC-4E37-AFA2-514AB2A397F1@cshl.edu> <571E3256.90705@students.uni-mainz.de> <8287C01C-93C2-4BCA-9483-4EEE0E584ACD@gmail.com> <1B00E9C3-490A-4C06-A188-1F9EBC02F680@students.uni-mainz.de>
Message-ID: <23F95F61-E0DD-4F55-B3F0-499FC725627D@cshl.edu>

I updated the AED_cdf_generator.pl script on github so it only looks at mRNA lines. The only time that it would get AEDs from the gene predictions is if pred_stats was set to 1. Was pred_stats=1 set in the maker_opts.ctl file?
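The point about reading AEDs only from mRNA lines can be checked by hand. The following is a minimal sketch, not the AED_cdf_generator.pl implementation: it assumes a tab-delimited MAKER GFF3 with the `_AED` attribute in column 9, and the function name `aed_summary` is made up for illustration.

```shell
# aed_summary GFF3 [THRESHOLD]   (hypothetical helper, not part of MAKER)
# Prints: <mRNA features carrying _AED> <how many are below THRESHOLD> <fraction>
aed_summary() {
  awk -F'\t' -v thr="${2:-0.5}" '
    /^#/ { next }                        # skip GFF3 comment and pragma lines
    $3 == "mRNA" {                       # column 3 is the feature type
      n = split($9, attr, ";")           # column 9 holds key=value attributes
      for (i = 1; i <= n; i++) {
        if (index(attr[i], "_AED=") == 1) {   # matches _AED= but not _eAED=
          total++
          if (substr(attr[i], 6) + 0 < thr) below++
        }
      }
    }
    END {
      if (total > 0) printf "%d %d %.2f\n", total, below, below / total
      else print "0 0 NA"
    }
  ' "$1"
}
```

Calling `aed_summary my_run.all.gff` (file name is a placeholder) on the full output and on the genes-only GFF should give the same numbers, since match features are skipped either way.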
Thanks,
Mike

> On Apr 25, 2016, at 1:29 PM, Campbell, Michael wrote:
>
> Hi Florian,
>
> I just looked at the code for the AED_cdf_generator.pl script and it is probably grabbing the AEDs off of the raw gene predictions. I'll modify the code so it only grabs the mRNA lines. In the meantime you can use gff3_merge with the -g flag and it will output the MAKER genes only. The command would look like this:
>
>     gff3_merge -g all.gff -o genes_only.gff
>
> Mike
>> On Apr 25, 2016, at 1:00 PM, Dolze, Florian wrote:
>>
>> This might be the case; I simply used the script on my complete output gff with all features in it, without manually filtering for only mRNA.
>>
>> On a side note regarding keep_preds, if I wanted to call genes somewhat less stringently because I am expecting to find more, would I set this to e.g. 0.5 to increase the number of genes called?
>>
>> -Florian
>>
>>> On 25.04.2016 at 17:30, Carson Holt wrote:
>>>
>>> If you're running with keep_preds=0, then you either passed in models with model_gff (always kept even without evidence support), or you are parsing out the AED of non-gene reference models from the GFF3 when building your CDF graph. If that is the case, make sure you only pull AED off of features labeled as mRNA in column 3 (the type column) of the GFF3 and not match features.
>>>
>>> --Carson
>>>
>>>
>>>> On Apr 25, 2016, at 9:05 AM, Florian wrote:
>>>>
>>>>
>>>> Hi Mike,
>>>>
>>>> We have run MAKER with keep_preds=0. For completeness I attached the options file we used. We used a SNAP model trained on CEGMA data, GeneMark and Augustus trained with their webservice for the first run and then iterated on the results.
>>>>
>>>> We expect around 17,000-18,000 genes, but our annotation contains ~12.5k according to SOBAcl. If I remove ~40% with AED values of 1 I will be left with very few compared to the expected number.
>>>>
>>>>
>>>>
>>>> type X file type (count)
>>>> =========================================================================================================
>>>> |                        |../v2_second_round_functional_blast.gff|../v2_third_round_functional_blast.gff|
>>>> =========================================================================================================
>>>> |CDS                     |                                 63953 |                                65160 |
>>>> +------------------------+---------------------------------------+--------------------------------------+
>>>> |contig                  |                                  5292 |                                 5292 |
>>>> +------------------------+---------------------------------------+--------------------------------------+
>>>> |exon                    |                                 60381 |                                61233 |
>>>> +------------------------+---------------------------------------+--------------------------------------+
>>>> |expressed_sequence_match|                                275160 |                               275160 |
>>>> +------------------------+---------------------------------------+--------------------------------------+
>>>> |five_prime_UTR          |                                  9424 |                                 8764 |
>>>> +------------------------+---------------------------------------+--------------------------------------+
>>>> |gene                    |                                 12654 |                                12235 |
>>>> +------------------------+---------------------------------------+--------------------------------------+
>>>> |mRNA                    |                                 13698 |                                13137 |
>>>> +------------------------+---------------------------------------+--------------------------------------+
>>>> |match                   |                                146111 |                               136852 |
>>>> +------------------------+---------------------------------------+--------------------------------------+
>>>> |match_part              |                               1704978 |                              1697601 |
>>>> +------------------------+---------------------------------------+--------------------------------------+
>>>> |protein_match           |                                421814 |                               421814 |
>>>> +------------------------+---------------------------------------+--------------------------------------+
>>>> |three_prime_UTR         |                                  6894 |                                 6325 |
>>>> ---------------------------------------------------------------------------------------------------------
>>>>
>>>>
>>>> regards,
>>>> Florian
>>>>
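Feature counts like those above show how many features changed between runs, but not whether the same bases are annotated. The Jaccard statistic suggested earlier in the thread captures that overlap (bedtools jaccard computes it from sorted interval files). As a rough sketch of the definition only, assuming tab-delimited GFF3 input and ignoring strand, the intersection and union of `gene` intervals can be swept with sort and awk; `gene_events` and `gene_jaccard` are hypothetical helper names.

```shell
# gene_events GFF3 LABEL: emit one "open" and one "close" event per gene,
# using half-open coordinates (start-1, end).
gene_events() {
  awk -F'\t' -v OFS='\t' -v label="$2" \
    '$3 == "gene" { print $1, $4 - 1, label, 1; print $1, $5, label, -1 }' "$1"
}

# gene_jaccard A.gff B.gff
# Prints: <intersection bp> <union bp> <jaccard = intersection / union>
gene_jaccard() {
  { gene_events "$1" A; gene_events "$2" B; } |
    LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k2,2n |
    awk -F'\t' '
      {
        if ($1 == seq) {                 # same scaffold: account for the gap since the last event
          len = $2 - pos
          if (depth["A"] > 0 || depth["B"] > 0) uni += len
          if (depth["A"] > 0 && depth["B"] > 0) inter += len
        }
        seq = $1; pos = $2
        depth[$3] += $4                  # open (+1) or close (-1) an interval in set A or B
      }
      END {
        if (uni > 0) printf "%d %d %.3f\n", inter, uni, inter / uni
        else print "0 0 NA"
      }'
}
```

If the runs are converging, the value for run 2 vs. run 3 should be closer to 1 than the value for run 1 vs. run 3. For real genomes bedtools is the more practical choice; this sketch only makes the arithmetic explicit.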
From mcampbel at cshl.edu Tue Apr 26 08:48:10 2016
From: mcampbel at cshl.edu (Campbell, Michael)
Date: Tue, 26 Apr 2016 14:48:10 -0000
Subject: [maker-devel] A way to compare 2 annotation runs?
In-Reply-To: <571F4E29.9080103@students.uni-mainz.de>
References: <57161FB2.30901@students.uni-mainz.de> <148F7C5C-38D9-4F2C-BA0C-010E133B7F34@gmail.com> <571E05B8.5080508@students.uni-mainz.de> <349A414A-BA65-420E-9A39-5B3583993AB9@genetics.utah.edu> <571F4E29.9080103@students.uni-mainz.de>
Message-ID: <7D300E49-AEF6-424B-912D-78F9551A14B8@cshl.edu>

Glad to hear it.
Good luck,
Mike

On Apr 26, 2016, at 7:16 AM, Florian wrote:

Hello all,

With the updated scripts things look much better. I get 95% of the mRNA features with <= 0.5 AED now and SOBAcl gave me a mean AED value of 0.17 / 0.16 for run 2/3.
I think that's an OK result for a newly assembled genome?

Thank you all for the great help,
Florian

On 25.04.2016 21:46, Barry Moore wrote:

Hi Florian,

Something like this should work:

    SOBAcl --data +_AED t/data/refseq_short.gff3 --data_type mean

Sorry this feature was undocumented, and I discovered a bug in it while I was looking at it just now, so you'll need to pull an update from git for it to work correctly.

Basically, if you add a '+' to the value passed to --data, SOBAcl will treat the --data value (with the + removed) as a key to look up the value in the attributes from column 9, so if +_AED is given on the command line then the value of the _AED attribute will be used for the summary statistics. Note that if the attribute is missing for a given feature then 0 is used as the value (which is of course different than treating it as NULL). Also note that if --data_type is count then features that have the given attribute are counted regardless of the value of the attribute.

Just FYI, grabbing those values with a GAL script would look like this (untested):

    use GAL::Annotation;

    my $annot = GAL::Annotation->new(qw(file.gff));
    my $features = $annot->features;

    my $mRNAs = $features->search( {type => 'mRNA'} );
    while (my $mRNA = $mRNAs->next) {
        print $mRNA->feature_id;
        print "\t";
        print $mRNA->attribute_value('_AED');
        print "\n";
    }

B

On Apr 25, 2016, at 5:55 AM, Florian <fdolze at students.uni-mainz.de> wrote:

Hello All,

First off, thank you all for your input! I took a look at all your suggestions and have some questions:

The SOBAcl tool is nice but I can't seem to find a way to get to the AED values MAKER produces. For example, here is a line from my GFF file:

scaffold2278_size3634 maker mRNA 124 2128 . - .
ID=CRIP_012390-RA;Parent=CRIP_012390;Name=CRIP_012390-RA;Alias=maker-scaffold2278_size3634-augustus-gene-0.3-mRNA-1;_AED=0.16;_QI=0|1|0.33|1|1|1|3|0|574;_eAED=0.16;Note=Similar to Tbc1d25: TBC1 domain family member 25 (Mus musculus); Notice the _AED entry is in the 9th "field" combined with all the other descriptive data. Is there a way to get to this? The information about number and mean/distribution of length of genes, while certainly valuable, is hard to interpret for me. How would one classify improvement? More genes annotated? Less genes but longer averages? For the moment I will take a look at GAL, though perl is not my strongest language. For the scripts Michael provided I have attached the results. It would be great if you could send me a pdf version of the paper you mentioned. The comparison script lists SN/SP/AC with >98% which indicates there should be no big changes between annotations right? But the cumulative AED graph shows a LOT entries have an AED value of 1 which would indicate the exact opposite? You said 95% with less than 0.5 AED would be pretty good, soo only ~55% would mean this is a pretty bad annotation? I am not sure if this is maybe to far off topic for the maker mailing list, but thank you for any clarification / input. kind regards, Florian On 20.04.2016 15:16, Campbell, Michael wrote: I suspect the Jaccard distance would let you see the annotation sets converging over iterations. The distance between run one and run three should be greater than the distance between run one and two or run two and three. MAKER calculates a modified Jaccard distance between the MAKER generated gene models and the aligned evidence called Annotation Edit Distance or AED. Comparing the distribution of AEDs between annotations is a way to tell which annotation set matches the evidence the best. As a rule of thumb an annotation set is pretty good if greater than ~95% of the annotations have an AED less than 0.5. 
There is an accessory script in the MAKER bin called AED_cdf_generator.pl that helps in comparing AED scores. This script is mentioned in the protocols paper Carson mentioned. This paper also describes using protein family domains and homology to manually curated proteins in swissprot as quality metrics. Here is a link to the paper. Let me know if you need me to send you a pdf. http://onlinelibrary.wiley.com/doi/10.1002/0471250953.bi0411s48/abstract I also have a "use at your own risk" script on github that I use to compare MAKER runs two at a time. the script is called compare_annotations_3.2.pl. This particular script has had a long evolution, so it is a little hard to follow the code, but it might be helpful. https://github.com/mscampbell/Genome_annotation The SOBA tool that Barry mentioned is a lot more flexible and if you are familiar with perl the GAL library does a lot of heavy lifting for you. Mike On Apr 19, 2016, at 5:44 PM, Cook, Malcolm > wrote: Just a quick thought The smallest summary of what you?re after might be the jaccard difference between you annotation as computed by bedtoolshttp://bedtools.readthedocs.org/en/latest/content/tools/jaccard.html ?? From: maker-devel [mailto:maker-devel-bounces at yandell-lab.org] On Behalf Of Barry Moore Sent: Tuesday, April 19, 2016 4:37 PM To: Florian >; maker-devel > Cc: Campbell, Michael > Subject: Re: [maker-devel] A way to compare 2 annotation runs? The Sequence Ontology provides some tools for this: SOBAcl has some pre-configured reports/graphs with some flexibility to modify their content/layout. 
https://github.com/The-Sequence-Ontology/SOBA This simple example provides a table for two GFF3 files of the count of feature types: SOBAcl --columns file --rows type --data type --data_type count \ data/dmel-all-r5.30_0001000.gff data/dmel-all-r5.30_0010000.gff More complex examples are available in the test file SOBA/t/sobacl_test.sh The GAL library is a perl library that works well with MAKER output and other valid GFF3 documents. I has some scripts that would provide metrics along the lines of what you?re looking for, but is primarily a programing library to make it easy to roll your own https://github.com/The-Sequence-Ontology/GAL If you?re OK with a little bit of perl code, modifying the synopsis code in the README a bit you can generate the splice complexity metrics described here (http://www.ncbi.nlm.nih.gov/pubmed/19236712) are easy to produce: use GAL::Annotation; my $annot = GAL::Annotation->new(qw(file.gff file.fasta); my $features = $annot->features; my $genes = $features->search( {type => ?gene'} ); while (my $gene = $genes->next) { print $gene->feature_id . ?\t"; print $gene->splice_complexity . ?\n?; } } Hope that helps, Barry On Apr 19, 2016, at 9:08 AM, Carson Holt > wrote: I?m going to ask Michael Campbell to answer this. He wrote a protocols paper that will help. ?Carson On Apr 19, 2016, at 6:08 AM, Florian > wrote: Hello All, We ran MAKER on a newly assembled genome for 3 iterations, since 2 seems to be the recommended standard and while on holiday I just ran it a third time. Now I want to compare the results of the iterations to see where the annotation (hopefully) improved/changed but I cant really come up with a clever way to this. I reckon this has to be an often solved problem though I couldnt find a solution except an older entry in this mail-list but that wasnt helpful. So how are people assessing quality of a maker run? How do you say one run was 'better' than another? 
best regards & thanks for your input,
Florian

_______________________________________________
maker-devel mailing list
maker-devel at yandell-lab.org
http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org

From qlian003 at ucr.edu Wed Apr 27 12:06:48 2016
From: qlian003 at ucr.edu (Qihua Liang)
Date: Wed, 27 Apr 2016 18:06:48 -0000
Subject: [maker-devel] Maker example data for 2013 GMOD summer school
In-Reply-To: <1772AAA1-C6ED-4FCA-B4C9-39F522D3D076@genetics.utah.edu>
References: <1772AAA1-C6ED-4FCA-B4C9-39F522D3D076@genetics.utah.edu>
Message-ID: <8F27CEB4-B16B-4BDC-BA11-5FCCBD05BC3C@ucr.edu>

Hi, Daniel

I have been using Maker to annotate the cowpea genome for a while, and now I am wondering if I could use multiple threads instead of a single one. It has been running tblastx for a long time using a single thread, but I couldn't find a setting in the documentation for assigning multiple threads to Maker. Is there such an option?

Thank you
Qihua

> On Mar 30, 2016, at 2:17 PM, Daniel Ence wrote:
>
> Hi Qihua,
>
> I believe that most of the data we used in the tutorials are available in the maker/data directory, which is included in all maker distributions. Please let me know if that isn't the case.
>
> ~Daniel
>
>
> Daniel Ence
> Graduate Student
> Eccles Institute of Human Genetics
> University of Utah
> 15 North 2030 East, Room 2100
> Salt Lake City, UT 84112-5330
>
>> On Mar 30, 2016, at 3:10 PM, Qihua Liang wrote:
>>
>> Hi Michael and Daniel,
>>
>> I am a graduate student at UC Riverside, and recently I have been learning to use Maker for genome annotation.
>> I was trying to find some tutorials to follow and practice on example data, and I found out that you gave a talk on Maker during the 2013 GMOD summer school, and the tutorial for it is very detailed. Nice job!
>>
>> But the example data under the folder you mentioned as ./maker/maker_course is not provided on the website, and I am wondering if it is available to the public or not. If yes, could you send me those materials so that I could follow your tutorial to practice using Maker?
>>
>> Thank you
>> Best
>> Qihua
>

From qlian003 at ucr.edu Wed Apr 27 12:35:09 2016
From: qlian003 at ucr.edu (Qihua Liang)
Date: Wed, 27 Apr 2016 18:35:09 -0000
Subject: [maker-devel] Maker example data for 2013 GMOD summer school
In-Reply-To: <2572DB54-6C29-483E-AAAB-7626FEE76DFC@genetics.utah.edu>
References: <1772AAA1-C6ED-4FCA-B4C9-39F522D3D076@genetics.utah.edu> <8F27CEB4-B16B-4BDC-BA11-5FCCBD05BC3C@ucr.edu> <2572DB54-6C29-483E-AAAB-7626FEE76DFC@genetics.utah.edu>
Message-ID:

Hi Daniel,

Actually I'm blasting with both cowpea RNASeq and common bean RNASeq. And yes, the datasets are large, so it has taken me a couple of weeks so far and it's still running. Do you have any advice on speeding up this process?

Thanks
Qihua

> On Apr 27, 2016, at 11:16 AM, Daniel Ence wrote:
>
> Hi Qihua,
>
> In the maker_opts.ctl file there is an option "cpus" which allows you to tell blast to use more than 1 cpu. The comment for the line says that you should not set this higher than 1 when using MPI. I believe that the reason for this is that each MPI process runs blast on its own, so the number of cpus used will be the number of MPI processes X the number of cpus for blast, which can quickly get larger than the number of cpus available.
>
> At the same time, it's usually not advisable to use tblastx to align large datasets because of the increased amount of time it takes. Are these RNAseq datasets from another species that you're using tblastx for?
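The arithmetic behind that warning can be written down directly. This is only an illustrative sketch: the `cpus` option name comes from the reply above, while the helper name `blast_cpu_load` and the idea of comparing against a known core count are assumptions made for the example.

```shell
# blast_cpu_load MPI_PROCESSES CPUS_PER_BLAST AVAILABLE_CORES   (hypothetical helper)
# Each MPI process launches its own BLAST jobs, so the load multiplies.
blast_cpu_load() {
  total=$(( $1 * $2 ))
  if [ "$total" -gt "$3" ]; then
    echo "$total oversubscribed"
  else
    echo "$total ok"
  fi
}
```

For example, 8 MPI processes with cpus=4 on a 16-core node would ask for 32 cores, whereas a single non-MPI run with cpus=8 stays within the node.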
>
> Daniel Ence
> Graduate Student
> Eccles Institute of Human Genetics
> University of Utah
> 15 North 2030 East, Room 2100
> Salt Lake City, UT 84112-5330
>
>> On Apr 27, 2016, at 12:06 PM, Qihua Liang wrote:
>>
>> Hi, Daniel
>>
>> I have been using Maker to annotate the cowpea genome for a while, and now I am wondering if I could use multiple threads instead of a single one. It has been running tblastx for a long time using a single thread, but I couldn't find a setting in the documentation for assigning multiple threads to Maker. Is there such an option?
>>
>> Thank you
>> Qihua
>>
>>
>>> On Mar 30, 2016, at 2:17 PM, Daniel Ence wrote:
>>>
>>> Hi Qihua,
>>>
>>> I believe that most of the data we used in the tutorials are available in the maker/data directory, which is included in all maker distributions. Please let me know if that isn't the case.
>>>
>>> ~Daniel
>>>
>>>
>>> Daniel Ence
>>> Graduate Student
>>> Eccles Institute of Human Genetics
>>> University of Utah
>>> 15 North 2030 East, Room 2100
>>> Salt Lake City, UT 84112-5330
>>>
>>>> On Mar 30, 2016, at 3:10 PM, Qihua Liang wrote:
>>>>
>>>> Hi Michael and Daniel,
>>>>
>>>> I am a graduate student at UC Riverside, and recently I have been learning to use Maker for genome annotation. I was trying to find some tutorials to follow and practice on example data, and I found out that you gave a talk on Maker during the 2013 GMOD summer school, and the tutorial for it is very detailed. Nice job!
>>>>
>>>> But the example data under the folder you mentioned as ./maker/maker_course is not provided on the website, and I am wondering if it is available to the public or not. If yes, could you send me those materials so that I could follow your tutorial to practice using Maker?
>>>>
>>>> Thank you
>>>> Best
>>>> Qihua
>

From chenwenbo1020 at gmail.com Sat Apr 2 17:41:26 2016
From: chenwenbo1020 at gmail.com (陈文博)
Date: Sat, 2 Apr 2016 19:41:26 -0400
Subject: [maker-devel] mapping annotations to a new assembly
Message-ID:

Hi All,

Recently, I updated the genome assembly and want to update the annotation to fit the new genome; I only want to update the gene positions. I used Maker and changed the maker_opts.ctl file as follows:

genome=$PATH_TO_mygenome
organism_type=eukaryotic
est=$PATH_TO_transcript_seq
est2genome=1
est_forward=1

After running Maker, some genes were lost. There were 14,146 transcripts as input, but only 13,092 gene models in the output. Does anyone know the reason? Thank you!

Best regards,
Wenbo

From carsonhh at gmail.com Mon Apr 4 10:34:45 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 4 Apr 2016 10:34:45 -0600
Subject: [maker-devel] mapping annotations to a new assembly
In-Reply-To:
References:
Message-ID: <077DBA54-07A3-4A74-8A76-8F7E7EA246E3@gmail.com>

Because the assembly has changed. That means that sequence can be different, missing, or altered to break previous CDS. You can try relaxing the filtering parameters in maker_bopts.ctl to recover more partial or incomplete matches. Also adjust the max intron size to allow for really long introns. That might recover a few more.

--Carson

From chenwenbo1020 at gmail.com Mon Apr 4 10:40:32 2016
From: chenwenbo1020 at gmail.com (陈文博)
Date: Mon, 4 Apr 2016 12:40:32 -0400
Subject: [maker-devel] mapping annotations to a new assembly
In-Reply-To: <077DBA54-07A3-4A74-8A76-8F7E7EA246E3@gmail.com>
References: <077DBA54-07A3-4A74-8A76-8F7E7EA246E3@gmail.com>
Message-ID:

Hi Carson,

Thank you.

Sorry, I forgot to mention that in the new version of the assembly I only connected some scaffolds into superscaffolds with Ns.

My annotation question is:

Maker uses BLAST to anchor the genes. If some genes map to multiple positions (for example single-exon genes), what will Maker decide to do?

Thanks!

Best,
Wenbo

From carsonhh at gmail.com Mon Apr 4 10:42:58 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 4 Apr 2016 10:42:58 -0600
Subject: [maker-devel] mapping annotations to a new assembly
In-Reply-To:
References: <077DBA54-07A3-4A74-8A76-8F7E7EA246E3@gmail.com>
Message-ID: <2005D161-2359-4836-965D-1007E9BADEA6@gmail.com>

MAKER will report back all positions. The value in the score column can be used to see how well they match the original (range between 0 and 100). In the event of a tie, you will need to manually select one or the other. The process of mapping onto a new assembly is unfortunately not completely automated. It still requires intervention from the user in those cases.

--Carson

From kai.kamm at ecolevol.de Mon Apr 18 07:13:14 2016
From: kai.kamm at ecolevol.de (Kai Kamm)
Date: Mon, 18 Apr 2016 15:13:14 +0200
Subject: [maker-devel] Maker Failed Contigs, Bio::Root::Exception
Message-ID: <5714DD6A.1080309@ecolevol.de>

Hi,

While I have no problem running Maker on my desktop computer (Ubuntu 14.04 LTS), I always get the error below (for all contigs) when I try to run Maker on a server.

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Did not specify a Query End or Query Begin
STACK: Error::throw
STACK: Bio::Root::Root::throw /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Root/Root.pm:449
STACK: Bio::Search::HSP::GenericHSP::_query_seq_feature /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Search/HSP/GenericHSP.pm:1525
STACK: Bio::Search::HSP::GenericHSP::query /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Search/HSP/GenericHSP.pm:956
STACK: Bio::Search::HSP::HSPI::start /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Search/HSP/HSPI.pm:504
STACK: PhatHit_utils::add_offset /homes/biertank/kai/maker/bin/../lib/PhatHit_utils.pm:1462
STACK: GI::parse_abinit_file /homes/biertank/kai/maker/bin/../lib/GI.pm:1199
STACK: Process::MpiChunk::_go /homes/biertank/kai/maker/bin/../lib/Process/MpiChunk.pm:1469
STACK: Process::MpiChunk::run /homes/biertank/kai/maker/bin/../lib/Process/MpiChunk.pm:341
STACK: main::node_thread /homes/biertank/kai/maker/bin/maker:1454
STACK: threads::new /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/forks.pm:799
STACK: /homes/biertank/kai/maker/bin/maker:914
-----------------------------------------------------------
--> rank=2, hostname=bioinf.uni-leipzig.de
ERROR: Failed while gathering ab-init output files
ERROR: Chunk failed at level:1, tier_type:2
FAILED CONTIG:scaffold20_cov246

ERROR: Chunk failed at level:4, tier_type:0
FAILED CONTIG:scaffold20_cov246

examining contents of the fasta file and run log

I have tried to rerun "perl ./Build.PL" and then "./Build install" several times using different versions of Perl. To install the required Perl modules I have used "./Build installdeps" and I also tried installing the dependencies manually via CPAN - to no avail.

Any idea?

Thank you!
Kai

From carsonhh at gmail.com Mon Apr 18 14:30:28 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 18 Apr 2016 14:30:28 -0600
Subject: [maker-devel] Maker Failed Contigs, Bio::Root::Exception
In-Reply-To: <5714DD6A.1080309@ecolevol.de>
References: <5714DD6A.1080309@ecolevol.de>
Message-ID: <5249E98C-9902-4369-9B68-95F3662B61CE@gmail.com>

Try updating BioPerl (use the CPAN version and not the BioPerl-live version because it will fail). Also use MAKER version 2.31.8 and not the 3.00.0-beta version.

Then make sure there is no error further up. What you are seeing may be a snowball effect of the real error, which could be several screens back in the text. If you are using GFF3 files as input then your format is probably incorrect.

--Carson

From fdolze at students.uni-mainz.de Tue Apr 19 06:08:18 2016
From: fdolze at students.uni-mainz.de (Florian)
Date: Tue, 19 Apr 2016 14:08:18 +0200
Subject: [maker-devel] A way to compare 2 annotation runs?
Message-ID: <57161FB2.30901@students.uni-mainz.de>

Hello All,

We ran MAKER on a newly assembled genome for 3 iterations, since 2 seems to be the recommended standard and while on holiday I just ran it a third time. Now I want to compare the results of the iterations to see where the annotation (hopefully) improved/changed, but I can't really come up with a clever way to do this.

I reckon this has to be an often-solved problem, though I couldn't find a solution except an older entry in this mailing list, and that wasn't helpful.

So how are people assessing the quality of a maker run? How do you say one run was 'better' than another?
best regards & thanks for your input, Florian From kai.kamm at ecolevol.de Tue Apr 19 06:36:53 2016 From: kai.kamm at ecolevol.de (Kai Kamm) Date: Tue, 19 Apr 2016 14:36:53 +0200 Subject: [maker-devel] Maker Failed Contigs, Bio::Root::Exception In-Reply-To: <5249E98C-9902-4369-9B68-95F3662B61CE@gmail.com> References: <5714DD6A.1080309@ecolevol.de> <5249E98C-9902-4369-9B68-95F3662B61CE@gmail.com> Message-ID: <57162665.7070409@ecolevol.de> Hello, now it seems to work. I (re)installed BioPerl like so: ------------------------------------------------------------ find the name of the latest BioPerl package: cpan>d /bioperl/ .... Distribution CJFIELDS/BioPerl-1.6.901.tar.gz Distribution CJFIELDS/BioPerl-1.6.922.tar.gz Distribution CJFIELDS/BioPerl-1.6.924.tar.gz And install the most recent: cpan>install CJFIELDS/BioPerl-1.6.924.tar.gz ---------------------------------------------------------------- Produced some error messages during install, but Maker now works. Just wonder why the BioPerl installation did not work properly with neither "./Build installdeps" nor via cpan>install Bundle::BioPerl. And why it worked this way on my desktop. Anyway Thanks! Am 18.04.2016 um 22:30 schrieb Carson Holt: > Try updating BioPerl (use the CPAN version and not the BioPerl-live version because it will fail). Also use MAKER version 2.31.8 and not the 3.00.0-beta version. > > Then make sure there is not error further up. What you are seeing may be a snowball effect of the real error which could be several screens back in the text. If you are using GFF3 files as input then your format is probably incorrect. > > ?Carson > > >> On Apr 18, 2016, at 7:13 AM, Kai Kamm wrote: >> >> Hi, >> >> while I have no problem running Maker on my desktop computer (Ubuntu 14.04 LTS), I always get the error below (for all contigs) when I try to run Maker on a server. 
>> >> >> ------------- EXCEPTION: Bio::Root::Exception ------------- >> MSG: Did not specify a Query End or Query Begin >> STACK: Error::throw >> STACK: Bio::Root::Root::throw /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Root/Root.pm:449 >> STACK: Bio::Search::HSP::GenericHSP::_query_seq_feature /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Search/HSP/GenericHSP.pm:1525 >> STACK: Bio::Search::HSP::GenericHSP::query /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Search/HSP/GenericHSP.pm:956 >> STACK: Bio::Search::HSP::HSPI::start /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Search/HSP/HSPI.pm:504 >> STACK: PhatHit_utils::add_offset /homes/biertank/kai/maker/bin/../lib/PhatHit_utils.pm:1462 >> STACK: GI::parse_abinit_file /homes/biertank/kai/maker/bin/../lib/GI.pm:1199 >> STACK: Process::MpiChunk::_go /homes/biertank/kai/maker/bin/../lib/Process/MpiChunk.pm:1469 >> STACK: Process::MpiChunk::run /homes/biertank/kai/maker/bin/../lib/Process/MpiChunk.pm:341 >> STACK: main::node_thread /homes/biertank/kai/maker/bin/maker:1454 >> STACK: threads::new /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/forks.pm:799 >> STACK: /homes/biertank/kai/maker/bin/maker:914 >> ----------------------------------------------------------- >> --> rank=2, hostname=bioinf.uni-leipzig.de >> ERROR: Failed while gathering ab-init output files >> ERROR: Chunk failed at level:1, tier_type:2 >> FAILED CONTIG:scaffold20_cov246 >> >> ERROR: Chunk failed at level:4, tier_type:0 >> FAILED CONTIG:scaffold20_cov246 >> >> examining contents of the fasta file and run log >> >> >> >> I have tried to rerun "perl ./Build.PL" and then "./Build install" several times using different versions of Perl. To install the required Perl modules I have used "./Build installdeps" and I also tried installing the dependencies manually via CPAN - to no avail. >> >> Any idea? >> >> Thank you! 
>> Kai >> >> _______________________________________________ >> maker-devel mailing list >> maker-devel at yandell-lab.org >> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org > From carsonhh at gmail.com Tue Apr 19 09:08:02 2016 From: carsonhh at gmail.com (Carson Holt) Date: Tue, 19 Apr 2016 09:08:02 -0600 Subject: [maker-devel] A way to compare 2 annotation runs? In-Reply-To: <57161FB2.30901@students.uni-mainz.de> References: <57161FB2.30901@students.uni-mainz.de> Message-ID: <148F7C5C-38D9-4F2C-BA0C-010E133B7F34@gmail.com> I?m going to ask Michael Campbell to answer this. He wrote a protocols paper that will help. ?Carson > On Apr 19, 2016, at 6:08 AM, Florian wrote: > > > Hello All, > > We ran MAKER on a newly assembled genome for 3 iterations, since 2 seems to be the recommended standard and while on holiday I just ran it a third time. Now I want to compare the results of the iterations to see where the annotation (hopefully) improved/changed but I cant really come up with a clever way to this. > > I reckon this has to be an often solved problem though I couldnt find a solution except an older entry in this mail-list but that wasnt helpful. > > > So how are people assessing quality of a maker run? How do you say one run was 'better' than another? 
> > > best regards & thanks for your input, > Florian > > _______________________________________________ > maker-devel mailing list > maker-devel at yandell-lab.org > http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org From carsonhh at gmail.com Tue Apr 19 09:18:20 2016 From: carsonhh at gmail.com (Carson Holt) Date: Tue, 19 Apr 2016 09:18:20 -0600 Subject: [maker-devel] Maker Failed Contigs, Bio::Root::Exception In-Reply-To: <57162665.7070409@ecolevol.de> References: <5714DD6A.1080309@ecolevol.de> <5249E98C-9902-4369-9B68-95F3662B61CE@gmail.com> <57162665.7070409@ecolevol.de> Message-ID: <8B4352FC-113E-45EC-B7D4-6983B8FF2815@gmail.com> Intall as so ?> cpan> install Bio::Perl But it sounds like you?ve got a proper version now. Most likely you had a non-cpan version of BioPerl installed. The version it gave met the ./Build dependency requirements, but it was really a broke version. This happens if you have BioPerl-live installed for example. ?Carson > On Apr 19, 2016, at 6:36 AM, Kai Kamm wrote: > > Hello, > > now it seems to work. I (re)installed BioPerl like so: > > ------------------------------------------------------------ > find the name of the latest BioPerl package: > > cpan>d /bioperl/ > > .... > > Distribution CJFIELDS/BioPerl-1.6.901.tar.gz > Distribution CJFIELDS/BioPerl-1.6.922.tar.gz > Distribution CJFIELDS/BioPerl-1.6.924.tar.gz > > And install the most recent: > > cpan>install CJFIELDS/BioPerl-1.6.924.tar.gz > ---------------------------------------------------------------- > > Produced some error messages during install, but Maker now works. > > Just wonder why the BioPerl installation did not work properly with neither "./Build installdeps" nor via cpan>install Bundle::BioPerl. > > And why it worked this way on my desktop. > > Anyway > Thanks! > > > Am 18.04.2016 um 22:30 schrieb Carson Holt: >> Try updating BioPerl (use the CPAN version and not the BioPerl-live version because it will fail). 
Also use MAKER version 2.31.8 and not the 3.00.0-beta version. >> >> Then make sure there is not error further up. What you are seeing may be a snowball effect of the real error which could be several screens back in the text. If you are using GFF3 files as input then your format is probably incorrect. >> >> ?Carson >> >> >>> On Apr 18, 2016, at 7:13 AM, Kai Kamm wrote: >>> >>> Hi, >>> >>> while I have no problem running Maker on my desktop computer (Ubuntu 14.04 LTS), I always get the error below (for all contigs) when I try to run Maker on a server. >>> >>> >>> ------------- EXCEPTION: Bio::Root::Exception ------------- >>> MSG: Did not specify a Query End or Query Begin >>> STACK: Error::throw >>> STACK: Bio::Root::Root::throw /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Root/Root.pm:449 >>> STACK: Bio::Search::HSP::GenericHSP::_query_seq_feature /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Search/HSP/GenericHSP.pm:1525 >>> STACK: Bio::Search::HSP::GenericHSP::query /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Search/HSP/GenericHSP.pm:956 >>> STACK: Bio::Search::HSP::HSPI::start /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Search/HSP/HSPI.pm:504 >>> STACK: PhatHit_utils::add_offset /homes/biertank/kai/maker/bin/../lib/PhatHit_utils.pm:1462 >>> STACK: GI::parse_abinit_file /homes/biertank/kai/maker/bin/../lib/GI.pm:1199 >>> STACK: Process::MpiChunk::_go /homes/biertank/kai/maker/bin/../lib/Process/MpiChunk.pm:1469 >>> STACK: Process::MpiChunk::run /homes/biertank/kai/maker/bin/../lib/Process/MpiChunk.pm:341 >>> STACK: main::node_thread /homes/biertank/kai/maker/bin/maker:1454 >>> STACK: threads::new /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/forks.pm:799 >>> STACK: /homes/biertank/kai/maker/bin/maker:914 >>> ----------------------------------------------------------- >>> --> rank=2, hostname=bioinf.uni-leipzig.de >>> ERROR: Failed while gathering ab-init output files >>> ERROR: Chunk failed at level:1, tier_type:2 >>> FAILED 
CONTIG:scaffold20_cov246 >>> >>> ERROR: Chunk failed at level:4, tier_type:0 >>> FAILED CONTIG:scaffold20_cov246 >>> >>> examining contents of the fasta file and run log >>> >>> >>> >>> I have tried to rerun "perl ./Build.PL" and then "./Build install" several times using different versions of Perl. To install the required Perl modules I have used "./Build installdeps" and I also tried installing the dependencies manually via CPAN - to no avail. >>> >>> Any idea? >>> >>> Thank you! >>> Kai >>> >>> _______________________________________________ >>> maker-devel mailing list >>> maker-devel at yandell-lab.org >>> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org >> > > > _______________________________________________ > maker-devel mailing list > maker-devel at yandell-lab.org > http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org From carsonhh at gmail.com Tue Apr 19 09:19:10 2016 From: carsonhh at gmail.com (Carson Holt) Date: Tue, 19 Apr 2016 09:19:10 -0600 Subject: [maker-devel] Maker Failed Contigs, Bio::Root::Exception In-Reply-To: <8B4352FC-113E-45EC-B7D4-6983B8FF2815@gmail.com> References: <5714DD6A.1080309@ecolevol.de> <5249E98C-9902-4369-9B68-95F3662B61CE@gmail.com> <57162665.7070409@ecolevol.de> <8B4352FC-113E-45EC-B7D4-6983B8FF2815@gmail.com> Message-ID: <05666A0C-902E-4493-B107-9BE1BAF8A507@gmail.com> FYI. BioPerl-live is not broken. Rather it is under active development and as such cannot be considered stable. ?Carson > On Apr 19, 2016, at 9:18 AM, Carson Holt wrote: > > Intall as so ?> > cpan> install Bio::Perl > > But it sounds like you?ve got a proper version now. Most likely you had a non-cpan version of BioPerl installed. The version it gave met the ./Build dependency requirements, but it was really a broke version. This happens if you have BioPerl-live installed for example. > > ?Carson > > > >> On Apr 19, 2016, at 6:36 AM, Kai Kamm wrote: >> >> Hello, >> >> now it seems to work. 
I (re)installed BioPerl like so: >> >> ------------------------------------------------------------ >> find the name of the latest BioPerl package: >> >> cpan>d /bioperl/ >> >> .... >> >> Distribution CJFIELDS/BioPerl-1.6.901.tar.gz >> Distribution CJFIELDS/BioPerl-1.6.922.tar.gz >> Distribution CJFIELDS/BioPerl-1.6.924.tar.gz >> >> And install the most recent: >> >> cpan>install CJFIELDS/BioPerl-1.6.924.tar.gz >> ---------------------------------------------------------------- >> >> Produced some error messages during install, but Maker now works. >> >> Just wonder why the BioPerl installation did not work properly with neither "./Build installdeps" nor via cpan>install Bundle::BioPerl. >> >> And why it worked this way on my desktop. >> >> Anyway >> Thanks! >> >> >> Am 18.04.2016 um 22:30 schrieb Carson Holt: >>> Try updating BioPerl (use the CPAN version and not the BioPerl-live version because it will fail). Also use MAKER version 2.31.8 and not the 3.00.0-beta version. >>> >>> Then make sure there is not error further up. What you are seeing may be a snowball effect of the real error which could be several screens back in the text. If you are using GFF3 files as input then your format is probably incorrect. >>> >>> ?Carson >>> >>> >>>> On Apr 18, 2016, at 7:13 AM, Kai Kamm wrote: >>>> >>>> Hi, >>>> >>>> while I have no problem running Maker on my desktop computer (Ubuntu 14.04 LTS), I always get the error below (for all contigs) when I try to run Maker on a server. 
>>>> >>>> >>>> ------------- EXCEPTION: Bio::Root::Exception ------------- >>>> MSG: Did not specify a Query End or Query Begin >>>> STACK: Error::throw >>>> STACK: Bio::Root::Root::throw /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Root/Root.pm:449 >>>> STACK: Bio::Search::HSP::GenericHSP::_query_seq_feature /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Search/HSP/GenericHSP.pm:1525 >>>> STACK: Bio::Search::HSP::GenericHSP::query /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Search/HSP/GenericHSP.pm:956 >>>> STACK: Bio::Search::HSP::HSPI::start /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Search/HSP/HSPI.pm:504 >>>> STACK: PhatHit_utils::add_offset /homes/biertank/kai/maker/bin/../lib/PhatHit_utils.pm:1462 >>>> STACK: GI::parse_abinit_file /homes/biertank/kai/maker/bin/../lib/GI.pm:1199 >>>> STACK: Process::MpiChunk::_go /homes/biertank/kai/maker/bin/../lib/Process/MpiChunk.pm:1469 >>>> STACK: Process::MpiChunk::run /homes/biertank/kai/maker/bin/../lib/Process/MpiChunk.pm:341 >>>> STACK: main::node_thread /homes/biertank/kai/maker/bin/maker:1454 >>>> STACK: threads::new /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/forks.pm:799 >>>> STACK: /homes/biertank/kai/maker/bin/maker:914 >>>> ----------------------------------------------------------- >>>> --> rank=2, hostname=bioinf.uni-leipzig.de >>>> ERROR: Failed while gathering ab-init output files >>>> ERROR: Chunk failed at level:1, tier_type:2 >>>> FAILED CONTIG:scaffold20_cov246 >>>> >>>> ERROR: Chunk failed at level:4, tier_type:0 >>>> FAILED CONTIG:scaffold20_cov246 >>>> >>>> examining contents of the fasta file and run log >>>> >>>> >>>> >>>> I have tried to rerun "perl ./Build.PL" and then "./Build install" several times using different versions of Perl. To install the required Perl modules I have used "./Build installdeps" and I also tried installing the dependencies manually via CPAN - to no avail. >>>> >>>> Any idea? >>>> >>>> Thank you! 
>>>> Kai >>>> >>>> _______________________________________________ >>>> maker-devel mailing list >>>> maker-devel at yandell-lab.org >>>> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org >>> >> >> >> _______________________________________________ >> maker-devel mailing list >> maker-devel at yandell-lab.org >> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org > From cjfields at illinois.edu Tue Apr 19 10:11:06 2016 From: cjfields at illinois.edu (Fields, Christopher J) Date: Tue, 19 Apr 2016 16:11:06 +0000 Subject: [maker-devel] Maker Failed Contigs, Bio::Root::Exception In-Reply-To: <05666A0C-902E-4493-B107-9BE1BAF8A507@gmail.com> References: <5714DD6A.1080309@ecolevol.de> <5249E98C-9902-4369-9B68-95F3662B61CE@gmail.com> <57162665.7070409@ecolevol.de> <8B4352FC-113E-45EC-B7D4-6983B8FF2815@gmail.com> <05666A0C-902E-4493-B107-9BE1BAF8A507@gmail.com> Message-ID: Yup. Though Bio-Root has been added back (which IIRC was the main problem with breakage on the master branch). chris > On Apr 19, 2016, at 10:19 AM, Carson Holt wrote: > > FYI. BioPerl-live is not broken. Rather it is under active development and as such cannot be considered stable. > > ?Carson > >> On Apr 19, 2016, at 9:18 AM, Carson Holt wrote: >> >> Intall as so ?> >> cpan> install Bio::Perl >> >> But it sounds like you?ve got a proper version now. Most likely you had a non-cpan version of BioPerl installed. The version it gave met the ./Build dependency requirements, but it was really a broke version. This happens if you have BioPerl-live installed for example. >> >> ?Carson >> >> >> >>> On Apr 19, 2016, at 6:36 AM, Kai Kamm wrote: >>> >>> Hello, >>> >>> now it seems to work. I (re)installed BioPerl like so: >>> >>> ------------------------------------------------------------ >>> find the name of the latest BioPerl package: >>> >>> cpan>d /bioperl/ >>> >>> .... 
>>> >>> Distribution CJFIELDS/BioPerl-1.6.901.tar.gz >>> Distribution CJFIELDS/BioPerl-1.6.922.tar.gz >>> Distribution CJFIELDS/BioPerl-1.6.924.tar.gz >>> >>> And install the most recent: >>> >>> cpan>install CJFIELDS/BioPerl-1.6.924.tar.gz >>> ---------------------------------------------------------------- >>> >>> Produced some error messages during install, but Maker now works. >>> >>> Just wonder why the BioPerl installation did not work properly with neither "./Build installdeps" nor via cpan>install Bundle::BioPerl. >>> >>> And why it worked this way on my desktop. >>> >>> Anyway >>> Thanks! >>> >>> >>> Am 18.04.2016 um 22:30 schrieb Carson Holt: >>>> Try updating BioPerl (use the CPAN version and not the BioPerl-live version because it will fail). Also use MAKER version 2.31.8 and not the 3.00.0-beta version. >>>> >>>> Then make sure there is not error further up. What you are seeing may be a snowball effect of the real error which could be several screens back in the text. If you are using GFF3 files as input then your format is probably incorrect. >>>> >>>> ?Carson >>>> >>>> >>>>> On Apr 18, 2016, at 7:13 AM, Kai Kamm wrote: >>>>> >>>>> Hi, >>>>> >>>>> while I have no problem running Maker on my desktop computer (Ubuntu 14.04 LTS), I always get the error below (for all contigs) when I try to run Maker on a server. 
>>>>> >>>>> >>>>> ------------- EXCEPTION: Bio::Root::Exception ------------- >>>>> MSG: Did not specify a Query End or Query Begin >>>>> STACK: Error::throw >>>>> STACK: Bio::Root::Root::throw /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Root/Root.pm:449 >>>>> STACK: Bio::Search::HSP::GenericHSP::_query_seq_feature /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Search/HSP/GenericHSP.pm:1525 >>>>> STACK: Bio::Search::HSP::GenericHSP::query /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Search/HSP/GenericHSP.pm:956 >>>>> STACK: Bio::Search::HSP::HSPI::start /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/Bio/Search/HSP/HSPI.pm:504 >>>>> STACK: PhatHit_utils::add_offset /homes/biertank/kai/maker/bin/../lib/PhatHit_utils.pm:1462 >>>>> STACK: GI::parse_abinit_file /homes/biertank/kai/maker/bin/../lib/GI.pm:1199 >>>>> STACK: Process::MpiChunk::_go /homes/biertank/kai/maker/bin/../lib/Process/MpiChunk.pm:1469 >>>>> STACK: Process::MpiChunk::run /homes/biertank/kai/maker/bin/../lib/Process/MpiChunk.pm:341 >>>>> STACK: main::node_thread /homes/biertank/kai/maker/bin/maker:1454 >>>>> STACK: threads::new /homes/biertank/kai/bin/ActivePerl-5.20/site/lib/forks.pm:799 >>>>> STACK: /homes/biertank/kai/maker/bin/maker:914 >>>>> ----------------------------------------------------------- >>>>> --> rank=2, hostname=bioinf.uni-leipzig.de >>>>> ERROR: Failed while gathering ab-init output files >>>>> ERROR: Chunk failed at level:1, tier_type:2 >>>>> FAILED CONTIG:scaffold20_cov246 >>>>> >>>>> ERROR: Chunk failed at level:4, tier_type:0 >>>>> FAILED CONTIG:scaffold20_cov246 >>>>> >>>>> examining contents of the fasta file and run log >>>>> >>>>> >>>>> >>>>> I have tried to rerun "perl ./Build.PL" and then "./Build install" several times using different versions of Perl. To install the required Perl modules I have used "./Build installdeps" and I also tried installing the dependencies manually via CPAN - to no avail. >>>>> >>>>> Any idea? >>>>> >>>>> Thank you! 
>>>>> Kai
>>>>>
>>>>> _______________________________________________
>>>>> maker-devel mailing list
>>>>> maker-devel at yandell-lab.org
>>>>> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org

From bmoore at genetics.utah.edu Tue Apr 19 15:36:35 2016
From: bmoore at genetics.utah.edu (Barry Moore)
Date: Tue, 19 Apr 2016 21:36:35 +0000
Subject: [maker-devel] A way to compare 2 annotation runs?
In-Reply-To: <148F7C5C-38D9-4F2C-BA0C-010E133B7F34@gmail.com>
References: <57161FB2.30901@students.uni-mainz.de> <148F7C5C-38D9-4F2C-BA0C-010E133B7F34@gmail.com>
Message-ID:

The Sequence Ontology provides some tools for this:

SOBAcl has some pre-configured reports/graphs with some flexibility to modify their content/layout.
https://github.com/The-Sequence-Ontology/SOBA

This simple example provides a table for two GFF3 files of the count of feature types:

    SOBAcl --columns file --rows type --data type --data_type count \
        data/dmel-all-r5.30_0001000.gff data/dmel-all-r5.30_0010000.gff

More complex examples are available in the test file SOBA/t/sobacl_test.sh

The GAL library is a Perl library that works well with MAKER output and other valid GFF3 documents. It has some scripts that would provide metrics along the lines of what you're looking for, but it is primarily a programming library to make it easy to roll your own.
https://github.com/The-Sequence-Ontology/GAL

If you're OK with a little bit of Perl code, then by modifying the synopsis code in the README a bit you can easily produce the splice complexity metrics described here (http://www.ncbi.nlm.nih.gov/pubmed/19236712):

    use GAL::Annotation;

    # Load the annotation (GFF3) together with its sequence (FASTA)
    my $annot    = GAL::Annotation->new(qw(file.gff file.fasta));
    my $features = $annot->features;

    # Print the ID and splice complexity of every gene feature
    my $genes = $features->search( {type => 'gene'} );
    while (my $gene = $genes->next) {
        print $gene->feature_id . "\t";
        print $gene->splice_complexity . "\n";
    }

Hope that helps,

Barry

On Apr 19, 2016, at 9:08 AM, Carson Holt wrote:

> I'm going to ask Michael Campbell to answer this. He wrote a protocols paper that will help.
>
> --Carson
>
> On Apr 19, 2016, at 6:08 AM, Florian wrote:
>
>> Hello All,
>>
>> We ran MAKER on a newly assembled genome for 3 iterations, since 2 seems to be the recommended standard and while on holiday I just ran it a third time. Now I want to compare the results of the iterations to see where the annotation (hopefully) improved/changed, but I can't really come up with a clever way to do this.
>>
>> I reckon this has to be an often-solved problem, though I couldn't find a solution except an older entry in this mailing list, and that wasn't helpful.
>>
>> So how are people assessing the quality of a MAKER run? How do you say one run was 'better' than another?
>>
>> best regards & thanks for your input,
>> Florian

_______________________________________________
maker-devel mailing list
maker-devel at yandell-lab.org
http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org
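
For readers without SOBAcl at hand, the per-type count table described in the message above can be approximated with a few lines of Python. This is only a rough sketch and not how SOBAcl itself works: the function names `count_feature_types` and `compare_counts` are made up for illustration, and the code assumes tab-separated, nine-column GFF3 feature lines with the feature type in column 3.

```python
from collections import Counter


def count_feature_types(gff_lines):
    """Count GFF3 feature types (column 3) from an iterable of lines."""
    counts = Counter()
    for line in gff_lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # embedded sequence follows; no more feature lines
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        fields = line.split("\t")
        if len(fields) == 9:
            counts[fields[2]] += 1
    return counts


def compare_counts(counts_a, counts_b):
    """Return sorted (type, count_a, count_b) rows for types seen in either run."""
    types = sorted(set(counts_a) | set(counts_b))
    return [(t, counts_a.get(t, 0), counts_b.get(t, 0)) for t in types]
```

Opening two GFF3 files, passing each file handle to `count_feature_types`, and printing the rows from `compare_counts` gives a side-by-side table of feature counts for two runs.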
From MEC at stowers.org Tue Apr 19 15:44:04 2016
From: MEC at stowers.org (Cook, Malcolm)
Date: Tue, 19 Apr 2016 21:44:04 +0000
Subject: [maker-devel] A way to compare 2 annotation runs?
In-Reply-To:
References: <57161FB2.30901@students.uni-mainz.de> <148F7C5C-38D9-4F2C-BA0C-010E133B7F34@gmail.com>
Message-ID:

Just a quick thought.

The smallest summary of what you're after might be the Jaccard statistic between your annotations as computed by bedtools:
http://bedtools.readthedocs.org/en/latest/content/tools/jaccard.html
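
To make the Jaccard suggestion above concrete, here is a minimal Python sketch of the statistic itself: the number of bases covered by both interval sets divided by the number of bases covered by either. It is an illustration under stated assumptions, not a reimplementation of bedtools. The intervals are assumed to be half-open (start, end) pairs on a single scaffold, and all function names are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def covered_bases(intervals):
    """Number of bases covered by at least one interval."""
    return sum(e - s for s, e in merge_intervals(intervals))


def intersection_bases(a, b):
    """Number of bases covered by both interval sets."""
    a, b = merge_intervals(a), merge_intervals(b)
    i = j = total = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            total += hi - lo
        # advance whichever interval ends first
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def jaccard(a, b):
    """Shared bases divided by bases covered by either set (0.0 if both are empty)."""
    inter = intersection_bases(a, b)
    union = covered_bases(a) + covered_bases(b) - inter
    return inter / union if union else 0.0
```

A value of 1.0 means the two sets cover exactly the same bases and 0.0 means they share none; one minus this value can be read as a distance between two annotation runs.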
From fdolze at students.uni-mainz.de Mon Apr 25 09:05:58 2016
From: fdolze at students.uni-mainz.de (Florian)
Date: Mon, 25 Apr 2016 17:05:58 +0200
Subject: [maker-devel] A way to compare 2 annotation runs?
In-Reply-To: <3F81B34E-B7FC-4E37-AFA2-514AB2A397F1@cshl.edu>
References: <57161FB2.30901@students.uni-mainz.de> <148F7C5C-38D9-4F2C-BA0C-010E133B7F34@gmail.com> <571E05B8.5080508@students.uni-mainz.de> <3F81B34E-B7FC-4E37-AFA2-514AB2A397F1@cshl.edu>
Message-ID: <571E3256.90705@students.uni-mainz.de>

Hi Mike,

We have run MAKER with keep_preds=0. For completeness I attached the options file we used. For the first run we used a SNAP model trained on CEGMA data, plus GeneMark and Augustus trained with their web service, and then iterated on the results.

We expect around 17,000-18,000 genes, but our annotation contains ~12.5k according to SOBAcl. If I remove the ~40% with AED values of 1 I will be left with very few compared to the expected number.

type X file type (count)
=========================================================================================================
|                        |../v2_second_round_functional_blast.gff|../v2_third_round_functional_blast.gff|
=========================================================================================================
|CDS                     |                                  63953|                                 65160|
+------------------------+---------------------------------------+--------------------------------------+
|contig                  |                                   5292|                                  5292|
+------------------------+---------------------------------------+--------------------------------------+
|exon                    |                                  60381|                                 61233|
+------------------------+---------------------------------------+--------------------------------------+
|expressed_sequence_match|                                 275160|                                275160|
+------------------------+---------------------------------------+--------------------------------------+
|five_prime_UTR          |                                   9424|                                  8764|
+------------------------+---------------------------------------+--------------------------------------+
|gene                    |                                  12654|                                 12235|
+------------------------+---------------------------------------+--------------------------------------+
|mRNA                    |                                  13698|                                 13137|
+------------------------+---------------------------------------+--------------------------------------+
|match                   |                                 146111|                                136852|
+------------------------+---------------------------------------+--------------------------------------+
|match_part              |                                1704978|                               1697601|
+------------------------+---------------------------------------+--------------------------------------+
|protein_match           |                                 421814|                                421814|
+------------------------+---------------------------------------+--------------------------------------+
|three_prime_UTR         |                                   6894|                                  6325|
---------------------------------------------------------------------------------------------------------

regards,
Florian

On 25.04.2016 16:16, Campbell, Michael wrote:
> Hi Florian,
>
> You're not off topic here. I've attached the paper.
>
> Looking at the plot you sent, I'm guessing that there is a red dot right underneath the turquoise dot at (1,1); that would be consistent with the compare annotation script output. Do you have keep_preds=1 set in the maker_opts.ctl file? If so, that would explain the abundance of AED=1 annotations. When keep_preds is set to 1, all of the gene predictions are reported as gene models; when keep_preds is set to 0, only the models with evidence support are reported. Also, how many genes are you expecting and how many are you getting?
>
> The paper I attached goes over different approaches to building final gene sets. The plot attached suggests to me that you have a bunch of unsupported gene models that need to be cleaned out. I will commonly filter out any gene model with an AED of 1 unless it has a protein family domain. This will almost certainly bring the fraction of annotated gene models with an AED <0.5 up to around 90% or more.
>
> As annotations improve you do usually see fewer total genes but they are longer.
>
> One of the best ways to get a feel for annotation quality is to load the annotations into a browser like Apollo or JBrowse and look at a few of your favorite genes.
>
> Thanks,
> Mike
>
>
> On Apr 25, 2016, at 7:55 AM, Florian wrote:
>
> Hello All,
>
> First off, thank you all for your input! I took a look at all your suggestions and have some questions:
>
> The SOBAcl tool is nice but I can't seem to find a way to get to the AED values MAKER produces. For example, here is a line from my GFF file:
>
> scaffold2278_size3634 maker mRNA 124 2128 . - . ID=CRIP_012390-RA;Parent=CRIP_012390;Name=CRIP_012390-RA;Alias=maker-scaffold2278_size3634-augustus-gene-0.3-mRNA-1;_AED=0.16;_QI=0|1|0.33|1|1|1|3|0|574;_eAED=0.16;Note=Similar to Tbc1d25: TBC1 domain family member 25 (Mus musculus);
> Notice the _AED entry is in the 9th "field", combined with all the other descriptive data. Is there a way to get to this? The information about the number and mean/distribution of gene lengths, while certainly valuable, is hard for me to interpret. How would one classify improvement? More genes annotated? Fewer genes but longer averages?
>
> For the moment I will take a look at GAL, though Perl is not my strongest language.
>
>
> For the scripts Michael provided I have attached the results. It would be great if you could send me a PDF version of the paper you mentioned.
>
> The comparison script lists SN/SP/AC with >98%, which indicates there should be no big changes between annotations, right? But the cumulative AED graph shows a LOT of entries have an AED value of 1, which would indicate the exact opposite?
>
> You said 95% with less than 0.5 AED would be pretty good, so only ~55% would mean this is a pretty bad annotation?
>
> I am not sure if this is maybe too far off topic for the MAKER mailing list, but thank you for any clarification / input.
>
>
>
> kind regards,
> Florian
>
> On 20.04.2016 15:16, Campbell, Michael wrote:
>
> I suspect the Jaccard distance would let you see the annotation sets converging over iterations. The distance between run one and run three should be greater than the distance between run one and two or run two and three.
>
> MAKER calculates a modified Jaccard distance between the MAKER-generated gene models and the aligned evidence, called Annotation Edit Distance or AED. Comparing the distribution of AEDs between annotations is a way to tell which annotation set matches the evidence best. As a rule of thumb, an annotation set is pretty good if more than ~95% of the annotations have an AED less than 0.5.
>
> There is an accessory script in the MAKER bin called AED_cdf_generator.pl that helps in comparing AED scores. This script is mentioned in the protocols paper Carson mentioned. This paper also describes using protein family domains and homology to manually curated proteins in Swiss-Prot as quality metrics. Here is a link to the paper. Let me know if you need me to send you a PDF.
> http://onlinelibrary.wiley.com/doi/10.1002/0471250953.bi0411s48/abstract
>
> I also have a "use at your own risk" script on GitHub that I use to compare MAKER runs two at a time. The script is called compare_annotations_3.2.pl. This particular script has had a long evolution, so it is a little hard to follow the code, but it might be helpful.
> https://github.com/mscampbell/Genome_annotation
>
> The SOBA tool that Barry mentioned is a lot more flexible, and if you are familiar with Perl the GAL library does a lot of heavy lifting for you.
>
> Mike
> On Apr 19, 2016, at 5:44 PM, Cook, Malcolm wrote:
>
> Just a quick thought
>
> The smallest summary of what you're after might be the Jaccard statistic between your annotations as computed by bedtools: http://bedtools.readthedocs.org/en/latest/content/tools/jaccard.html
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts_run2.log
Type: text/x-log
Size: 4937 bytes
Desc: not available
URL:

From carsonhh at gmail.com Mon Apr 25 09:30:24 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 25 Apr 2016 09:30:24 -0600
Subject: [maker-devel] A way to compare 2 annotation runs?
In-Reply-To: <571E3256.90705@students.uni-mainz.de>
References: <57161FB2.30901@students.uni-mainz.de> <148F7C5C-38D9-4F2C-BA0C-010E133B7F34@gmail.com> <571E05B8.5080508@students.uni-mainz.de> <3F81B34E-B7FC-4E37-AFA2-514AB2A397F1@cshl.edu> <571E3256.90705@students.uni-mainz.de>
Message-ID: <8287C01C-93C2-4BCA-9483-4EEE0E584ACD@gmail.com>

If you're running with keep_preds=0, then you either passed in models with model_gff (always kept even without evidence support), or you are parsing out the AED of non-gene reference models from the GFF3 when building your CDF graph. If that is the case, make sure you only pull AED off of features labeled as mRNA in column 3 (the type column) of the GFF3 and not match features.

--Carson

> On Apr 25, 2016, at 9:05 AM, Florian wrote:
>
>
> Hi Mike,
>
> We have run MAKER with keep_preds=0. For completeness I attached the options file we used. For the first run we used a SNAP model trained on CEGMA data, plus GeneMark and Augustus trained with their web service, and then iterated on the results.
>
> We expect around 17,000-18,000 genes, but our annotation contains ~12.5k according to SOBAcl. If I remove the ~40% with AED values of 1 I will be left with very few compared to the expected number.
From fdolze at students.uni-mainz.de Mon Apr 25 11:00:15 2016
From: fdolze at students.uni-mainz.de (Dolze, Florian)
Date: Mon, 25 Apr 2016 17:00:15 +0000
Subject: [maker-devel] A way to compare 2 annotation runs?
In-Reply-To: <8287C01C-93C2-4BCA-9483-4EEE0E584ACD@gmail.com>
References: <57161FB2.30901@students.uni-mainz.de> <148F7C5C-38D9-4F2C-BA0C-010E133B7F34@gmail.com> <571E05B8.5080508@students.uni-mainz.de> <3F81B34E-B7FC-4E37-AFA2-514AB2A397F1@cshl.edu> <571E3256.90705@students.uni-mainz.de>, <8287C01C-93C2-4BCA-9483-4EEE0E584ACD@gmail.com>
Message-ID: <1B00E9C3-490A-4C06-A188-1F9EBC02F680@students.uni-mainz.de>

This might be the case; I simply ran the script on my complete output GFF with all features in it, without manually filtering for only mRNA.

On a side note regarding keep_preds: if I wanted to call genes somewhat less stringently because I am expecting to find more, would I set this to e.g. 0.5 to increase the number of genes called?
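
The filtering step discussed above (pull `_AED` only from mRNA features, not from match features) can be sketched in Python as follows. This is a hedged illustration, not MAKER's AED_cdf_generator.pl: the function names are hypothetical, and the code assumes tab-separated GFF3 with the feature type in column 3 and `_AED=<number>` among the semicolon-separated attributes in column 9, as in the example line quoted earlier in the thread.

```python
def mrna_aed_values(gff_lines):
    """Collect _AED values from mRNA features only (type is GFF3 column 3)."""
    values = []
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence follows; no more feature lines
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "mRNA":
            continue  # skip match, protein_match, gene, exon, ...
        for attr in fields[8].split(";"):
            key, sep, value = attr.strip().partition("=")
            if sep and key == "_AED":  # exact key, so _eAED is ignored
                values.append(float(value))
    return values


def fraction_below(values, threshold):
    """Fraction of values strictly below the threshold (0.0 for an empty list)."""
    if not values:
        return 0.0
    return sum(v < threshold for v in values) / len(values)
```

Calling `fraction_below(mrna_aed_values(open_file), 0.5)` on each run's GFF3 gives the number to hold against the "~95% below 0.5" rule of thumb mentioned earlier in the thread.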
-Florian

> On 25.04.2016 at 17:30, Carson Holt wrote:
>
> If you're running with keep_preds=0, then you either passed in models with model_gff (always kept even without evidence support), or you are parsing out the AED of non-gene reference models from the GFF3 when building your CDF graph. If that is the case, make sure you only pull AED off of features labeled as mRNA in column 2 of the GFF3 and not match features.
>
> --Carson
>
>> On Apr 25, 2016, at 9:05 AM, Florian wrote:
>>
>> Hi Mike,
>>
>> We have run MAKER with keep_preds=0. For completeness I attached the options file we used. We used a SNAP model trained on CEGMA data, plus GeneMark and Augustus trained with their web service, for the first run and then iterated on the results.
>>
>> We expect around 17,000-18,000 genes, but our annotation contains ~12.5k according to SOBAcl. If I remove the ~40% with AED values of 1, I will be left with very few compared to the expected number.
>>
>> type X file type (count)
>> +--------------------------+-----------------------------------------+----------------------------------------+
>> | type                     | ../v2_second_round_functional_blast.gff | ../v2_third_round_functional_blast.gff |
>> +--------------------------+-----------------------------------------+----------------------------------------+
>> | CDS                      |                                   63953 |                                  65160 |
>> | contig                   |                                    5292 |                                   5292 |
>> | exon                     |                                   60381 |                                  61233 |
>> | expressed_sequence_match |                                  275160 |                                 275160 |
>> | five_prime_UTR           |                                    9424 |                                   8764 |
>> | gene                     |                                   12654 |                                  12235 |
>> | mRNA                     |                                   13698 |                                  13137 |
>> | match                    |                                  146111 |                                 136852 |
>> | match_part               |                                 1704978 |                                1697601 |
>> | protein_match            |                                  421814 |                                 421814 |
>> | three_prime_UTR          |                                    6894 |                                   6325 |
>> +--------------------------+-----------------------------------------+----------------------------------------+
>>
>> regards,
>> Florian
>>
>>> On 25.04.2016 16:16, Campbell, Michael wrote:
>>>
>>> Hi Florian,
>>>
>>> You're not off topic here. I've attached the paper.
>>>
>>> Looking at the plot you sent, I'm guessing that there is a red dot right underneath the turquoise dot at (1,1); that would be consistent with the compare annotation script output. Do you have keep_preds=1 set in the maker_opts.ctl file? If so, that would explain the abundance of AED=1 annotations. When keep_preds is set to 1, all of the gene predictions are reported as gene models; when keep_preds is set to 0, only the models with evidence support are reported. Also, how many genes are you expecting and how many are you getting?
>>>
>>> The paper I attached goes over different approaches to building final gene sets. The plot attached suggests to me that you have a bunch of unsupported gene models that need to be cleaned out. I will commonly filter out any gene model with an AED of 1 unless it has a protein family domain. This will almost certainly bring the fraction of annotated gene models with an AED <0.5 up to around 90% or more.
>>>
>>> As annotations improve you do usually see fewer total genes, but they are longer.
>>>
>>> One of the best ways to get a feel for annotation quality is to load the annotations into a browser like Apollo or JBrowse and look at a few of your favorite genes.
>>>
>>> Thanks,
>>> Mike
>>>
>>> On Apr 25, 2016, at 7:55 AM, Florian wrote:
>>>
>>> Hello All,
>>>
>>> First off, thank you all for your input! I took a look at all your suggestions and have some questions:
>>>
>>> The SOBAcl tool is nice, but I can't seem to find a way to get to the AED values MAKER produces. For example, here is a line from my GFF file:
>>>
>>> scaffold2278_size3634 maker mRNA 124 2128 . - . ID=CRIP_012390-RA;Parent=CRIP_012390;Name=CRIP_012390-RA;Alias=maker-scaffold2278_size3634-augustus-gene-0.3-mRNA-1;_AED=0.16;_QI=0|1|0.33|1|1|1|3|0|574;_eAED=0.16;Note=Similar to Tbc1d25: TBC1 domain family member 25 (Mus musculus);
>>>
>>> Notice the _AED entry is in the 9th "field", combined with all the other descriptive data. Is there a way to get to this? The information about the number and the mean/distribution of gene lengths, while certainly valuable, is hard for me to interpret. How would one classify improvement? More genes annotated? Fewer genes but longer averages?
>>>
>>> For the moment I will take a look at GAL, though Perl is not my strongest language.
>>>
>>> For the scripts Michael provided, I have attached the results. It would be great if you could send me a PDF version of the paper you mentioned.
>>>
>>> The comparison script lists SN/SP/AC with >98%, which indicates there should be no big changes between annotations, right? But the cumulative AED graph shows that a LOT of entries have an AED value of 1, which would indicate the exact opposite?
>>>
>>> You said 95% with less than 0.5 AED would be pretty good, so only ~55% would mean this is a pretty bad annotation?
>>>
>>> I am not sure if this is maybe too far off topic for the maker mailing list, but thank you for any clarification / input.
>>>
>>> kind regards,
>>> Florian
>>>
>>> On 20.04.2016 15:16, Campbell, Michael wrote:
>>>
>>> I suspect the Jaccard distance would let you see the annotation sets converging over iterations. The distance between run one and run three should be greater than the distance between run one and two or run two and three.
>>>
>>> MAKER calculates a modified Jaccard distance, called Annotation Edit Distance or AED, between the MAKER-generated gene models and the aligned evidence. Comparing the distribution of AEDs between annotations is a way to tell which annotation set matches the evidence best. As a rule of thumb, an annotation set is pretty good if more than ~95% of the annotations have an AED less than 0.5.
>>>
>>> There is an accessory script in the MAKER bin called AED_cdf_generator.pl that helps in comparing AED scores. This script is mentioned in the protocols paper Carson mentioned. This paper also describes using protein family domains and homology to manually curated proteins in SwissProt as quality metrics. Here is a link to the paper. Let me know if you need me to send you a PDF.
>>> http://onlinelibrary.wiley.com/doi/10.1002/0471250953.bi0411s48/abstract
>>>
>>> I also have a "use at your own risk" script on GitHub that I use to compare MAKER runs two at a time. The script is called compare_annotations_3.2.pl. This particular script has had a long evolution, so the code is a little hard to follow, but it might be helpful.
>>> https://github.com/mscampbell/Genome_annotation
>>>
>>> The SOBA tool that Barry mentioned is a lot more flexible, and if you are familiar with Perl the GAL library does a lot of heavy lifting for you.
>>>
>>> Mike
>>>
>>> On Apr 19, 2016, at 5:44 PM, Cook, Malcolm wrote:
>>>
>>> Just a quick thought:
>>>
>>> The smallest summary of what you're after might be the Jaccard difference between your annotations as computed by bedtools: http://bedtools.readthedocs.org/en/latest/content/tools/jaccard.html
>>>
>>> From: maker-devel [mailto:maker-devel-bounces at yandell-lab.org] On Behalf Of Barry Moore
>>> Sent: Tuesday, April 19, 2016 4:37 PM
>>> To: Florian; maker-devel
>>> Cc: Campbell, Michael
>>> Subject: Re: [maker-devel] A way to compare 2 annotation runs?
>>>
>>> The Sequence Ontology provides some tools for this:
>>>
>>> SOBAcl has some pre-configured reports/graphs with some flexibility to modify their content/layout.
>>> https://github.com/The-Sequence-Ontology/SOBA
>>>
>>> This simple example provides a table, for two GFF3 files, of the count of feature types:
>>>
>>>     SOBAcl --columns file --rows type --data type --data_type count \
>>>         data/dmel-all-r5.30_0001000.gff data/dmel-all-r5.30_0010000.gff
>>>
>>> More complex examples are available in the test file SOBA/t/sobacl_test.sh
>>>
>>> The GAL library is a Perl library that works well with MAKER output and other valid GFF3 documents. It has some scripts that would provide metrics along the lines of what you're looking for, but it is primarily a programming library to make it easy to roll your own.
>>> https://github.com/The-Sequence-Ontology/GAL
>>>
>>> If you're OK with a little bit of Perl code, then by modifying the synopsis code in the README a bit, the splice complexity metrics described here (http://www.ncbi.nlm.nih.gov/pubmed/19236712) are easy to produce:
>>>
>>>     use GAL::Annotation;
>>>
>>>     my $annot = GAL::Annotation->new(qw(file.gff file.fasta));
>>>     my $features = $annot->features;
>>>
>>>     my $genes = $features->search( {type => 'gene'} );
>>>     while (my $gene = $genes->next) {
>>>         print $gene->feature_id . "\t";
>>>         print $gene->splice_complexity . "\n";
>>>     }
>>>
>>> Hope that helps,
>>>
>>> Barry
>>>
>>> On Apr 19, 2016, at 9:08 AM, Carson Holt wrote:
>>>
>>> I'm going to ask Michael Campbell to answer this. He wrote a protocols paper that will help.
>>>
>>> --Carson
>>>
>>> On Apr 19, 2016, at 6:08 AM, Florian wrote:
>>>
>>> Hello All,
>>>
>>> We ran MAKER on a newly assembled genome for 3 iterations, since 2 seems to be the recommended standard, and while on holiday I just ran it a third time. Now I want to compare the results of the iterations to see where the annotation (hopefully) improved/changed, but I can't really come up with a clever way to do this.
>>>
>>> I reckon this has to be an often-solved problem, though I couldn't find a solution except an older entry in this mailing list, and that wasn't helpful.
>>>
>>> So how are people assessing the quality of a MAKER run? How do you say one run was 'better' than another?
>>>
>>> best regards & thanks for your input,
>>> Florian

From carsonhh at gmail.com Mon Apr 25 11:03:32 2016
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 25 Apr 2016 11:03:32 -0600
Subject: [maker-devel] A way to compare 2 annotation runs?
In-Reply-To: <1B00E9C3-490A-4C06-A188-1F9EBC02F680@students.uni-mainz.de>
References: <57161FB2.30901@students.uni-mainz.de> <148F7C5C-38D9-4F2C-BA0C-010E133B7F34@gmail.com> <571E05B8.5080508@students.uni-mainz.de> <3F81B34E-B7FC-4E37-AFA2-514AB2A397F1@cshl.edu> <571E3256.90705@students.uni-mainz.de> <8287C01C-93C2-4BCA-9483-4EEE0E584ACD@gmail.com> <1B00E9C3-490A-4C06-A188-1F9EBC02F680@students.uni-mainz.de>
Message-ID:

keep_preds can only be set to 0 or 1 right now. By definition, anything not kept has an AED of 1, so you really only turn it on or off. There has been discussion about doing something more complex for cases where multiple gene predictors are present and support each other, but for now it is an on/off parameter.

--Carson

> On Apr 25, 2016, at 11:00 AM, Dolze, Florian wrote:
>
> This might be the case, I simply used the script on my complete output gff with all features in it without manually filtering for only mRNA.
>
> On a side note regarding keep_preds, if I wanted to call genes somewhat less stringently because I am expecting to find more, would I set this to e.g. 0.5 to increase the number of genes called?
>
> -Florian
>
>> On 25.04.2016 at 17:30, Carson Holt wrote:
>>
>> If you're running with keep_preds=0, then you either passed in models with model_gff (always kept even without evidence support), or you are parsing out the AED of non-gene reference models from the GFF3 when building your CDF graph. If that is the case, make sure you only pull AED off of features labeled as mRNA in column 2 of the GFF3 and not match features.
>>
>> --Carson
>>
>>> On Apr 25, 2016, at 9:05 AM, Florian wrote:
>>>
>>> Hi Mike,
>>>
>>> We have run MAKER with keep_preds=0. For completeness I attached the options file we used. We used a SNAP model trained on CEGMA data, plus GeneMark and Augustus trained with their web service, for the first run and then iterated on the results.
>>>
>>> We expect around 17,000-18,000 genes, but our annotation contains ~12.5k according to SOBAcl.
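
Carson's advice quoted above, to read AED only from mRNA features and not from match features, can be sketched in shell. This is a minimal sketch rather than a MAKER tool: the function name aed_summary is made up for illustration, and it assumes a tab-separated GFF3 whose feature type sits in the third tab-separated field (index 2 when counting from zero) and whose ninth field carries an _AED=<value> attribute, as in the example line Florian posted.

```shell
# Sketch: summarise MAKER AED scores using mRNA features only.
# Assumptions: tab-separated GFF3; feature type in field 3; attributes in
# field 9 with an "_AED=<value>" entry. "aed_summary" is a hypothetical name.
# usage: aed_summary path/to/maker_output.gff
aed_summary() {
    awk -F'\t' '
        $3 == "mRNA" {
            attrs = ";" $9                      # leading ";" so _eAED cannot match
            if (match(attrs, /;_AED=[^;]*/)) {
                aed = substr(attrs, RSTART, RLENGTH)
                sub(/^;_AED=/, "", aed)         # keep only the number
                n++
                if (aed + 0 < 0.5) below++
            }
        }
        END {
            if (n == 0) { print "mRNA=0"; exit }
            printf "mRNA=%d AED<0.5=%d fraction=%.2f\n", n, below, below / n
        }
    ' "$1"
}
```

Running it on the GFF3 from each MAKER iteration gives one line per run, which can be compared against the rule of thumb mentioned earlier in the thread (roughly 95% of annotations with AED below 0.5).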

From bmoore at genetics.utah.edu Mon Apr 25 13:46:23 2016
From: bmoore at genetics.utah.edu (Barry Moore)
Date: Mon, 25 Apr 2016 19:46:23 +0000
Subject: [maker-devel] A way to compare 2 annotation runs?
In-Reply-To: <571E05B8.5080508@students.uni-mainz.de>
References: <57161FB2.30901@students.uni-mainz.de> <148F7C5C-38D9-4F2C-BA0C-010E133B7F34@gmail.com> <571E05B8.5080508@students.uni-mainz.de>
Message-ID: <349A414A-BA65-420E-9A39-5B3583993AB9@genetics.utah.edu>

Hi Florian,

Something like this should work:

    SOBAcl --data +_AED t/data/refseq_short.gff3 --data_type mean

Sorry, this feature was undocumented, and I discovered a bug in it while I was looking at it just now, so you'll need to pull an update from git for it to work correctly.

Basically, if you add a '+' to the value passed to --data, SOBAcl will treat the --data value (with the + removed) as a key to look up the value in the attributes from column 9. So if +_AED is given on the command line, the value of the _AED attribute will be used for the summary statistics. Note that if the attribute is missing for a given feature, 0 is used as the value (which is of course different from treating it as NULL). Also note that if --data_type is count, features that have the given attribute are counted regardless of the value of the attribute.

Just FYI, grabbing those values with a GAL script would look like this (untested):

    use GAL::Annotation;

    my $annot = GAL::Annotation->new(qw(file.gff));
    my $features = $annot->features;

    my $mRNAs = $features->search( {type => 'mRNA'} );
    while (my $mRNA = $mRNAs->next) {
        print $mRNA->feature_id;
        print "\t";
        print $mRNA->attribute_value('_AED');
        print "\n";
    }

B

> On Apr 25, 2016, at 5:55 AM, Florian wrote:
>
> The SOBAcl tool is nice, but I can't seem to find a way to get to the AED values MAKER produces. For example, here is a line from my GFF file:
>
> scaffold2278_size3634 maker mRNA 124 2128 . - . ID=CRIP_012390-RA;Parent=CRIP_012390;Name=CRIP_012390-RA;Alias=maker-scaffold2278_size3634-augustus-gene-0.3-mRNA-1;_AED=0.16;_QI=0|1|0.33|1|1|1|3|0|574;_eAED=0.16;Note=Similar to Tbc1d25: TBC1 domain family member 25 (Mus musculus);
>
> Notice the _AED entry is in the 9th "field", combined with all the other descriptive data. Is there a way to get to this?

From bmoore at genetics.utah.edu Mon Apr 25 21:04:23 2016
From: bmoore at genetics.utah.edu (Barry Moore)
Date: Tue, 26 Apr 2016 03:04:23 +0000
Subject: [maker-devel] BUSCO
References:
Message-ID: <6C6AD04A-CAC0-40CA-B3C3-C42E2D11945A@genetics.utah.edu>

I'm posting this message to the mailing list on behalf of Ian Misner. Ian, sorry your message and subscription request haven't gone through.
The ISP that supports all of our mailing lists including maker is having issues with the mailman software that they can?t seem to resolve, so we currently can?t approve held messages or add new subscribers. We?re in the process of working out a new mailing list option. Thanks for you patience! Begin forwarded message: Hello, Are there any guidelines for using BUSCO to help train MAKER? CEGMA has been discontinued but I used to use the cegma2zff.pl steps to use those proteins as a training step. BUSCO seems to train Augustus but I'm not sure what file to pass from BUSCO to MAKER for this to be properly utilized. I didn't see anything specific about this in the archives. ----- Ian Misner, Ph.D. Computational Genomics Specialist Contractor, Medical Science and Computing, Inc. Bioinformatics and Computational Biosciences Branch (BCBB) NIH/NIAID/OD/OSMO/OCICB 5601 Fishers Lane, Room 4A59 Rockville, MD 20892 Office: 301-761-6208 Mobile: 301-704-0151 Email: ian.misner at nih.gov Web: BCBB Home Page Twitter: @NIAIDBioIT Disclaimer: The information in this e-mail and any of its attachments is confidential and may contain sensitive information. It should not be used by anyone who is not the original intended recipient. If you have received this e-mail in error please inform the sender and delete it from your mailbox or any other storage devices. National Institute of Allergy and Infectious Diseases shall not accept liability for any statements made that are sender's own and not expressly made on behalf of the NIAID by one of its representatives. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From bmoore at genetics.utah.edu Mon Apr 25 21:12:15 2016 From: bmoore at genetics.utah.edu (Barry Moore) Date: Tue, 26 Apr 2016 03:12:15 +0000 Subject: [maker-devel] maker-revel mailing list problems Message-ID: <7157D2ED-8F5A-4B62-BA71-6DF43831FC60@genetics.utah.edu> Hi all, Just wanted to give everyone a heads up that we're experiencing problems with our mailing list server. Our mailing lists are supplied by an external ISP and the lists and support have been great for years, but lately the admin/moderator interface won't allow us to approve any messages flagged for moderation or approve any new subscribers. This won't affect most of you receiving this as all non-moderated traffic seems to be unaffected, but if you notice problems please let one of the moderators know directly: Carson Holt, Michael Campbell, Barry Moore. We're in the process of finding and migrating to a new mailing list server. We'll do our best to minimize disruption and let you know as soon as we have a new system in place. Thanks for your patience. Barry Moore From xvazquezc at gmail.com Mon Apr 25 21:17:46 2016 From: xvazquezc at gmail.com (=?UTF-8?Q?Xabier_V=C3=A1zquez_Campos?=) Date: Tue, 26 Apr 2016 13:17:46 +1000 Subject: [maker-devel] BUSCO In-Reply-To: <6C6AD04A-CAC0-40CA-B3C3-C42E2D11945A@genetics.utah.edu> References: <6C6AD04A-CAC0-40CA-B3C3-C42E2D11945A@genetics.utah.edu> Message-ID: Having installed Augustus, BUSCO will generate the training files in the Augustus species folder. Afterwards you only need to indicate the species profile in the Maker config file as usual. BUSCO developers say that the long run produces a better profile and should be used if you run the program to train Augustus. This is the command I used: python3 BUSCO_v1.1b1.py -f -c 8 --long -o Genus_species -in /PATH/TO/ASSEMBLY/contigs.fa -l /PATH/TO/PROFILE/fungi -m genome On 26 April 2016 at 13:04, Barry Moore wrote: > I'm posting this message to the mailing list on behalf of Ian Misner.
> Ian, sorry your message and subscription request haven't gone through. The > ISP that supports all of our mailing lists including maker is having issues > with the mailman software that they can't seem to resolve, so we currently > can't approve held messages or add new subscribers. We're in the process > of working out a new mailing list option. Thanks for your patience! > > Begin forwarded message: > > Hello, > > Are there any guidelines for using BUSCO to help train MAKER? CEGMA has > been discontinued but I used to use the cegma2zff.pl steps to use those > proteins as a training step. BUSCO seems to train Augustus but I'm not sure > what file to pass from BUSCO to MAKER for this to be properly utilized. I > didn't see anything specific about this in the archives. > ----- > *Ian Misner, Ph.D.* > Computational Genomics Specialist > Contractor, Medical Science and Computing, Inc. > Bioinformatics and Computational Biosciences Branch (BCBB) > NIH/NIAID/OD/OSMO/OCICB > 5601 Fishers Lane, Room 4A59 > Rockville, MD 20892 > Office: 301-761-6208 > Mobile: 301-704-0151 > Email: ian.misner at nih.gov > Web: BCBB Home Page > > Twitter: @NIAIDBioIT > > > Disclaimer: The information in this e-mail and any of its attachments is > confidential and may contain sensitive information. It should not be used > by anyone who is not the original intended recipient. If you have received > this e-mail in error please inform the sender and delete it from your > mailbox or any other storage devices. National Institute of Allergy and > Infectious Diseases shall not accept liability for any statements made that > are sender's own and not expressly made on behalf of the NIAID by one of > its representatives.
> > > > _______________________________________________ > maker-devel mailing list > maker-devel at yandell-lab.org > http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org > > -- Xabier Vázquez-Campos, *PhD* *Research Associate* Water Research Centre School of Civil and Environmental Engineering The University of New South Wales Sydney NSW 2052 AUSTRALIA -------------- next part -------------- An HTML attachment was scrubbed... URL: From dence at genetics.utah.edu Wed Apr 27 12:16:28 2016 From: dence at genetics.utah.edu (Daniel Ence) Date: Wed, 27 Apr 2016 18:16:28 +0000 Subject: [maker-devel] Maker example data for 2013 GMOD summer school In-Reply-To: <8F27CEB4-B16B-4BDC-BA11-5FCCBD05BC3C@ucr.edu> References: <1772AAA1-C6ED-4FCA-B4C9-39F522D3D076@genetics.utah.edu> <8F27CEB4-B16B-4BDC-BA11-5FCCBD05BC3C@ucr.edu> Message-ID: <2572DB54-6C29-483E-AAAB-7626FEE76DFC@genetics.utah.edu> Hi Qihua, In the maker_opts.ctl file there is an option "cpus" which allows you to tell blast to use more than 1 cpu. The comment for the line says that you should not set this higher than 1 when using MPI. I believe that the reason for this is that each MPI process runs blast on its own, so the number of cpus used will be the number of MPI processes times the number of cpus for blast, which can quickly get larger than the number of cpus available. At the same time, it's usually not advisable to use tblastx to align large datasets because of the increased amount of time it takes. Are these RNAseq datasets from another species that you're using tblastx for? Daniel Ence Graduate Student Eccles Institute of Human Genetics University of Utah 15 North 2030 East, Room 2100 Salt Lake City, UT 84112-5330 > On Apr 27, 2016, at 12:06 PM, Qihua Liang wrote: > > Hi, Daniel > > I am using Maker to annotate cowpea genome for a while but now I am wondering if I could use multi-threads instead of single one? It has been running tblastx for such a long time using single thread.
But I couldn't find such settings in documentations to assign multi-threads to run Maker. Is there such an option? > > Thank you > Qihua > > >> On Mar 30, 2016, at 2:17 PM, Daniel Ence wrote: >> >> Hi Qihua, >> >> I believe that most of the data we used in the tutorials are available in the maker/data directory, which is included in all maker distributions. Please let me know if that isn't the case. >> >> ~Daniel >> >> >> Daniel Ence >> Graduate Student >> Eccles Institute of Human Genetics >> University of Utah >> 15 North 2030 East, Room 2100 >> Salt Lake City, UT 84112-5330 >> >>> On Mar 30, 2016, at 3:10 PM, Qihua Liang wrote: >>> >>> Hi Michael and Daniel, >>> >>> I am a graduate student in UC Riverside, and recently I am learning to use Maker for genome annotation. I was trying to find some tutorials to follow and practice on example data, and I found out that you were giving a talk on Maker during 2013 GMOD summer school and the tutorial of that is very detailed. Nice job! >>> >>> But example data under the folder you mentioned as ./maker/maker_course is not provided on the website and I am wondering if they are available to the public or not. If yes, could you send me those materials so that I could follow your tutorial to practice using Maker? >>> >>> Thank you >>> Best >>> Qihua >> > From carsonhh at gmail.com Wed Apr 27 12:17:22 2016 From: carsonhh at gmail.com (Carson Holt) Date: Wed, 27 Apr 2016 12:17:22 -0600 Subject: [maker-devel] Maker example data for 2013 GMOD summer school In-Reply-To: <8F27CEB4-B16B-4BDC-BA11-5FCCBD05BC3C@ucr.edu> References: <1772AAA1-C6ED-4FCA-B4C9-39F522D3D076@genetics.utah.edu> <8F27CEB4-B16B-4BDC-BA11-5FCCBD05BC3C@ucr.edu> Message-ID: <5ED1E884-9203-4409-8298-39F1D19C0CC0@gmail.com> Use maker with MPI. MPI does not just have to be on a cluster, it can be installed on a local computer or server (you probably already have it installed and don't realize it).
Instructions on how to set up MAKER with MPI are in the README and INSTALL files in the download. Example command (general form, then a single 16-core server): mpiexec -n <num_processes> maker ; mpiexec -n 16 maker Run across multiple machines (general form, then ten 16-core servers): mpiexec -hostfile <hostfile> -n <num_processes> maker ; mpiexec -hostfile ip_list -n 160 maker The second option requires a network-mounted working directory accessible to all machines. --Carson > On Apr 27, 2016, at 12:06 PM, Qihua Liang wrote: > > Hi, Daniel > > I am using Maker to annotate cowpea genome for a while but now I am wondering if I could use multi-threads instead of single one? It has been running tblastx for such a long time using single thread. But I couldn't find such settings in documentations to assign multi-threads to run Maker. Is there such an option? > > Thank you > Qihua > > >> On Mar 30, 2016, at 2:17 PM, Daniel Ence wrote: >> >> Hi Qihua, >> >> I believe that most of the data we used in the tutorials are available in the maker/data directory, which is included in all maker distributions. Please let me know if that isn't the case. >> >> ~Daniel >> >> >> Daniel Ence >> Graduate Student >> Eccles Institute of Human Genetics >> University of Utah >> 15 North 2030 East, Room 2100 >> Salt Lake City, UT 84112-5330 >> >>> On Mar 30, 2016, at 3:10 PM, Qihua Liang wrote: >>> >>> Hi Michael and Daniel, >>> >>> I am a graduate student in UC Riverside, and recently I am learning to use Maker for genome annotation. I was trying to find some tutorials to follow and practice on example data, and I found out that you were giving a talk on Maker during 2013 GMOD summer school and the tutorial of that is very detailed. Nice job! >>> >>> But example data under the folder you mentioned as ./maker/maker_course is not provided on the website and I am wondering if they are available to the public or not. If yes, could you send me those materials so that I could follow your tutorial to practice using Maker?
>>> >>> Thank you >>> Best >>> Qihua >> > > From hcma at uci.edu Wed Apr 27 19:04:29 2016 From: hcma at uci.edu (hcma) Date: Wed, 27 Apr 2016 18:04:29 -0700 Subject: [maker-devel] Augustus training for new species Message-ID: <4c7e0e58e9b55798bd255238f8ff9ae2@uci.edu> Hi, I would like to use Maker to generate a set for training Augustus for a new species. The steps for training SNAP are well documented, but I am still confused as to how to train Augustus using the AugustusWeb. I have used fathom and forge to generate 'export.ann' and 'export.dna'. So what I need to do next is to run zff2augustus_gbk.pl in the directory that has the export.ann and export.dna files? Then I feed the train.gb file to AugustusWeb? Please advise. Thanks Karen From xvazquezc at gmail.com Wed Apr 27 19:14:35 2016 From: xvazquezc at gmail.com (=?UTF-8?Q?Xabier_V=C3=A1zquez_Campos?=) Date: Thu, 28 Apr 2016 11:14:35 +1000 Subject: [maker-devel] Augustus training for new species In-Reply-To: <4c7e0e58e9b55798bd255238f8ff9ae2@uci.edu> References: <4c7e0e58e9b55798bd255238f8ff9ae2@uci.edu> Message-ID: Is it a plant genome? If it isn't, use BUSCO. It will do the whole training in a single step. It will get your assembly fasta file and generate the species profile in the Augustus species folder. See previous thread: https://groups.google.com/forum/#!topic/maker-devel/vp8R06VVQGQ If you have a plant genome, use the "zff2augustus_gbk.pl". I have this in my files: This will take the export.dna generated by fathom and generate a *.gb file that will be used as "training gene structure file" in a new training submission in WebAugustus, but remember to give it a new name in the submission, e.g.
MYGENOME_v2, or Maker won't see the difference (same name)*: perl PATH/TO/SCRIPT/zff2augustus_gbk.pl > MYGENOME.train.gb *this applies if you do a re-run of Augustus within Maker On 28 April 2016 at 11:04, hcma wrote: > Hi, > > I would like to use Maker to generate a set for training Augustus for a > new species. The steps for training SNAP is well documented, but i am still > confused as to how to train Augustus using the AugustusWeb. > > I have used fathom and forge to generate 'export.ann' and 'export.dna'. So > what i need to do next is to run zff2augustus_gbk.pl in the directory > that has the export.ann and export.dna files? > > Then i feed the train.gb file to AugustusWeb? > > Please advise. > > Thanks > Karen > > _______________________________________________ > maker-devel mailing list > maker-devel at yandell-lab.org > http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org > -- Xabier Vázquez-Campos, *PhD* *Research Associate* Water Research Centre School of Civil and Environmental Engineering The University of New South Wales Sydney NSW 2052 AUSTRALIA -------------- next part -------------- An HTML attachment was scrubbed... URL: From xvazquezc at gmail.com Wed Apr 27 19:55:13 2016 From: xvazquezc at gmail.com (=?UTF-8?Q?Xabier_V=C3=A1zquez_Campos?=) Date: Thu, 28 Apr 2016 11:55:13 +1000 Subject: [maker-devel] error with ipr_update_gff ? Message-ID: Hi, I'm following the steps in the post processing of annotations from the 2014 GMOD tutorial but when using the ipr_update_gff I get loads of errors such as those below: Use of uninitialized value $method in string eq at /share/apps/maker/2.31.6/bin/ipr_update_gff line 190, <$IN> line 228738. Use of uninitialized value $gene_id in hash element at /share/apps/maker/2.31.6/bin/ipr_update_gff line 203, <$IN> line 228738. Is this normal?
Thanks, Xabier -- Xabier Vázquez-Campos, *PhD* *Research Associate* Water Research Centre School of Civil and Environmental Engineering The University of New South Wales Sydney NSW 2052 AUSTRALIA -------------- next part -------------- An HTML attachment was scrubbed... URL: From jacqueline.atkins at nih.gov Thu Apr 28 12:55:30 2016 From: jacqueline.atkins at nih.gov (Atkins, Jacqueline (NIH/NIAID) [C]) Date: Thu, 28 Apr 2016 18:55:30 +0000 Subject: [maker-devel] Segmenation Error Message-ID: Hi Everyone, I have a user who is reporting a segmentation error. I am not really even sure where to start. Not sure if this is related to config issues or the way in which the software is being executed. Any advice would be greatly appreciated. Here is the command: mpiexec -n 50 maker maker_opts_run1.ctl maker_bopts.ctl maker_exe.ctl --Next Contig-- examining contents of the fasta file and run log examining contents of the fasta file and run log [ai-hpcn063:99111] *** Process received signal *** [ai-hpcn063:99111] Signal: Segmentation fault (11) [ai-hpcn063:99111] Signal code: Address not mapped (1) [ai-hpcn063:99111] Failing at address: (nil) examining contents of the fasta file and run log [ai-hpcn053:119610] *** Process received signal *** [ai-hpcn053:119610] Signal: Segmentation fault (11) [ai-hpcn053:119610] Signal code: Address not mapped (1) [ai-hpcn053:119610] Failing at address: (nil) [ai-hpcn053:119610] [ 0] /lib64/libc.so.6(+0x35a00)[0x2aaaab85ca00] [ai-hpcn053:119610] *** End of error message *** examining contents of the fasta file and run log examining contents of the fasta file and run log examining contents of the fasta file and run log examining contents of the fasta file and run log examining contents of the fasta file and run log [ai-hpcn063:99111] [ 0] /lib64/libc.so.6(+0x35a00)[0x2aaaab85ca00] [ai-hpcn063:99111] *** End of error message *** examining contents of the fasta file and run log examining contents of the fasta file and run log examining contents
of the fasta file and run log examining contents of the fasta file and run log examining contents of the fasta file and run log examining contents of the fasta file and run log examining contents of the fasta file and run log examining contents of the fasta file and run log examining contents of the fasta file and run log examining contents of the fasta file and run log examining contents of the fasta file and run log ___________________________________________ Jacqueline Atkins, Contractor Sr. HPC Engineer National Institute of Allergy and Infectious Diseases SRA International Inc., A CSRA Company office 301-451-9644, mobile 301-767- 7110 5601 Fishers Lane, 6A60, Bethesda, MD 20852 Disclaimer: The information in this e-mail and any of its attachments is confidential and may contain sensitive information. It should not be used by anyone who is not the original intended recipient. If you have received this e-mail in error please inform the sender and delete it from your mailbox or any other storage devices. National Institute of Allergy and Infectious Diseases shall not accept liability for any statements made that are sender's own and not expressly made on behalf of the NIAID by one of its representatives. -------------- next part -------------- An HTML attachment was scrubbed... URL: From maker-devel at yandell-lab.org Fri Apr 29 12:54:07 2016 From: maker-devel at yandell-lab.org (maker-devel) Date: Sat, 30 Apr 2016 00:24:07 +0530 Subject: [maker-devel] hi prnt Message-ID: A non-text attachment was scrubbed... Name: not available Type: multipart/alternative Size: 1 bytes Desc: not available URL: -------------- next part -------------- -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From simon.blanchoud at otago.ac.nz Tue Apr 5 18:15:14 2016 From: simon.blanchoud at otago.ac.nz (Simon Blanchoud) Date: Wed, 06 Apr 2016 00:15:14 -0000 Subject: [maker-devel] ncRNA predictions Message-ID: <5704550C.8010602@otago.ac.nz> Hi all, I have been annotating ab initio my de novo assembly of the Botrylloides leachi genome with MAKER 2.31.8 for some time now (3rd round running as I write). For this last round, I also wanted to get some predictions for non-coding RNAs as mentioned in the maker_opts.ctl. Now that this (seems to) work properly, I thought I should share a few issues I faced with you. First of all, both tRNAscan-SE and snoscan have very limited documentation (which I know is none of your business), which makes things a bit trickier. Second, snoscan requires an rRNA file to work (not very obvious from maker_opts.ctl), and it turns out that there is a hard-coded limit in snoscan of 100 sequences for that rRNA file (not that the error message is helpful either). Overall, this was not exactly practical as I'm assembling a de novo genome, and thus do not have these rRNA sequences. What I did (and it seems to work okay) was to pull out the closest sequences I could find from the Rfam database sequences. By combining the information from their website on the RF families, the taxonomy.txt file and the corresponding fasta files (all from their FTP site), I extracted (for a eukaryotic organism, that is) one complete sequence for each subunit, i.e. RF00001, RF00002, RF01960 and RF02543. Turns out pooling more than just one makes it extremely slow to run. You might know a better approach for getting such an rRNA file, but it does look like a pretty sound approach to me, and might deserve a comment in maker_opts.ctl. Third, once snoscan was running, I ran into the same issue as https://groups.google.com/d/topic/maker-devel/E6BKjXx2ra0/discussion i.e. the parsing of the snoscan output crashed.
After (quite) some debugging, I found out that there is an issue in the creation of the hash table containing the hits. As I am not sure how you wanted to organize them originally, I made a wild guess and re-wrote this section of the Widget. So it might not group the hits as you wanted but at least it now runs properly (and the output appears quite correct to me). I've attached the Widget. Otherwise, thanks heaps for all the hard work, it's an amazing tool and it does work great! Cheers, Simon -------------- next part -------------- A non-text attachment was scrubbed... Name: snoscan.pm Type: text/x-perl-script Size: 8128 bytes Desc: not available URL: From wangyugui.wei at gmail.com Sat Apr 9 09:35:22 2016 From: wangyugui.wei at gmail.com (Yugui Wang) Date: Sat, 09 Apr 2016 15:35:22 -0000 Subject: [maker-devel] Segmentation fault of MKAER with openmpi on CentOS 7.2 Message-ID: Hi. MAKER segfaults with openmpi on CentOS 7.2. Both MAKER 2.31.8 and 3.00.0 beta have the same error. $ mpirun -mca btl ^openib -n 4 maker STATUS: Parsing control files... STATUS: Processing and indexing input FASTA files... -------------------------------------------------------------------------- mpirun noticed that process rank 2 with PID 39507 on node T620 exited on signal 11 (Segmentation fault). -------------------------------------------------------------------------- $ file core.39505 core.39505: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/bin/perl /bio/hpc-bio/maker-3.00.0/bin/make $ gdb /usr/bin/perl core.39505 (gdb) where #0 0x00007f0e4a7d2060 in ?? () #1 #2 0x00007f0e4a7d2060 in ??
() #3 #4 0x00007f0e4bdfba50 in mca_btl_vader_component_progress () from /usr/lib64/openmpi/lib/openmpi/mca_btl_vader.so #5 0x00007f0e63ec8eda in opal_progress () from /usr/lib64/openmpi/lib/libopen-pal.so.13 #6 0x00007f0e4a191ac5 in mca_pml_ob1_probe () from /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so #7 0x00007f0e65b0dc06 in PMPI_Probe () from /usr/lib64/openmpi/lib/libmpi.so #8 0x00007f0e59007020 in C_MPI_Recv (buf=buf at entry=0x4146b30, source=source at entry=-1, tag=tag at entry=1111) at MPI.xs:56 #9 0x00007f0e590071e3 in XS_Parallel__Application__MPI_C_MPI_Recv (my_perl=, cv=) at MPI.c:391 #10 0x00007f0e657ce39f in Perl_pp_entersub () from /usr/lib64/perl5/CORE/libperl.so #11 0x00007f0e657c6b16 in Perl_runops_standard () from /usr/lib64/perl5/CORE/libperl.so #12 0x00007f0e65763925 in perl_run () from /usr/lib64/perl5/CORE/libperl.so #13 0x0000000000400d99 in main () $ echo $LD_PRELOAD /usr/lib64/openmpi/lib/libmpi.so: $ echo $OMPI_MCA_mpi_warn_on_fork 0 $ rpm -qa openmpi openmpi-1.10.0-10.el7.x86_64 $ uname -a Linux T620 3.10.0-327.13.1.el7.x86_64 #1 SMP Thu Mar 31 16:04:38 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux $ ulimit -a core file size (blocks, -c) unlimited data seg size (kbytes, -d) unlimited scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 1029973 max locked memory (kbytes, -l) 64 max memory size (kbytes, -m) unlimited open files (-n) 1024 pipe size (512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 102400 cpu time (seconds, -t) unlimited max user processes (-u) 4096 virtual memory (kbytes, -v) unlimited file locks (-x) unlimited $ mpiexec --version mpiexec (OpenRTE) 1.10.0 Report bugs to http://www.open-mpi.org/community/help/ $ From h.lee12 at uq.edu.au Tue Apr 12 21:05:12 2016 From: h.lee12 at uq.edu.au (Jenny Lee) Date: Wed, 13 Apr 2016 03:05:12 -0000 Subject: [maker-devel] Reformat maker gff3 Message-ID: <1460516670248.1644@uq.edu.au> Hi all, I would like to 
update my maker gff3 file to only contain the genes I've decided to keep - all maker genes, a subset of abinitio genes (which have interproscan hits). I would like to also exclude the repeats information and only retain the CDS, gene, exon and mRNA - like the format we usually see in published data. I've been trying to do this manually and it gets messy. Any ideas? Thanks a lot. Regards, Jenny Lee -------------- next part -------------- An HTML attachment was scrubbed... URL: From mcampbel at cshl.edu Wed Apr 20 07:16:43 2016 From: mcampbel at cshl.edu (Campbell, Michael) Date: Wed, 20 Apr 2016 13:16:43 -0000 Subject: [maker-devel] A way to compare 2 annotation runs? In-Reply-To: References: <57161FB2.30901@students.uni-mainz.de> <148F7C5C-38D9-4F2C-BA0C-010E133B7F34@gmail.com> Message-ID: I suspect the Jaccard distance would let you see the annotation sets converging over iterations. The distance between run one and run three should be greater than the distance between run one and two or run two and three. MAKER calculates a modified Jaccard distance between the MAKER generated gene models and the aligned evidence called Annotation Edit Distance or AED. Comparing the distribution of AEDs between annotations is a way to tell which annotation set matches the evidence the best. As a rule of thumb an annotation set is pretty good if greater than ~95% of the annotations have an AED less than 0.5. There is an accessory script in the MAKER bin called AED_cdf_generator.pl that helps in comparing AED scores. This script is mentioned in the protocols paper Carson mentioned. This paper also describes using protein family domains and homology to manually curated proteins in swissprot as quality metrics. Here is a link to the paper. Let me know if you need me to send you a pdf. http://onlinelibrary.wiley.com/doi/10.1002/0471250953.bi0411s48/abstract I also have a "use at your own risk" script on github that I use to compare MAKER runs two at a time. 
The script is called compare_annotations_3.2.pl. This particular script has had a long evolution, so it is a little hard to follow the code, but it might be helpful. https://github.com/mscampbell/Genome_annotation The SOBA tool that Barry mentioned is a lot more flexible and if you are familiar with perl the GAL library does a lot of heavy lifting for you. Mike On Apr 19, 2016, at 5:44 PM, Cook, Malcolm wrote: Just a quick thought: the smallest summary of what you're after might be the jaccard difference between your annotations as computed by bedtools: http://bedtools.readthedocs.org/en/latest/content/tools/jaccard.html From: maker-devel [mailto:maker-devel-bounces at yandell-lab.org] On Behalf Of Barry Moore Sent: Tuesday, April 19, 2016 4:37 PM To: Florian; maker-devel Cc: Campbell, Michael Subject: Re: [maker-devel] A way to compare 2 annotation runs? The Sequence Ontology provides some tools for this: SOBAcl has some pre-configured reports/graphs with some flexibility to modify their content/layout. https://github.com/The-Sequence-Ontology/SOBA This simple example provides a table for two GFF3 files of the count of feature types: SOBAcl --columns file --rows type --data type --data_type count \ data/dmel-all-r5.30_0001000.gff data/dmel-all-r5.30_0010000.gff More complex examples are available in the test file SOBA/t/sobacl_test.sh The GAL library is a perl library that works well with MAKER output and other valid GFF3 documents.
It has some scripts that would provide metrics along the lines of what you're looking for, but it is primarily a programming library to make it easy to roll your own: https://github.com/The-Sequence-Ontology/GAL If you're OK with a little bit of perl code, then by modifying the synopsis code in the README a bit, the splice complexity metrics described here (http://www.ncbi.nlm.nih.gov/pubmed/19236712) are easy to produce: use GAL::Annotation; my $annot = GAL::Annotation->new(qw(file.gff file.fasta)); my $features = $annot->features; my $genes = $features->search( {type => 'gene'} ); while (my $gene = $genes->next) { print $gene->feature_id . "\t"; print $gene->splice_complexity . "\n"; } Hope that helps, Barry On Apr 19, 2016, at 9:08 AM, Carson Holt wrote: I'm going to ask Michael Campbell to answer this. He wrote a protocols paper that will help. --Carson On Apr 19, 2016, at 6:08 AM, Florian wrote: Hello All, We ran MAKER on a newly assembled genome for 3 iterations, since 2 seems to be the recommended standard and while on holiday I just ran it a third time. Now I want to compare the results of the iterations to see where the annotation (hopefully) improved/changed, but I can't really come up with a clever way to do this. I reckon this has to be an often-solved problem, though I couldn't find a solution except an older entry in this mailing list, and that wasn't helpful. So how are people assessing quality of a maker run? How do you say one run was 'better' than another?
best regards & thanks for your input, Florian _______________________________________________ maker-devel mailing list maker-devel at yandell-lab.org http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org From mcampbel at cshl.edu Mon Apr 25 08:16:42 2016 From: mcampbel at cshl.edu (Campbell, Michael) Date: Mon, 25 Apr 2016 14:16:42 -0000 Subject: [maker-devel] A way to compare 2 annotation runs? In-Reply-To: <571E05B8.5080508@students.uni-mainz.de> References: <57161FB2.30901@students.uni-mainz.de> <148F7C5C-38D9-4F2C-BA0C-010E133B7F34@gmail.com> <571E05B8.5080508@students.uni-mainz.de> Message-ID: <3F81B34E-B7FC-4E37-AFA2-514AB2A397F1@cshl.edu> Hi Florian, You're not off topic here. I've attached the paper. Looking at the plot you sent, I'm guessing that there is a red dot right underneath the turquoise dot at (1,1); that would be consistent with the compare annotation script output. Do you have keep_preds=1 set in the maker_opts.ctl file? If so that would explain the abundance of AED=1 annotations. When keep_preds is set to 1 all of the gene predictions are reported as gene models; when keep_preds is set to 0 only the models with evidence support are reported. Also, how many genes are you expecting and how many are you getting? The paper I attached goes over different approaches to building final gene sets. The plot attached suggests to me that you have a bunch of unsupported gene models that need to be cleaned out. I will commonly filter out any gene model with an AED of 1 unless it has a protein family domain.
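[Editorial sketch, not part of Mike's message.] The filter he describes (drop models with an AED of 1 unless they carry a protein family domain) could look roughly like the Python below. It assumes the `_AED` key that MAKER writes in column 9 of its GFF3 (visible in the example line quoted later in this thread), and it assumes, hypothetically, that domain hits were added to the mRNA line as a `Dbxref` attribute; check which attribute your own InterProScan post-processing actually writes. The function names are made up for illustration.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict of key -> value."""
    attrs = {}
    for field in column9.strip().strip(";").split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def keep_transcript(gff_line, max_aed=1.0, domain_tag="Dbxref"):
    """Return True/False for mRNA lines, None for every other line.

    A transcript is kept when its AED is below max_aed, or when it
    carries the (assumed) domain attribute even though its AED is 1.
    """
    cols = gff_line.rstrip("\n").split("\t")
    if len(cols) != 9 or cols[2] != "mRNA":
        return None  # not an mRNA feature line
    attrs = parse_attributes(cols[8])
    aed = float(attrs.get("_AED", "1.0"))  # missing AED is treated as unsupported
    has_domain = domain_tag in attrs
    return aed < max_aed or has_domain
```

Note that this only decides per transcript; a real clean-up would also have to drop the parent gene and the child exon/CDS lines of every rejected mRNA.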
This will almost certainly bring the fraction of annotated gene models with an AED <0.5 up to around 90% or more. As annotations improve you do usually see fewer total genes but they are longer. One of the best ways to get a feel for annotation quality is to load the annotations into a browser like Apollo or JBrowse and look at a few of your favorite genes. Thanks, Mike On Apr 25, 2016, at 7:55 AM, Florian wrote: Hello All, First off, thank you all for your input! I took a look at all your suggestions and have some questions: The SOBAcl tool is nice but I can't seem to find a way to get to the AED values MAKER produces. For example here is a line from my GFF file: scaffold2278_size3634 maker mRNA 124 2128 . - . ID=CRIP_012390-RA;Parent=CRIP_012390;Name=CRIP_012390-RA;Alias=maker-scaffold2278_size3634-augustus-gene-0.3-mRNA-1;_AED=0.16;_QI=0|1|0.33|1|1|1|3|0|574;_eAED=0.16;Note=Similar to Tbc1d25: TBC1 domain family member 25 (Mus musculus); Notice the _AED entry is in the 9th "field" combined with all the other descriptive data. Is there a way to get to this? The information about number and mean/distribution of length of genes, while certainly valuable, is hard to interpret for me. How would one classify improvement? More genes annotated? Fewer genes but longer averages? For the moment I will take a look at GAL, though perl is not my strongest language. For the scripts Michael provided I have attached the results. It would be great if you could send me a pdf version of the paper you mentioned. The comparison script lists SN/SP/AC with >98%, which indicates there should be no big changes between annotations, right? But the cumulative AED graph shows a LOT of entries have an AED value of 1, which would indicate the exact opposite? You said 95% with less than 0.5 AED would be pretty good, so only ~55% would mean this is a pretty bad annotation? I am not sure if this is maybe too far off topic for the maker mailing list, but thank you for any clarification / input.
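[Editorial sketch, not part of the original message.] On the question of how to get at the `_AED` value buried in column 9: one way, assuming tab-separated GFF3 lines like the one quoted above, is a small regular expression over the attribute column. The Python below is a minimal illustration with made-up function names; it is not the AED_cdf_generator.pl script mentioned elsewhere in this thread, only a rough stand-in for the "fraction of models below an AED cutoff" figure.

```python
import re

# Match "_AED=<number>" at the start of the column or after a ";",
# so that the separate "_eAED" key is not picked up by accident.
AED_PATTERN = re.compile(r"(?:^|;)_AED=([0-9.]+)")


def aed_values(gff_lines):
    """Collect the _AED value of every mRNA line in an iterable of GFF3 lines."""
    values = []
    for line in gff_lines:
        if line.startswith("#"):
            continue  # skip comments and directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        match = AED_PATTERN.search(cols[8])
        if match:
            values.append(float(match.group(1)))
    return values


def fraction_below(values, threshold=0.5):
    """Fraction of values strictly below the threshold (0.0 for an empty list)."""
    if not values:
        return 0.0
    return sum(1 for v in values if v < threshold) / len(values)
```

Running `aed_values` over an open GFF3 file handle for each MAKER iteration and comparing `fraction_below(values, 0.5)` between runs gives a single number per run to set against the rule of thumb discussed in this thread.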
kind regards, Florian On 20.04.2016 15:16, Campbell, Michael wrote: I suspect the Jaccard distance would let you see the annotation sets converging over iterations. The distance between run one and run three should be greater than the distance between run one and two or run two and three. MAKER calculates a modified Jaccard distance between the MAKER generated gene models and the aligned evidence called Annotation Edit Distance or AED. Comparing the distribution of AEDs between annotations is a way to tell which annotation set matches the evidence the best. As a rule of thumb an annotation set is pretty good if greater than ~95% of the annotations have an AED less than 0.5. There is an accessory script in the MAKER bin called AED_cdf_generator.pl that helps in comparing AED scores. This script is mentioned in the protocols paper Carson mentioned. This paper also describes using protein family domains and homology to manually curated proteins in swissprot as quality metrics. Here is a link to the paper. Let me know if you need me to send you a pdf. http://onlinelibrary.wiley.com/doi/10.1002/0471250953.bi0411s48/abstract I also have a "use at your own risk" script on github that I use to compare MAKER runs two at a time. The script is called compare_annotations_3.2.pl. This particular script has had a long evolution, so it is a little hard to follow the code, but it might be helpful. https://github.com/mscampbell/Genome_annotation The SOBA tool that Barry mentioned is a lot more flexible and if you are familiar with perl the GAL library does a lot of heavy lifting for you. Mike On Apr 19, 2016, at 5:44 PM, Cook, Malcolm wrote: Just a quick thought: the smallest summary of what you're after might be the jaccard difference between your annotations as computed by bedtools: http://bedtools.readthedocs.org/en/latest/content/tools/jaccard.html
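[Editorial sketch, not part of the quoted messages.] To make the Jaccard idea concrete: for two annotation runs, take the bases covered by features in each run and divide the size of the intersection by the size of the union. The Python below is a toy illustration on a single sequence with half-open (start, end) intervals; it is not how bedtools implements `jaccard`, and the function names are made up. Real GFF3 coordinates are 1-based and inclusive and span many scaffolds, so they would need converting and grouping per scaffold first.

```python
def merge(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def covered(intervals):
    """Total number of bases covered by already-merged intervals."""
    return sum(end - start for start, end in intervals)


def jaccard(a, b):
    """Jaccard index of the bases covered by interval sets a and b."""
    a_merged, b_merged = merge(a), merge(b)
    union = covered(merge(a_merged + b_merged))
    # |A and B| = |A| + |B| - |A or B|
    intersection = covered(a_merged) + covered(b_merged) - union
    return intersection / union if union else 0.0
```

Under this definition two identical runs score 1.0 and two runs with no shared bases score 0.0, so the "distance" Mike talks about would be one minus this value.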
From: maker-devel [mailto:maker-devel-bounces at yandell-lab.org] On Behalf Of Barry Moore
Sent: Tuesday, April 19, 2016 4:37 PM
To: Florian; maker-devel
Cc: Campbell, Michael
Subject: Re: [maker-devel] A way to compare 2 annotation runs?

The Sequence Ontology provides some tools for this:

SOBAcl has some pre-configured reports/graphs with some flexibility to modify their content/layout.
https://github.com/The-Sequence-Ontology/SOBA

This simple example provides a table for two GFF3 files of the count of feature types:

    SOBAcl --columns file --rows type --data type --data_type count \
        data/dmel-all-r5.30_0001000.gff data/dmel-all-r5.30_0010000.gff

More complex examples are available in the test file SOBA/t/sobacl_test.sh

The GAL library is a Perl library that works well with MAKER output and other valid GFF3 documents. It has some scripts that would provide metrics along the lines of what you're looking for, but it is primarily a programming library to make it easy to roll your own.
https://github.com/The-Sequence-Ontology/GAL

If you're OK with a little bit of Perl code, then by modifying the synopsis code in the README a bit, the splice complexity metrics described here (http://www.ncbi.nlm.nih.gov/pubmed/19236712) are easy to produce:

    use GAL::Annotation;

    my $annot = GAL::Annotation->new(qw(file.gff file.fasta));
    my $features = $annot->features;

    my $genes = $features->search( {type => 'gene'} );
    while (my $gene = $genes->next) {
        print $gene->feature_id . "\t";
        print $gene->splice_complexity . "\n";
    }

Hope that helps,

Barry

On Apr 19, 2016, at 9:08 AM, Carson Holt wrote:

I'm going to ask Michael Campbell to answer this. He wrote a protocols paper that will help.

-Carson

On Apr 19, 2016, at 6:08 AM, Florian wrote:

Hello All,

We ran MAKER on a newly assembled genome for 3 iterations, since 2 seems to be the recommended standard and while on holiday I just ran it a third time.
Now I want to compare the results of the iterations to see where the annotation (hopefully) improved/changed, but I can't really come up with a clever way to do this.

I reckon this has to be an often-solved problem, though I couldn't find a solution except an older entry in this mailing list, and that wasn't helpful.

So how are people assessing quality of a maker run? How do you say one run was 'better' than another?

best regards & thanks for your input,
Florian

_______________________________________________
maker-devel mailing list
maker-devel at yandell-lab.org
http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org

-------------- next part --------------
A non-text attachment was scrubbed...
Name: bi0411 (1).pdf
Type: application/pdf
Size: 484329 bytes
Desc: bi0411 (1).pdf
URL:

From ian.misner at nih.gov Mon Apr 25 10:20:44 2016
From: ian.misner at nih.gov (Misner, Ian (NIH/NIAID) [C])
Date: Mon, 25 Apr 2016 16:20:44 -0000
Subject: [maker-devel] BUSCO
Message-ID:

Hello,

Are there any guidelines for using BUSCO to help train MAKER? CEGMA has been discontinued, but I used to use the cegma2zff.pl steps to use those proteins as a training step. BUSCO seems to train Augustus, but I'm not sure what file to pass from BUSCO to MAKER for this to be properly utilized. I didn't see anything specific about this in the archives.

-----
Ian Misner, Ph.D.
Computational Genomics Specialist
Contractor, Medical Science and Computing, Inc.
Bioinformatics and Computational Biosciences Branch (BCBB)
NIH/NIAID/OD/OSMO/OCICB
5601 Fishers Lane, Room 4A59
Rockville, MD 20892
Office: 301-761-6208
Mobile: 301-704-0151
Email: ian.misner at nih.gov
Web: BCBB Home Page
Twitter: @NIAIDBioIT

Disclaimer: The information in this e-mail and any of its attachments is confidential and may contain sensitive information. It should not be used by anyone who is not the original intended recipient. If you have received this e-mail in error please inform the sender and delete it from your mailbox or any other storage devices. National Institute of Allergy and Infectious Diseases shall not accept liability for any statements made that are sender's own and not expressly made on behalf of the NIAID by one of its representatives.

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From mcampbel at cshl.edu Mon Apr 25 11:29:46 2016
From: mcampbel at cshl.edu (Campbell, Michael)
Date: Mon, 25 Apr 2016 17:29:46 -0000
Subject: [maker-devel] A way to compare 2 annotation runs?
In-Reply-To: <1B00E9C3-490A-4C06-A188-1F9EBC02F680@students.uni-mainz.de>
References: <57161FB2.30901@students.uni-mainz.de> <148F7C5C-38D9-4F2C-BA0C-010E133B7F34@gmail.com> <571E05B8.5080508@students.uni-mainz.de> <3F81B34E-B7FC-4E37-AFA2-514AB2A397F1@cshl.edu> <571E3256.90705@students.uni-mainz.de> <8287C01C-93C2-4BCA-9483-4EEE0E584ACD@gmail.com> <1B00E9C3-490A-4C06-A188-1F9EBC02F680@students.uni-mainz.de>
Message-ID:

Hi Florian,

I just looked at the code for the AED_cdf_generator.pl script, and it is probably grabbing the AEDs off of the raw gene predictions. I'll modify the code so it only grabs the mRNA lines. In the meantime you can use gff3_merge with the -g flag, and it will output the MAKER genes only.
The command would look like this:

    gff3_merge -g all.gff -o genes_only.gff

Mike

> On Apr 25, 2016, at 1:00 PM, Dolze, Florian wrote:
>
> This might be the case; I simply used the script on my complete output gff with all features in it, without manually filtering for only mRNA.
>
> On a side note regarding keep_preds: if I wanted to call genes somewhat less stringently because I am expecting to find more, would I set this to e.g. 0.5 to increase the number of genes called?
>
> -Florian
>
>> On 25.04.2016 at 17:30, Carson Holt wrote:
>>
>> If you're running with keep_preds=0, then you either passed in models with model_gff (always kept even without evidence support), or you are parsing out the AED of non-gene reference models from the GFF3 when building your CDF graph. If that is the case, make sure you only pull AED off of features labeled as mRNA in column 3 (the type column) of the GFF3 and not match features.
>>
>> -Carson
>>
>>> On Apr 25, 2016, at 9:05 AM, Florian wrote:
>>>
>>> Hi Mike,
>>>
>>> We have run MAKER with keep_preds=0. For completeness I attached the options file we used. We used a SNAP model trained on CEGMA data, plus GeneMark and Augustus trained with their web service, for the first run and then iterated on the results.
>>>
>>> We expect around 17,000-18,000 genes, but our annotation contains ~12.5k according to SOBAcl. If I remove ~40% with AED values of 1, I will be left with very few compared to the expected number.
>>>
>>> type X file type (count)
>>> second round = ../v2_second_round_functional_blast.gff
>>> third round  = ../v2_third_round_functional_blast.gff
>>>
>>> type                      | second round | third round
>>> --------------------------+--------------+------------
>>> CDS                       |        63953 |       65160
>>> contig                    |         5292 |        5292
>>> exon                      |        60381 |       61233
>>> expressed_sequence_match  |       275160 |      275160
>>> five_prime_UTR            |         9424 |        8764
>>> gene                      |        12654 |       12235
>>> mRNA                      |        13698 |       13137
>>> match                     |       146111 |      136852
>>> match_part                |      1704978 |     1697601
>>> protein_match             |       421814 |      421814
>>> three_prime_UTR           |         6894 |        6325
>>>
>>> regards,
>>> Florian
>>>
>>>> On 25.04.2016 16:16,
Campbell, Michael wrote:
>>>> Hi Florian,
>>>>
>>>> You're not off topic here. I've attached the paper.
>>>>
>>>> Looking at the plot you sent, I'm guessing that there is a red dot right underneath the turquoise dot at (1,1); that would be consistent with the compare annotation script output. Do you have keep_preds=1 set in the maker_opts.ctl file? If so, that would explain the abundance of AED=1 annotations. When keep_preds is set to 1, all of the gene predictions are reported as gene models; when keep_preds is set to 0, only the models with evidence support are reported. Also, how many genes are you expecting and how many are you getting?
>>>>
>>>> The paper I attached goes over different approaches to building final gene sets. The plot attached suggests to me that you have a bunch of unsupported gene models that need to be cleaned out. I will commonly filter out any gene model with an AED of 1 unless it has a protein family domain. This will almost certainly bring the fraction of annotated gene models with an AED <0.5 up to around 90% or more.
>>>>
>>>> As annotations improve you do usually see fewer total genes, but they are longer.
>>>>
>>>> One of the best ways to get a feel for annotation quality is to load the annotations into a browser like Apollo or JBrowse and look at a few of your favorite genes.
>>>>
>>>> Thanks,
>>>> Mike

From mcampbel at cshl.edu Mon Apr 25 11:43:50 2016
From: mcampbel at cshl.edu (Campbell, Michael)
Date: Mon, 25 Apr 2016 17:43:50 -0000
Subject: [maker-devel] A way to compare 2 annotation runs?
In-Reply-To:
References: <57161FB2.30901@students.uni-mainz.de> <148F7C5C-38D9-4F2C-BA0C-010E133B7F34@gmail.com> <571E05B8.5080508@students.uni-mainz.de> <3F81B34E-B7FC-4E37-AFA2-514AB2A397F1@cshl.edu> <571E3256.90705@students.uni-mainz.de> <8287C01C-93C2-4BCA-9483-4EEE0E584ACD@gmail.com> <1B00E9C3-490A-4C06-A188-1F9EBC02F680@students.uni-mainz.de>
Message-ID: <23F95F61-E0DD-4F55-B3F0-499FC725627D@cshl.edu>

I updated the AED_cdf_generator.pl script on github so it only looks at mRNA lines. The only time that it would get AEDs from the gene predictions is if pred_stats was set to 1. Was pred_stats=1 set in the maker_opts.ctl file?
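The rule of thumb discussed in this thread (roughly 95% of annotations with an AED below 0.5) comes down to a cumulative fraction over the AED values of the mRNA features only. The sketch below is a rough Python illustration of that calculation and is not the code of AED_cdf_generator.pl; the function name `aed_cdf`, the step size, and the AED values are all invented for the example.

```python
def aed_cdf(aeds, step=0.25):
    """Return [(threshold, fraction of AEDs <= threshold)] for thresholds 0..1."""
    if not aeds:
        return []
    n_bins = round(1 / step)
    table = []
    for i in range(n_bins + 1):
        threshold = i * step
        # Small tolerance so values sitting exactly on a threshold are counted.
        count = sum(1 for aed in aeds if aed <= threshold + 1e-9)
        table.append((round(threshold, 3), count / len(aeds)))
    return table


# Hypothetical AED values taken from mRNA features only (no match features).
run = [0.05, 0.16, 0.30, 0.48, 1.0]

for threshold, fraction in aed_cdf(run):
    print(f"{threshold:.2f}\t{fraction:.2f}")
```

Running the same function on the mRNA AEDs of each MAKER iteration and comparing the resulting tables gives a quick view of which run fits the evidence better.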
Thanks,
Mike

> On Apr 25, 2016, at 1:29 PM, Campbell, Michael wrote:
>
> Hi Florian,
>
> I just looked at the code for the AED_cdf_generator.pl script, and it is probably grabbing the AEDs off of the raw gene predictions. I'll modify the code so it only grabs the mRNA lines. In the meantime you can use gff3_merge with the -g flag, and it will output the MAKER genes only. The command would look like this: gff3_merge -g all.gff -o genes_only.gff
>
> Mike

From mcampbel at cshl.edu Tue Apr 26 08:48:10 2016
From: mcampbel at cshl.edu (Campbell, Michael)
Date: Tue, 26 Apr 2016 14:48:10 -0000
Subject: [maker-devel] A way to compare 2 annotation runs?
In-Reply-To: <571F4E29.9080103@students.uni-mainz.de>
References: <57161FB2.30901@students.uni-mainz.de> <148F7C5C-38D9-4F2C-BA0C-010E133B7F34@gmail.com> <571E05B8.5080508@students.uni-mainz.de> <349A414A-BA65-420E-9A39-5B3583993AB9@genetics.utah.edu> <571F4E29.9080103@students.uni-mainz.de>
Message-ID: <7D300E49-AEF6-424B-912D-78F9551A14B8@cshl.edu>

Glad to hear it.
Good luck,
Mike

On Apr 26, 2016, at 7:16 AM, Florian wrote:

Hello all,

With the updated scripts things look much better. I get 95% of the mRNA features with <= 0.5 AED now, and SOBAcl gave me a mean AED value of 0.17 / 0.16 for run 2/3. I think that's an OK result for a newly assembled genome?

Thank you all for the great help,
Florian

On 25.04.2016 21:46, Barry Moore wrote:

Hi Florian,

Something like this should work:

    SOBAcl --data +_AED t/data/refseq_short.gff3 --data_type mean

Sorry, this feature was undocumented and I discovered a bug in it while I was looking at it just now, so you'll need to pull an update from git for it to work correctly. Basically, if you add a '+' to the value passed to --data, SOBAcl will treat the --data value (with the + removed) as a key to look up the value in the attributes from column 9, so if +_AED is given on the command line then the value of the _AED attribute will be used for the summary statistics. Note that if the attribute is missing for a given feature then 0 is used as the value (which is of course different than treating it as NULL). Also note that if --data_type is count, then features that have the given attribute are counted regardless of the value of the attribute.

Just FYI, grabbing those values with a GAL script would look like this (untested):

    use GAL::Annotation;

    my $annot = GAL::Annotation->new(qw(file.gff));
    my $features = $annot->features;

    my $mRNAs = $features->search( {type => 'mRNA'} );
    while (my $mRNA = $mRNAs->next) {
        print $mRNA->feature_id;
        print "\t";
        print $mRNA->attribute_value('_AED');
        print "\n";
    }

B

On Apr 25, 2016, at 5:55 AM, Florian <fdolze at students.uni-mainz.de> wrote:

Hello All,

First off, thank you all for your input! I took a look at all your suggestions and have some questions:

The SOBAcl tool is nice, but I can't seem to find a way to get to the AED values MAKER produces. For example, here is a line from my GFF file:

scaffold2278_size3634 maker mRNA 124 2128 . - .
ID=CRIP_012390-RA;Parent=CRIP_012390;Name=CRIP_012390-RA;Alias=maker-scaffold2278_size3634-augustus-gene-0.3-mRNA-1;_AED=0.16;_QI=0|1|0.33|1|1|1|3|0|574;_eAED=0.16;Note=Similar to Tbc1d25: TBC1 domain family member 25 (Mus musculus);

Notice that the _AED entry is in the 9th "field", combined with all the other descriptive data. Is there a way to get to this?

The information about the number of genes and the mean/distribution of their lengths, while certainly valuable, is hard for me to interpret. How would one classify improvement? More genes annotated? Fewer genes but longer averages?

For the moment I will take a look at GAL, though Perl is not my strongest language.

For the scripts Michael provided, I have attached the results. It would be great if you could send me a PDF version of the paper you mentioned. The comparison script lists SN/SP/AC at >98%, which indicates there should be no big changes between annotations, right? But the cumulative AED graph shows that a LOT of entries have an AED value of 1, which would indicate the exact opposite? You said 95% with less than 0.5 AED would be pretty good, so only ~55% would mean this is a pretty bad annotation?

I am not sure whether this is too far off topic for the MAKER mailing list, but thank you for any clarification / input.

kind regards,
Florian

On 20.04.2016 15:16, Campbell, Michael wrote:

I suspect the Jaccard distance would let you see the annotation sets converging over iterations. The distance between run one and run three should be greater than the distance between run one and two or run two and three.

MAKER calculates a modified Jaccard distance between the MAKER-generated gene models and the aligned evidence, called Annotation Edit Distance or AED. Comparing the distribution of AEDs between annotations is a way to tell which annotation set matches the evidence best. As a rule of thumb, an annotation set is pretty good if more than ~95% of the annotations have an AED less than 0.5.
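
A minimal sketch (not from the thread, and not part of MAKER, SOBA or GAL) of how the rule of thumb above could be checked directly from a MAKER GFF3 file, written in Python. It assumes that mRNA rows carry an _AED attribute in column 9, as in the example line quoted earlier in this message. The helper names and the second record are hypothetical and exist only for illustration.

```python
import io


def parse_attributes(column9):
    """Split a GFF3 column-9 string into a dict of key -> value."""
    attrs = {}
    for field in column9.strip().strip(";").split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def mrna_aeds(handle):
    """Yield the _AED value of every mRNA feature that carries one."""
    for line in handle:
        if line.startswith("##FASTA"):
            break  # embedded sequence follows; no more feature rows
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        attrs = parse_attributes(cols[8])
        if "_AED" in attrs:
            yield float(attrs["_AED"])


def fraction_below(values, threshold):
    """Fraction of values strictly below the threshold (0.0 if empty)."""
    values = list(values)
    if not values:
        return 0.0
    return sum(v < threshold for v in values) / len(values)


# Two mRNA rows: the first is abridged from the thread, the second is a
# hypothetical record added so the summary has more than one value.
example = "\n".join([
    "##gff-version 3",
    "\t".join(["scaffold2278_size3634", "maker", "mRNA", "124", "2128",
               ".", "-", ".",
               "ID=CRIP_012390-RA;Parent=CRIP_012390;_AED=0.16;_eAED=0.16;"]),
    "\t".join(["scaffold_hypothetical", "maker", "mRNA", "10", "900",
               ".", "+", ".",
               "ID=HYPO_000001-RA;Parent=HYPO_000001;_AED=1.00;"]),
]) + "\n"

aeds = list(mrna_aeds(io.StringIO(example)))
print(len(aeds), "mRNAs; fraction with AED < 0.5:", fraction_below(aeds, 0.5))
```

For a real run, the same functions could be pointed at an open file handle for each iteration's GFF3 and the fractions compared side by side. Unlike the SOBAcl behaviour described earlier in the thread, this sketch skips features without an _AED attribute rather than counting them as 0.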
There is an accessory script in the MAKER bin called AED_cdf_generator.pl that helps in comparing AED scores. This script is mentioned in the protocols paper Carson mentioned. The paper also describes using protein family domains and homology to manually curated proteins in Swiss-Prot as quality metrics. Here is a link to the paper. Let me know if you need me to send you a PDF.

http://onlinelibrary.wiley.com/doi/10.1002/0471250953.bi0411s48/abstract

I also have a "use at your own risk" script on GitHub that I use to compare MAKER runs two at a time. The script is called compare_annotations_3.2.pl. This particular script has had a long evolution, so the code is a little hard to follow, but it might be helpful.

https://github.com/mscampbell/Genome_annotation

The SOBA tool that Barry mentioned is a lot more flexible, and if you are familiar with Perl, the GAL library does a lot of heavy lifting for you.

Mike

On Apr 19, 2016, at 5:44 PM, Cook, Malcolm wrote:

Just a quick thought:

The smallest summary of what you're after might be the Jaccard difference between your annotations as computed by bedtools:

http://bedtools.readthedocs.org/en/latest/content/tools/jaccard.html

From: maker-devel [mailto:maker-devel-bounces at yandell-lab.org] On Behalf Of Barry Moore
Sent: Tuesday, April 19, 2016 4:37 PM
To: Florian; maker-devel
Cc: Campbell, Michael
Subject: Re: [maker-devel] A way to compare 2 annotation runs?

The Sequence Ontology provides some tools for this:

SOBAcl has some pre-configured reports/graphs with some flexibility to modify their content/layout.
https://github.com/The-Sequence-Ontology/SOBA

This simple example provides a table, for two GFF3 files, of the count of feature types:

    SOBAcl --columns file --rows type --data type --data_type count \
        data/dmel-all-r5.30_0001000.gff data/dmel-all-r5.30_0010000.gff

More complex examples are available in the test file SOBA/t/sobacl_test.sh

The GAL library is a Perl library that works well with MAKER output and other valid GFF3 documents. It has some scripts that would provide metrics along the lines of what you're looking for, but it is primarily a programming library that makes it easy to roll your own.

https://github.com/The-Sequence-Ontology/GAL

If you're OK with a little bit of Perl code, then by modifying the synopsis code in the README a bit, the splice complexity metrics described here (http://www.ncbi.nlm.nih.gov/pubmed/19236712) are easy to produce:

    use GAL::Annotation;

    my $annot    = GAL::Annotation->new(qw(file.gff file.fasta));
    my $features = $annot->features;

    # Print the feature ID and splice complexity of every gene
    my $genes = $features->search( {type => 'gene'} );
    while (my $gene = $genes->next) {
        print $gene->feature_id . "\t";
        print $gene->splice_complexity . "\n";
    }

Hope that helps,
Barry

On Apr 19, 2016, at 9:08 AM, Carson Holt wrote:

I'm going to ask Michael Campbell to answer this. He wrote a protocols paper that will help.

--Carson

On Apr 19, 2016, at 6:08 AM, Florian wrote:

Hello All,

We ran MAKER on a newly assembled genome for 3 iterations, since 2 seems to be the recommended standard and while on holiday I just ran it a third time.

Now I want to compare the results of the iterations to see where the annotation (hopefully) improved or changed, but I can't really come up with a clever way to do this.

I reckon this has to be an often-solved problem, though I couldn't find a solution except an older entry on this mailing list, and that wasn't helpful.

So how are people assessing the quality of a MAKER run? How do you say one run was 'better' than another?
best regards & thanks for your input,
Florian

_______________________________________________
maker-devel mailing list
maker-devel at yandell-lab.org
http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org

From qlian003 at ucr.edu Wed Apr 27 12:06:48 2016
From: qlian003 at ucr.edu (Qihua Liang)
Date: Wed, 27 Apr 2016 18:06:48 -0000
Subject: [maker-devel] Maker example data for 2013 GMOD summer school
In-Reply-To: <1772AAA1-C6ED-4FCA-B4C9-39F522D3D076@genetics.utah.edu>
References: <1772AAA1-C6ED-4FCA-B4C9-39F522D3D076@genetics.utah.edu>
Message-ID: <8F27CEB4-B16B-4BDC-BA11-5FCCBD05BC3C@ucr.edu>

Hi Daniel,

I have been using MAKER to annotate the cowpea genome for a while, and now I am wondering whether I could use multiple threads instead of a single one. It has been running tblastx for a very long time on a single thread, but I couldn't find a setting in the documentation for assigning multiple threads to MAKER. Is there such an option?

Thank you
Qihua

> On Mar 30, 2016, at 2:17 PM, Daniel Ence wrote:
>
> Hi Qihua,
>
> I believe that most of the data we used in the tutorials are available in the maker/data directory, which is included in all MAKER distributions. Please let me know if that isn't the case.
>
> ~Daniel
>
> Daniel Ence
> Graduate Student
> Eccles Institute of Human Genetics
> University of Utah
> 15 North 2030 East, Room 2100
> Salt Lake City, UT 84112-5330
>
>> On Mar 30, 2016, at 3:10 PM, Qihua Liang wrote:
>>
>> Hi Michael and Daniel,
>>
>> I am a graduate student at UC Riverside, and recently I have been learning to use MAKER for genome annotation.
>> I was trying to find some tutorials to follow and practice on example data, and I found out that you gave a talk on MAKER during the 2013 GMOD summer school, and the tutorial for it is very detailed. Nice job!
>>
>> But the example data under the folder you mention, ./maker/maker_course, is not provided on the website, and I am wondering whether it is available to the public. If so, could you send me those materials so that I can follow your tutorial to practice using MAKER?
>>
>> Thank you
>> Best
>> Qihua

From qlian003 at ucr.edu Wed Apr 27 12:35:09 2016
From: qlian003 at ucr.edu (Qihua Liang)
Date: Wed, 27 Apr 2016 18:35:09 -0000
Subject: [maker-devel] Maker example data for 2013 GMOD summer school
In-Reply-To: <2572DB54-6C29-483E-AAAB-7626FEE76DFC@genetics.utah.edu>
References: <1772AAA1-C6ED-4FCA-B4C9-39F522D3D076@genetics.utah.edu> <8F27CEB4-B16B-4BDC-BA11-5FCCBD05BC3C@ucr.edu> <2572DB54-6C29-483E-AAAB-7626FEE76DFC@genetics.utah.edu>
Message-ID:

Hi Daniel,

Actually, I'm blasting with both cowpea RNA-seq and common bean RNA-seq. And yes, the datasets are large, so it has already taken a couple of weeks and it is still running. Do you have any advice on speeding up this process?

Thanks
Qihua

> On Apr 27, 2016, at 11:16 AM, Daniel Ence wrote:
>
> Hi Qihua,
>
> In the maker_opts.ctl file there is an option 'cpus' which lets you tell BLAST to use more than one CPU. The comment on that line says that you should not set it higher than 1 when using MPI. I believe the reason is that each MPI process runs BLAST on its own, so the number of CPUs used will be the number of MPI processes times the number of CPUs for BLAST, which can quickly get larger than the number of CPUs available.
>
> At the same time, it's usually not advisable to use tblastx to align large datasets because of the increased amount of time it takes. Are these RNA-seq datasets from another species that you're using tblastx for?
>
> Daniel Ence
> Graduate Student
> Eccles Institute of Human Genetics
> University of Utah
> 15 North 2030 East, Room 2100
> Salt Lake City, UT 84112-5330
>
>> On Apr 27, 2016, at 12:06 PM, Qihua Liang wrote:
>>
>> Hi Daniel,
>>
>> I have been using MAKER to annotate the cowpea genome for a while, and now I am wondering whether I could use multiple threads instead of a single one. It has been running tblastx for a very long time on a single thread, but I couldn't find a setting in the documentation for assigning multiple threads to MAKER. Is there such an option?
>>
>> Thank you
>> Qihua
>>
>>> On Mar 30, 2016, at 2:17 PM, Daniel Ence wrote:
>>>
>>> Hi Qihua,
>>>
>>> I believe that most of the data we used in the tutorials are available in the maker/data directory, which is included in all MAKER distributions. Please let me know if that isn't the case.
>>>
>>> ~Daniel
>>>
>>> Daniel Ence
>>> Graduate Student
>>> Eccles Institute of Human Genetics
>>> University of Utah
>>> 15 North 2030 East, Room 2100
>>> Salt Lake City, UT 84112-5330
>>>
>>>> On Mar 30, 2016, at 3:10 PM, Qihua Liang wrote:
>>>>
>>>> Hi Michael and Daniel,
>>>>
>>>> I am a graduate student at UC Riverside, and recently I have been learning to use MAKER for genome annotation. I was trying to find some tutorials to follow and practice on example data, and I found out that you gave a talk on MAKER during the 2013 GMOD summer school, and the tutorial for it is very detailed. Nice job!
>>>>
>>>> But the example data under the folder you mention, ./maker/maker_course, is not provided on the website, and I am wondering whether it is available to the public. If so, could you send me those materials so that I can follow your tutorial to practice using MAKER?
>>>>
>>>> Thank you
>>>> Best
>>>> Qihua
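
The arithmetic behind Daniel's note on 'cpus' and MPI in this thread can be sketched in a few lines of Python. This is an illustration only and not part of MAKER: the function names and the 16-core node are hypothetical, and the only thing taken from the thread is the rule that the CPUs requested equal the number of MPI processes times the 'cpus' value.

```python
def total_blast_cpus(mpi_processes, cpus_per_blast):
    """CPUs requested when every MPI process runs BLAST with `cpus_per_blast` CPUs."""
    if mpi_processes < 1 or cpus_per_blast < 1:
        raise ValueError("both counts must be at least 1")
    return mpi_processes * cpus_per_blast


def is_oversubscribed(mpi_processes, cpus_per_blast, cores_available):
    """True if the requested CPUs exceed the cores on the machine."""
    return total_blast_cpus(mpi_processes, cpus_per_blast) > cores_available


# Hypothetical 16-core node: (MPI processes, cpus setting) combinations.
CORES = 16
for procs, cpus in [(1, 16), (16, 1), (16, 4)]:
    requested = total_blast_cpus(procs, cpus)
    status = "oversubscribed" if is_oversubscribed(procs, cpus, CORES) else "fits"
    print(f"{procs} process(es) x cpus={cpus} -> {requested} CPUs requested ({status})")
```

On such a node, one process with cpus=16 and sixteen processes with cpus=1 both request exactly the available cores, while combining the two settings requests four times as many, which is the situation the control-file comment warns about.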