From carsonhh at gmail.com Sun Dec 2 17:18:09 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Sun, 2 Dec 2018 16:18:09 -0700
Subject: [maker-devel] RNA-seq gff file cause MAKER running arrest
In-Reply-To: <2018113011234848442023@genetics.ac.cn>
References: <2018113011234848442023@genetics.ac.cn>
Message-ID: <0C187AB6-158C-40BB-A294-EED9C5B6FECD@gmail.com>

cufflinks2gff3 is not a general GTF-to-GFF3 converter. It works with Cufflinks output, but I don't think it will convert StringTie output. Also, if you are trying to use est2genome=1, it only works with FASTA files.

--Carson

> On Nov 29, 2018, at 8:24 PM, ytshen at genetics.ac.cn wrote:
>
> Dears,
>
> I am trying to run MAKER for a plant genome and want to use RNA-seq information for my annotation. For this purpose, I aligned RNA-seq reads to my genome with HISAT2, assembled transcripts with StringTie, and converted the StringTie GTF output into a GFF file with the cufflinks2gff3 script provided with MAKER (ver 2.31.10).
>
> When I set the GFF file as est_gff in the maker_opts.ctl file, the MAKER run stops at some point, but it doesn't report any error or warning and also doesn't exit. But when I convert the GFF file into a FASTA file and set it as est_fasta in maker_opts.ctl, MAKER runs successfully.
>
> Has this happened to anyone else? How can I fix it? Hoping for your reply!
>
> Thank you very much!
>
> Sincerely,
> Yanting
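The FASTA workaround described above (turning assembled transcripts into sequences instead of passing a converted GFF) can be sketched roughly as follows. This is only an illustration, assuming a StringTie-style GTF with tab-separated exon lines carrying a transcript_id attribute; the function names (read_fasta, parse_gtf_attributes, transcripts_from_gtf, to_fasta) are hypothetical and not part of MAKER, and a dedicated tool such as gffread is likely the more robust choice in practice.

```python
from collections import defaultdict

# Complement table for reverse-complementing minus-strand transcripts.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def read_fasta(text):
    """Parse FASTA text into {sequence_id: sequence}."""
    chunks, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            chunks[name] = []
        elif line and name is not None:
            chunks[name].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in chunks.items()}


def parse_gtf_attributes(field):
    """Parse a GTF attribute column (key "value"; key "value";) into a dict."""
    attrs = {}
    for item in field.split(";"):
        item = item.strip()
        if item:
            key, _, value = item.partition(" ")
            attrs[key] = value.strip().strip('"')
    return attrs


def transcripts_from_gtf(gtf_text, genome):
    """Return {transcript_id: spliced sequence} built from GTF exon lines."""
    exons = defaultdict(list)
    for line in gtf_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 9 or cols[2] != "exon":
            continue
        tid = parse_gtf_attributes(cols[8]).get("transcript_id")
        if tid:
            exons[tid].append((cols[0], int(cols[3]), int(cols[4]), cols[6]))
    spliced = {}
    for tid, parts in exons.items():
        parts.sort(key=lambda exon: exon[1])  # genomic order
        # GTF coordinates are 1-based and inclusive.
        seq = "".join(genome[seqid][start - 1:end] for seqid, start, end, _ in parts)
        if parts[0][3] == "-":
            seq = seq.translate(COMPLEMENT)[::-1]
        spliced[tid] = seq
    return spliced


def to_fasta(seqs):
    """Format {name: sequence} as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in seqs.items())


# Toy data (made up for illustration only).
genome = read_fasta(">chr1 demo\nAAACCC\nGGGTTT\n")
gtf = "\n".join([
    'chr1\tStringTie\ttranscript\t1\t9\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    'chr1\tStringTie\texon\t1\t3\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    'chr1\tStringTie\texon\t7\t9\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    'chr1\tStringTie\texon\t1\t3\t.\t-\t.\tgene_id "g2"; transcript_id "t2";',
    'chr1\tStringTie\texon\t4\t6\t.\t-\t.\tgene_id "g2"; transcript_id "t2";',
])
transcripts = transcripts_from_gtf(gtf, genome)
print(to_fasta(transcripts), end="")
```

The resulting FASTA could then be supplied to MAKER as transcript evidence in place of the converted GFF. Unstranded records (strand ".") are treated as forward-strand here, which is a simplification.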
From jacques.dainat at nbis.se Mon Dec 17 04:20:08 2018
From: jacques.dainat at nbis.se (Jacques Dainat)
Date: Mon, 17 Dec 2018 11:20:08 +0100
Subject: [maker-devel] Re Best annotation set failure (auto_annotator.pm line 1774)
Message-ID:

I got the same error as previously mentioned by Greer Dolby on the 15th of June. I am sharing my experience hoping it can help to catch the problem.

.processing 0 of 5
...processing 1 of 5
...processing 2 of 5
...processing 3 of 5
...processing 4 of 5
...processing 0 of 6
...processing 1 of 6
...processing 2 of 6
...processing 3 of 6
...processing 4 of 6
...processing 5 of 6
adding statistics to annotations
Calculating annotation quality statistics
choosing best annotation set
Choosing best annotations
Died at /sw/bioinfo/maker/3.01.02-beta-OMPI-perl5.16/bin/../lib/maker/auto_annotator.pm line 1774.
--> rank=36, hostname=bnode-06
ERROR: Failed while choosing best annotation set
ERROR: Chunk failed at level:4, tier_type:4
FAILED CONTIG:@000261F|arrow|arrow

I'm using:
- MAKER 3.01.02-beta
- openmpi 2.1.2
- perl 5.16.3
- BioPerl 1.6.922
- Exonerate 2.4
- blast+ 2.7.1
- evm 1.1.1
- genemark 4.3
- augustus 2.7
- repeat masker 4.0.3

As external data I use proteins in fasta format, transcriptomes in fasta format and transcriptomes in gff format.

I first ran MAKER with est2genome and protein2genome set to 1 and single_exon=1 (important for the bug, it seems).
=> it worked fine

Then I ran the same data with est2genome and protein2genome set to 0, with genemark, snap, augustus and evm activated, and still single_exon=1.
=> it worked fine

I wanted to give it a last try with the same parameters as the previous run but with single_exon=0.
=> 2 contigs crash out of the 6000 contigs in my assembly

I extracted the sequence and relaunched MAKER on it alone in a fresh folder. I get exactly the same error.

Best regards,

Jacques
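For anyone wanting to reproduce a failure like this on a single sequence, pulling one record out of a multi-FASTA assembly can be sketched as below. This is a minimal illustration, not the procedure used in the report above: extract_contig is a hypothetical helper, the contig names in the demo are made up, and whether the leading "@" in the log line is part of the real contig name is not known from the message.

```python
def extract_contig(fasta_text, contig_id):
    """Return the FASTA record whose header's first word equals contig_id.

    Returns None when no record matches.
    """
    keep, record = False, []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            if keep:
                break  # the wanted record has ended
            fields = line[1:].split()
            keep = bool(fields) and fields[0] == contig_id
        if keep:
            record.append(line)
    return "\n".join(record) + "\n" if record else None


# Toy assembly (made-up names and sequences, for illustration only).
assembly = (
    ">ctgA|arrow|arrow\n"
    "AAAA\n"
    ">ctgB|arrow|arrow length=6\n"
    "ACGT\n"
    "TT\n"
    ">ctgC|arrow|arrow\n"
    "GGGG\n"
)
single = extract_contig(assembly, "ctgB|arrow|arrow")
print(single, end="")
```

Writing the returned text to its own file and pointing a fresh MAKER run at it keeps the test case small enough to share on the list.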
From carsonhh at gmail.com Mon Dec 17 11:38:10 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 17 Dec 2018 10:38:10 -0700
Subject: [maker-devel] About split genes in MAKER annotation
In-Reply-To: <75508AB460A77C4798EC49425637E29264DC6854@PETREL-MA.imcb.a-star.edu.sg>
References: <75508AB460A77C4798EC49425637E29264DC6854@PETREL-MA.imcb.a-star.edu.sg>
Message-ID: <2C242CC7-B957-4E5D-974E-71D85E67D7B0@gmail.com>

It's best to look at these in a browser like Apollo, where you can also manipulate the intron/exon structure. What you will often find is that something breaks the ORF or breaks splicing, so the predictors can't build an end-to-end model even with the hints given. If you have a GFF3 just for the contig, I can also look at it in a browser to help point out the logic that led to the model.

--Carson

From prashantns at imcb.a-star.edu.sg Wed Dec 12 04:29:17 2018
From: prashantns at imcb.a-star.edu.sg (Prashant Narendra SHINGATE)
Date: Wed, 12 Dec 2018 10:29:17 +0000
Subject: [maker-devel] About split genes in MAKER annotation
In-Reply-To:
References:
Message-ID: <75508AB460A77C4798EC49425637E29264DC6854@PETREL-MA.imcb.a-star.edu.sg>

Hi Carson,

I am Prashant, a bioinformatics postdoctoral fellow from Prof B Venkatesh's lab, IMCB, A*STAR. I am using MAKER to annotate an invertebrate genome (~2 Gb). During the annotation process, we found several instances of split genes even though we have full-length reference protein sequences from very closely related species. Hence we decided to look at one of the loci to understand the reason behind it and to optimize the parameters.

We looked at a gene that is ~110 kb long and codes for a ~1200 amino acid protein. We have a highly identical reference protein (>90% identity and 100% coverage) from another species. In addition, we also have a high-coverage Trinity transcript assembly from our species. Still, this gene is split into 4 fragments during the evidence-based MAKER run. On a closer look, we found that the above-mentioned closely related protein is not aligned by exonerate (protein2genome), even though it is the closest protein to this gene in our dataset. It looks like the program is giving more weight to transcripts, which are typically fragments of the gene. So we are at a loss as to how to predict this gene in full.

For your reference, I am enclosing the maker_opts.ctl and maker_bopts.ctl files. I will be glad to share the scaffold sequence and other input files if required.

Can you please help me understand why MAKER is not able to use the full-length reference protein for gene prediction, and how we can overcome this problem?

Thanks for your time and help.

Best regards,

Prashant Shingate, PhD :: Research Fellow :: Comparative and Medical Genomics Lab :: Institute of Molecular and Cell Biology (IMCB) :: Agency for Science, Technology and Research (A*STAR)
61 Biopolis Drive :: #05-04 Proteos :: Singapore 138673 :: DID (+65) 6586 9570 :: Fax (+65) 6779 1117 :: http://www.imcb.a-star.edu.sg/
We advance science and develop innovative technology to further economic growth and improve lives.

Note: This message may contain confidential information. If this Email/Fax has been sent to you by mistake, please notify the sender and delete it immediately. Thank you.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.ctl
Type: application/octet-stream
Size: 5147 bytes
Desc: maker_opts.ctl
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.ctl
Type: application/octet-stream
Size: 5147 bytes
Desc: maker_opts.ctl
URL:

From carsonhh at gmail.com Tue Dec 18 09:14:40 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 18 Dec 2018 08:14:40 -0700
Subject: [maker-devel] About split genes in MAKER annotation
In-Reply-To: <75508AB460A77C4798EC49425637E29264DC97F5@PETREL-MA.imcb.a-star.edu.sg>
References: <75508AB460A77C4798EC49425637E29264DC6854@PETREL-MA.imcb.a-star.edu.sg> <2C242CC7-B957-4E5D-974E-71D85E67D7B0@gmail.com> <75508AB460A77C4798EC49425637E29264DC97B7@PETREL-MA.imcb.a-star.edu.sg> <75508AB460A77C4798EC49425637E29264DC97F5@PETREL-MA.imcb.a-star.edu.sg>
Message-ID: <3E3C2D15-C733-448B-A317-23696AEF6809@gmail.com>

You are using est2genome and protein2genome. That is not gene prediction; it just tiles ESTs or tries to turn protein alignments directly into rough models to be used as training sets. That is why the gene is split: there is no long transcript alignment, just two alignments that are cut and pasted directly from exonerate down onto the assembly. You should not use est2genome or protein2genome output as final models. You need to train SNAP or Augustus.

See Documentation -->
http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_WGS_Assembly_and_Annotation_Winter_School_2018#Training_ab_initio_Gene_Predictors

--Carson

From prashantns at imcb.a-star.edu.sg Tue Dec 18 01:23:39 2018
From: prashantns at imcb.a-star.edu.sg (Prashant Narendra SHINGATE)
Date: Tue, 18 Dec 2018 07:23:39 +0000
Subject: [maker-devel] About split genes in MAKER annotation
In-Reply-To:
References: <75508AB460A77C4798EC49425637E29264DC6854@PETREL-MA.imcb.a-star.edu.sg> <2C242CC7-B957-4E5D-974E-71D85E67D7B0@gmail.com> <75508AB460A77C4798EC49425637E29264DC97B7@PETREL-MA.imcb.a-star.edu.sg>
Message-ID: <75508AB460A77C4798EC49425637E29264DC97F5@PETREL-MA.imcb.a-star.edu.sg>

Hi Carson,

Thank you for the reply.

Please find attached the GFF file of Scaffold61 along with the related ctl files. The gene I am referring to spans coordinates 698,453 to 840,581 bp.
Several proteins, short and long, are aligned to this locus, including a 1200 aa protein and a 2,069 aa protein (XP_022239675.1). The latter is aligned completely to scaffold61 from 698,453 to 840,581 bp with >90% identity. Please see the line for this alignment from the GFF file:

SCAffold61 blastx protein_match 698453 840581 730 - . ID=SCAffold61:hit:1861:3.10.0.0;Name=XP_022239675.1

However, in the evidence-based run, this gene is split into two fragments. Please see the related lines from the GFF file as follows:

SCAffold61 maker gene 708052 717805 . - . ID=maker-SCAffold61-exonerate_est2genome-gene-0.95;Name=maker-SCAffold61-exonerate_est2genome-gene-0.95
SCAffold61 maker gene 748651 770415 . - . ID=maker-SCAffold61-exonerate_est2genome-gene-0.96;Name=maker-SCAffold61-exonerate_est2genome-gene-0.96

It looks like the Exonerate prediction is based only on ESTs, which are fragmented, and the full-length protein aligned to this locus is completely ignored. We have seen this kind of priority for ESTs at other loci as well, resulting in split gene predictions (sometimes 3 to 4 fragments) in spite of longer full-length proteins aligning to the assembly. Our ESTs (Trinity-assembled RNA-seq transcripts) were generated from the same individual whose genome was sequenced (and hence the identity is close to 100%). If we align only proteins, Exonerate still splits the gene based on shorter proteins aligned to the locus. I would really appreciate it if you could help us solve this splitting of genes despite the alignment of full-length proteins to the assembly.

I will be glad to send all reference protein and transcript sequences used for annotation, if required.

Thanks for your time and help.

Best regards,

Prashant Shingate
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_bopts.ctl
Type: application/octet-stream
Size: 1416 bytes
Desc: maker_bopts.ctl
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.ctl
Type: application/octet-stream
Size: 5150 bytes
Desc: maker_opts.ctl
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: SCAffold61.zip
Type: application/x-zip-compressed
Size: 439697 bytes
Desc: SCAffold61.zip
URL:

From prashantns at imcb.a-star.edu.sg Tue Dec 18 21:20:13 2018
From: prashantns at imcb.a-star.edu.sg (Prashant Narendra SHINGATE)
Date: Wed, 19 Dec 2018 03:20:13 +0000
Subject: [maker-devel] About split genes in MAKER annotation
In-Reply-To: <3E3C2D15-C733-448B-A317-23696AEF6809@gmail.com>
References: <75508AB460A77C4798EC49425637E29264DC6854@PETREL-MA.imcb.a-star.edu.sg> <2C242CC7-B957-4E5D-974E-71D85E67D7B0@gmail.com> <75508AB460A77C4798EC49425637E29264DC97B7@PETREL-MA.imcb.a-star.edu.sg> <75508AB460A77C4798EC49425637E29264DC97F5@PETREL-MA.imcb.a-star.edu.sg> <3E3C2D15-C733-448B-A317-23696AEF6809@gmail.com>
Message-ID: <75508AB460A77C4798EC49425637E29264DC9E36@PETREL-MA.imcb.a-star.edu.sg>

Dear Carson,

Thanks for clarifying that I should not take the gene models from the evidence-based run as final, but should train AUGUSTUS instead. I will carry out AUGUSTUS training using the rough gene models predicted in the evidence-based run and also follow the entire annotation protocol. I am assuming that even though Exonerate splits genes based on aligned ESTs/proteins, AUGUSTUS will be able to predict full-length genes.

I need your suggestion on AUGUSTUS training. We have ~10,000 full-length transcripts and corresponding proteins (besides a large number of fragments) generated by assembling RNA-seq transcripts from the same individual used for genome sequencing. However, I understand from the AUGUSTUS training tutorial that ~1000 gene models are good enough to train AUGUSTUS. So is it a good idea to choose rough gene models predicted from the 10,000 full-length transcripts/proteins for training AUGUSTUS, or should I randomly pick 1000 gene models from the entire evidence-based run?

Also, regarding the hint files used as input for AUGUSTUS: should I include BLAST alignments of transcripts (BLASTN) and proteins (BLASTX) in the hint files, or are only exonerate alignments recommended? Please clarify.

Thanks once again for your help and time.

Best regards,

Prashant Shingate
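The split reported earlier in this thread (one protein_match hit on SCAffold61 spanning two separate maker gene models) is the kind of pattern that can be screened for across a whole GFF3. The following is a rough sketch under stated assumptions: the GFF3 is tab-separated with nine columns, MAKER gene models carry source "maker" and type "gene", and a locus is flagged when two or more same-strand genes fall entirely inside one protein_match span. The helper names (parse_gff3, split_candidates) are hypothetical and not part of MAKER.

```python
def parse_gff3(text):
    """Parse nine-column GFF3 feature lines into a list of dicts."""
    feats = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        feats.append({
            "seqid": cols[0], "source": cols[1], "type": cols[2],
            "start": int(cols[3]), "end": int(cols[4]),
            "strand": cols[6], "attrs": attrs,
        })
    return feats


def split_candidates(feats, min_genes=2):
    """Map protein hit name -> IDs of maker genes lying fully inside the hit.

    Only hits containing at least min_genes genes on the same strand are kept.
    Hits sharing a Name overwrite each other; acceptable for a quick screen.
    """
    genes = [f for f in feats if f["source"] == "maker" and f["type"] == "gene"]
    hits = [f for f in feats if f["type"] == "protein_match"]
    flagged = {}
    for hit in hits:
        inside = [
            g["attrs"].get("ID") for g in genes
            if g["seqid"] == hit["seqid"] and g["strand"] == hit["strand"]
            and g["start"] >= hit["start"] and g["end"] <= hit["end"]
        ]
        if len(inside) >= min_genes:
            flagged[hit["attrs"].get("Name")] = inside
    return flagged


# The three feature lines quoted in this thread, rebuilt with tab separators.
gene_a = "maker-SCAffold61-exonerate_est2genome-gene-0.95"
gene_b = "maker-SCAffold61-exonerate_est2genome-gene-0.96"
rows = [
    ["SCAffold61", "blastx", "protein_match", "698453", "840581", "730", "-", ".",
     "ID=SCAffold61:hit:1861:3.10.0.0;Name=XP_022239675.1"],
    ["SCAffold61", "maker", "gene", "708052", "717805", ".", "-", ".",
     f"ID={gene_a};Name={gene_a}"],
    ["SCAffold61", "maker", "gene", "748651", "770415", ".", "-", ".",
     f"ID={gene_b};Name={gene_b}"],
]
gff3_text = "\n".join("\t".join(row) for row in rows)
flagged = split_candidates(parse_gff3(gff3_text))
print(flagged)
```

A hit spanning several genes is only a hint: tandem paralogs or a spurious long BLASTX hit can give the same pattern, so flagged loci still need inspection in a browser.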
From carsonhh at gmail.com Wed Dec 19 11:14:25 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 19 Dec 2018 10:14:25 -0700
Subject: [maker-devel] About split genes in MAKER annotation
In-Reply-To: <75508AB460A77C4798EC49425637E29264DC9E36@PETREL-MA.imcb.a-star.edu.sg>
References: <75508AB460A77C4798EC49425637E29264DC6854@PETREL-MA.imcb.a-star.edu.sg> <2C242CC7-B957-4E5D-974E-71D85E67D7B0@gmail.com> <75508AB460A77C4798EC49425637E29264DC97B7@PETREL-MA.imcb.a-star.edu.sg> <75508AB460A77C4798EC49425637E29264DC97F5@PETREL-MA.imcb.a-star.edu.sg> <3E3C2D15-C733-448B-A317-23696AEF6809@gmail.com> <75508AB460A77C4798EC49425637E29264DC9E36@PETREL-MA.imcb.a-star.edu.sg>
Message-ID:

Given that you have plenty of protein evidence aligning well, I would suggest you use protein2genome only, and not est2genome, to build your training set (your est2genome results are more fragmented). You can further filter for canonical start and stop codons as well as protein completeness, which will be in the score column in the GFF3.

Once trained, run MAKER again with augustus_species= set to your newly trained species. MAKER will run Augustus for you and generate the hints for Augustus using the est= and protein= files you provide.

--Carson

> On Dec 18, 2018, at 8:20 PM, Prashant Narendra SHINGATE wrote:
>
> Dear Carson,
>
> Thanks for clarifying that I should not base gene models on evidence based prediction but train AUGUSTUS. I will carry out AUGUSTUS training using rough gene models predicted in evidence based run and also follow entire annotation protocol. I am assuming that even though Exonerate splits genes based on aligned ESTs/proteins, AUGUSTUS will be able to predict full-length genes.
>
> I need your suggestion on AUGUSTUS training. We have ~10,000 full-length transcripts and corresponding proteins (besides a large number of fragments) generated by assembling RNAseq transcripts from the same individual used for genome sequencing.
However, I understand from the AUGUSTUS training tutorial that ~1000 gene models are good enough to train AUGUSTUS. So is it a good idea to choose the rough gene models predicted from the 10,000 full-length transcripts/proteins for training AUGUSTUS, or should I randomly pick 1000 gene models from the entire evidence-based run? > > Also, regarding the hint files used as input for AUGUSTUS, should I include BLAST alignments of transcripts (BLASTN) and proteins (BLASTX) in the hint files, or are only exonerate alignments recommended? Please clarify. > > Thanks once again for your help and time. > > Best Regards, > > Prashant Shingate, PhD :: Research Fellow :: Comparative and Medical Genomics Lab :: Institute of Molecular and Cell Biology (IMCB) :: Agency for Science, Technology and Research (A*STAR) > 61 Biopolis Drive :: #05-04 Proteos :: Singapore 138673 :: DID (+65) 6586 9570 :: Fax (+65) 6779 1117 :: http://www.imcb.a-star.edu.sg/ > We advance science and develop innovative technology to further economic growth and improve lives. > > > From: Carson Holt [mailto:carsonhh at gmail.com] > Sent: Tuesday, 18 December, 2018 11:15 PM > To: Prashant Narendra SHINGATE > Cc: Byrappa VENKATESH; maker-devel at yandell-lab.org > Subject: Re: About split genes in MAKER annotation > > You are using est2genome and protein2genome. It is not doing gene prediction; rather, it's just tiling ESTs or trying to turn protein alignments directly into rough models to be used as training sets. That is why the gene is split, because there is no long transcript alignment, just two alignments that are cut and pasted directly from exonerate down onto the assembly. You should not use est2genome or protein2genome as final models. You need to train SNAP or Augustus.
> > See Documentation --> > http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_WGS_Assembly_and_Annotation_Winter_School_2018#Training_ab_initio_Gene_Predictors > > --Carson > > > > On Dec 18, 2018, at 12:23 AM, Prashant Narendra SHINGATE wrote: > > Hi Carson, > > Thank you for the reply. > > Please find attached the GFF file of Scaffold61 along with the related ctl files. The coordinates of the gene I am referring to are 698,453 to 840,581 bp. Several proteins, short and long, are aligned to this locus, including a 1200 aa protein and a 2,069 aa (XP_022239675.1) protein. The latter is aligned completely to scaffold61 from 698,453 to 840,581 bp with >90% identity. Please see the related line for this alignment from the GFF file. > > SCAffold61 blastx protein_match 698453 840581 730 - . ID=SCAffold61:hit:1861:3.10.0.0;Name=XP_022239675.1 > > However, in the evidence-based run, this gene is split into two fragments. Please see the related lines from the GFF file as follows: > > SCAffold61 maker gene 708052 717805 . - . ID=maker-SCAffold61-exonerate_est2genome-gene-0.95;Name=maker-SCAffold61-exonerate_est2genome-gene-0.95 > SCAffold61 maker gene 748651 770415 . - . ID=maker-SCAffold61-exonerate_est2genome-gene-0.96;Name=maker-SCAffold61-exonerate_est2genome-gene-0.96 > > It looks like the Exonerate prediction is based only on ESTs, which are fragmented, and the full-length protein aligned to this locus is completely ignored. We have seen this type of priority for ESTs in other loci as well, resulting in split gene predictions (sometimes 3 to 4 fragments) in spite of the alignment of longer full-length proteins to the assembly. Our ESTs (Trinity-assembled RNAseq transcripts) were generated from the same individual whose genome was sequenced (and hence the identity is close to 100%). If we align only proteins, Exonerate still splits the gene based on shorter proteins aligned to the locus.
I would really appreciate it if you could help us solve this splitting of genes despite the alignment of full-length proteins to the assembly. > > I will be glad to send all reference protein and transcript sequences used for annotation, if required. > > Thanks for your time and help. > > Best regards, > > Prashant Shingate, PhD :: Research Fellow :: Comparative and Medical Genomics Lab :: Institute of Molecular and Cell Biology (IMCB) :: Agency for Science, Technology and Research (A*STAR) > 61 Biopolis Drive :: #05-04 Proteos :: Singapore 138673 :: DID (+65) 6586 9570 :: Fax (+65) 6779 1117 :: http://www.imcb.a-star.edu.sg/ > We advance science and develop innovative technology to further economic growth and improve lives. > > > From: Carson Holt [mailto:carsonhh at gmail.com] > Sent: Tuesday, 18 December, 2018 1:38 AM > To: Prashant Narendra SHINGATE > Cc: Byrappa VENKATESH; maker-devel at yandell-lab.org > Subject: Re: About split genes in MAKER annotation > > It's best to look at these in a browser like Apollo where you can also manipulate the intron/exon structure. What you will often find is that there is something that breaks the ORF or breaks splicing, so the predictors can't build an end-to-end model even with the hints given. If you have a GFF3 just for the contig, I can also look at it in a browser to help point out the logic that led to the model. > > --Carson > > On Dec 12, 2018, at 3:29 AM, Prashant Narendra SHINGATE wrote: > > Hi Carson, > > I am Prashant, a bioinformatics postdoctoral fellow from Prof B Venkatesh's lab, IMCB, A*STAR. I am using MAKER to annotate an invertebrate genome (~2Gb). During the annotation process, we found several instances of split genes even though we have full-length reference protein sequences from very closely related species. Hence we decided to look at one of the loci to understand the reason behind it and to optimize the parameters. > > We looked at a gene that is ~110kb long and codes for a ~1200 amino acid protein.
We have a highly identical reference protein (>90% identity and 100% coverage) from another species. In addition, we have a high-coverage Trinity transcript assembly from our species. Still, this gene is split into 4 fragments during the evidence-based MAKER run. On a closer look, we found that the above-mentioned closely related protein is not aligned by exonerate (protein2genome) even though it is the closest protein to this gene in our dataset. It looks like the program is giving more weight to transcripts, which are typically fragments of the gene. So we are at a loss as to how to predict this gene in full. > > For your reference, I am enclosing the maker_opts.ctl and maker_bopts.ctl files. I will be glad to share the scaffold sequence and other input files if required. > > Can you please help me understand why MAKER is not able to use the full-length reference protein for gene prediction, and how we can overcome this problem? > > Thanks for your time and help. > > Best regards, > > Prashant Shingate, PhD :: Research Fellow :: Comparative and Medical Genomics Lab :: Institute of Molecular and Cell Biology (IMCB) :: Agency for Science, Technology and Research (A*STAR) > 61 Biopolis Drive :: #05-04 Proteos :: Singapore 138673 :: DID (+65) 6586 9570 :: Fax (+65) 6779 1117 :: http://www.imcb.a-star.edu.sg/ > We advance science and develop innovative technology to further economic growth and improve lives. > > > > > Note: This message may contain confidential information. If this Email/Fax has been sent to you by mistake, please notify the sender and delete it immediately. Thank you.
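[Archive note] The filtering step Carson describes in the reply above (keeping only well-scored features before building a training set) could be sketched roughly as follows. This is a minimal Python sketch and not part of MAKER: the names parse_gff3 and select_by_score and the threshold value are hypothetical, and it assumes only that the sixth GFF3 column holds a numeric score, or "." when no score is given. Checking for canonical start and stop codons would additionally need the genome sequence and is not shown.

```python
from typing import Iterable, Iterator, List, NamedTuple, Optional


class Feature(NamedTuple):
    """One GFF3 feature line, reduced to the fields used here."""
    seqid: str
    source: str
    type: str
    start: int
    end: int
    score: Optional[float]  # None when the score column is "."
    strand: str
    attributes: str


def parse_gff3(lines: Iterable[str]) -> Iterator[Feature]:
    """Yield a Feature for every tab-separated, 9-column GFF3 line."""
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # an embedded sequence section follows; no more features
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        score = None if cols[5] == "." else float(cols[5])
        yield Feature(cols[0], cols[1], cols[2], int(cols[3]),
                      int(cols[4]), score, cols[6], cols[8])


def select_by_score(features: Iterable[Feature],
                    feature_type: str,
                    min_score: float) -> List[Feature]:
    """Keep features of one type whose score is present and >= min_score.

    The threshold is an illustrative assumption; a sensible value depends
    on how the score was produced.
    """
    return [f for f in features
            if f.type == feature_type
            and f.score is not None
            and f.score >= min_score]
```

For example, given the blastx protein_match line quoted in this thread (score 730), select_by_score(features, "protein_match", 500.0) keeps that hit, while the two maker gene lines, whose score column is ".", are never selected.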
-------------- next part -------------- An HTML attachment was scrubbed...
URL: From carsonhh at gmail.com Mon Dec 17 10:38:10 2018 From: carsonhh at gmail.com (Carson Holt) Date: Mon, 17 Dec 2018 10:38:10 -0700 Subject: [maker-devel] About split genes in MAKER annotation In-Reply-To: <75508AB460A77C4798EC49425637E29264DC6854@PETREL-MA.imcb.a-star.edu.sg> References: <75508AB460A77C4798EC49425637E29264DC6854@PETREL-MA.imcb.a-star.edu.sg> Message-ID: <2C242CC7-B957-4E5D-974E-71D85E67D7B0@gmail.com> It's best to look at these in a browser like Apollo where you can also manipulate the intron/exon structure. What you will often find is that there is something that breaks the ORF or breaks splicing, so the predictors can't build an end-to-end model even with the hints given. If you have a GFF3 just for the contig, I can also look at it in a browser to help point out the logic that led to the model. --Carson > On Dec 12, 2018, at 3:29 AM, Prashant Narendra SHINGATE wrote: > > Hi Carson, > > I am Prashant, a bioinformatics postdoctoral fellow from Prof B Venkatesh's lab, IMCB, A*STAR. I am using MAKER to annotate an invertebrate genome (~2Gb). During the annotation process, we found several instances of split genes even though we have full-length reference protein sequences from very closely related species. Hence we decided to look at one of the loci to understand the reason behind it and to optimize the parameters. > > We looked at a gene that is ~110kb long and codes for a ~1200 amino acid protein. We have a highly identical reference protein (>90% identity and 100% coverage) from another species. In addition, we have a high-coverage Trinity transcript assembly from our species. Still, this gene is split into 4 fragments during the evidence-based MAKER run. On a closer look, we found that the above-mentioned closely related protein is not aligned by exonerate (protein2genome) even though it is the closest protein to this gene in our dataset. It looks like the program is giving more weight to transcripts, which are typically fragments of the gene.
So we are at a loss as to how to predict this gene in full. > > For your reference, I am enclosing the maker_opts.ctl and maker_bopts.ctl files. I will be glad to share the scaffold sequence and other input files if required. > > Can you please help me understand why MAKER is not able to use the full-length reference protein for gene prediction, and how we can overcome this problem? > > Thanks for your time and help. > > Best regards, > > Prashant Shingate, PhD :: Research Fellow :: Comparative and Medical Genomics Lab :: Institute of Molecular and Cell Biology (IMCB) :: Agency for Science, Technology and Research (A*STAR) > 61 Biopolis Drive :: #05-04 Proteos :: Singapore 138673 :: DID (+65) 6586 9570 :: Fax (+65) 6779 1117 :: http://www.imcb.a-star.edu.sg/ > We advance science and develop innovative technology to further economic growth and improve lives. > > > > > Note: This message may contain confidential information. If this Email/Fax has been sent to you by mistake, please notify the sender and delete it immediately. Thank you. -------------- next part -------------- An HTML attachment was scrubbed... URL: From prashantns at imcb.a-star.edu.sg Wed Dec 12 03:29:17 2018 From: prashantns at imcb.a-star.edu.sg (Prashant Narendra SHINGATE) Date: Wed, 12 Dec 2018 10:29:17 +0000 Subject: [maker-devel] About split genes in MAKER annotation In-Reply-To: References: Message-ID: <75508AB460A77C4798EC49425637E29264DC6854@PETREL-MA.imcb.a-star.edu.sg> Hi Carson, I am Prashant, a bioinformatics postdoctoral fellow from Prof B Venkatesh's lab, IMCB, A*STAR. I am using MAKER to annotate an invertebrate genome (~2Gb). During the annotation process, we found several instances of split genes even though we have full-length reference protein sequences from very closely related species. Hence we decided to look at one of the loci to understand the reason behind it and to optimize the parameters.
We looked at a gene that is ~110kb long and codes for a ~1200 amino acid protein. We have a highly identical reference protein (>90% identity and 100% coverage) from another species. In addition, we have a high-coverage Trinity transcript assembly from our species. Still, this gene is split into 4 fragments during the evidence-based MAKER run. On a closer look, we found that the above-mentioned closely related protein is not aligned by exonerate (protein2genome) even though it is the closest protein to this gene in our dataset. It looks like the program is giving more weight to transcripts, which are typically fragments of the gene. So we are at a loss as to how to predict this gene in full. For your reference, I am enclosing the maker_opts.ctl and maker_bopts.ctl files. I will be glad to share the scaffold sequence and other input files if required. Can you please help me understand why MAKER is not able to use the full-length reference protein for gene prediction, and how we can overcome this problem? Thanks for your time and help. Best regards, Prashant Shingate, PhD :: Research Fellow :: Comparative and Medical Genomics Lab :: Institute of Molecular and Cell Biology (IMCB) :: Agency for Science, Technology and Research (A*STAR) 61 Biopolis Drive :: #05-04 Proteos :: Singapore 138673 :: DID (+65) 6586 9570 :: Fax (+65) 6779 1117 :: http://www.imcb.a-star.edu.sg/ We advance science and develop innovative technology to further economic growth and improve lives. Note: This message may contain confidential information. If this Email/Fax has been sent to you by mistake, please notify the sender and delete it immediately. Thank you. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: maker_opts.ctl Type: application/octet-stream Size: 5147 bytes Desc: maker_opts.ctl URL: -------------- next part -------------- A non-text attachment was scrubbed...
Name: maker_opts.ctl Type: application/octet-stream Size: 5147 bytes Desc: maker_opts.ctl URL: From carsonhh at gmail.com Tue Dec 18 08:14:40 2018 From: carsonhh at gmail.com (Carson Holt) Date: Tue, 18 Dec 2018 08:14:40 -0700 Subject: [maker-devel] About split genes in MAKER annotation In-Reply-To: <75508AB460A77C4798EC49425637E29264DC97F5@PETREL-MA.imcb.a-star.edu.sg> References: <75508AB460A77C4798EC49425637E29264DC6854@PETREL-MA.imcb.a-star.edu.sg> <2C242CC7-B957-4E5D-974E-71D85E67D7B0@gmail.com> <75508AB460A77C4798EC49425637E29264DC97B7@PETREL-MA.imcb.a-star.edu.sg> <75508AB460A77C4798EC49425637E29264DC97F5@PETREL-MA.imcb.a-star.edu.sg> Message-ID: <3E3C2D15-C733-448B-A317-23696AEF6809@gmail.com> You are using est2genome and protein2genome. It is not doing gene prediction; rather, it's just tiling ESTs or trying to turn protein alignments directly into rough models to be used as training sets. That is why the gene is split, because there is no long transcript alignment, just two alignments that are cut and pasted directly from exonerate down onto the assembly. You should not use est2genome or protein2genome as final models. You need to train SNAP or Augustus. See Documentation --> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_WGS_Assembly_and_Annotation_Winter_School_2018#Training_ab_initio_Gene_Predictors --Carson > On Dec 18, 2018, at 12:23 AM, Prashant Narendra SHINGATE wrote: > > Hi Carson, > > Thank you for the reply. > > Please find attached the GFF file of Scaffold61 along with the related ctl files. The coordinates of the gene I am referring to are 698,453 to 840,581 bp. Several proteins, short and long, are aligned to this locus, including a 1200 aa protein and a 2,069 aa (XP_022239675.1) protein. The latter is aligned completely to scaffold61 from 698,453 to 840,581 bp with >90% identity. Please see the related line for this alignment from the GFF file. > > SCAffold61 blastx protein_match 698453 840581 730 - .
ID=SCAffold61:hit:1861:3.10.0.0;Name=XP_022239675.1 > > However, in the evidence-based run, this gene is split into two fragments. Please see the related lines from the GFF file as follows: > > SCAffold61 maker gene 708052 717805 . - . ID=maker-SCAffold61-exonerate_est2genome-gene-0.95;Name=maker-SCAffold61-exonerate_est2genome-gene-0.95 > SCAffold61 maker gene 748651 770415 . - . ID=maker-SCAffold61-exonerate_est2genome-gene-0.96;Name=maker-SCAffold61-exonerate_est2genome-gene-0.96 > > It looks like the Exonerate prediction is based only on ESTs, which are fragmented, and the full-length protein aligned to this locus is completely ignored. We have seen this type of priority for ESTs in other loci as well, resulting in split gene predictions (sometimes 3 to 4 fragments) in spite of the alignment of longer full-length proteins to the assembly. Our ESTs (Trinity-assembled RNAseq transcripts) were generated from the same individual whose genome was sequenced (and hence the identity is close to 100%). If we align only proteins, Exonerate still splits the gene based on shorter proteins aligned to the locus. I would really appreciate it if you could help us solve this splitting of genes despite the alignment of full-length proteins to the assembly. > > I will be glad to send all reference protein and transcript sequences used for annotation, if required. > > Thanks for your time and help. > > Best regards, > > Prashant Shingate, PhD :: Research Fellow :: Comparative and Medical Genomics Lab :: Institute of Molecular and Cell Biology (IMCB) :: Agency for Science, Technology and Research (A*STAR) > 61 Biopolis Drive :: #05-04 Proteos :: Singapore 138673 :: DID (+65) 6586 9570 :: Fax (+65) 6779 1117 :: http://www.imcb.a-star.edu.sg/ > We advance science and develop innovative technology to further economic growth and improve lives.
> > > From: Carson Holt [mailto:carsonhh at gmail.com] > Sent: Tuesday, 18 December, 2018 1:38 AM > To: Prashant Narendra SHINGATE > Cc: Byrappa VENKATESH; maker-devel at yandell-lab.org > Subject: Re: About split genes in MAKER annotation > > It's best to look at these in a browser like Apollo where you can also manipulate the intron/exon structure. What you will often find is that there is something that breaks the ORF or breaks splicing, so the predictors can't build an end-to-end model even with the hints given. If you have a GFF3 just for the contig, I can also look at it in a browser to help point out the logic that led to the model. > > --Carson > > On Dec 12, 2018, at 3:29 AM, Prashant Narendra SHINGATE wrote: > > Hi Carson, > > I am Prashant, a bioinformatics postdoctoral fellow from Prof B Venkatesh's lab, IMCB, A*STAR. I am using MAKER to annotate an invertebrate genome (~2Gb). During the annotation process, we found several instances of split genes even though we have full-length reference protein sequences from very closely related species. Hence we decided to look at one of the loci to understand the reason behind it and to optimize the parameters. > > We looked at a gene that is ~110kb long and codes for a ~1200 amino acid protein. We have a highly identical reference protein (>90% identity and 100% coverage) from another species. In addition, we have a high-coverage Trinity transcript assembly from our species. Still, this gene is split into 4 fragments during the evidence-based MAKER run. On a closer look, we found that the above-mentioned closely related protein is not aligned by exonerate (protein2genome) even though it is the closest protein to this gene in our dataset. It looks like the program is giving more weight to transcripts, which are typically fragments of the gene. So we are at a loss as to how to predict this gene in full. > > For your reference, I am enclosing the maker_opts.ctl and maker_bopts.ctl files.
I will be glad to share the scaffold sequence and other input files if required. > > Can you please help me understand why MAKER is not able to use the full-length reference protein for gene prediction, and how we can overcome this problem? > > Thanks for your time and help. > > Best regards, > > Prashant Shingate, PhD :: Research Fellow :: Comparative and Medical Genomics Lab :: Institute of Molecular and Cell Biology (IMCB) :: Agency for Science, Technology and Research (A*STAR) > 61 Biopolis Drive :: #05-04 Proteos :: Singapore 138673 :: DID (+65) 6586 9570 :: Fax (+65) 6779 1117 :: http://www.imcb.a-star.edu.sg/ > We advance science and develop innovative technology to further economic growth and improve lives. > > > > > Note: This message may contain confidential information. If this Email/Fax has been sent to you by mistake, please notify the sender and delete it immediately. Thank you. -------------- next part -------------- An HTML attachment was scrubbed... URL: From prashantns at imcb.a-star.edu.sg Tue Dec 18 00:23:39 2018 From: prashantns at imcb.a-star.edu.sg (Prashant Narendra SHINGATE) Date: Tue, 18 Dec 2018 07:23:39 +0000 Subject: [maker-devel] About split genes in MAKER annotation In-Reply-To: References: <75508AB460A77C4798EC49425637E29264DC6854@PETREL-MA.imcb.a-star.edu.sg> <2C242CC7-B957-4E5D-974E-71D85E67D7B0@gmail.com> <75508AB460A77C4798EC49425637E29264DC97B7@PETREL-MA.imcb.a-star.edu.sg> Message-ID: <75508AB460A77C4798EC49425637E29264DC97F5@PETREL-MA.imcb.a-star.edu.sg> Hi Carson, Thank you for the reply. Please find attached the GFF file of Scaffold61 along with the related ctl files. The coordinates of the gene I am referring to are 698,453 to 840,581 bp.
Several proteins, short and long, are aligned to this locus, including a 1200 aa protein and a 2,069 aa (XP_022239675.1) protein. The latter is aligned completely to scaffold61 from 698,453 to 840,581 bp with >90% identity. Please see the related line for this alignment from the GFF file. SCAffold61 blastx protein_match 698453 840581 730 - . ID=SCAffold61:hit:1861:3.10.0.0;Name=XP_022239675.1 However, in the evidence-based run, this gene is split into two fragments. Please see the related lines from the GFF file as follows: SCAffold61 maker gene 708052 717805 . - . ID=maker-SCAffold61-exonerate_est2genome-gene-0.95;Name=maker-SCAffold61-exonerate_est2genome-gene-0.95 SCAffold61 maker gene 748651 770415 . - . ID=maker-SCAffold61-exonerate_est2genome-gene-0.96;Name=maker-SCAffold61-exonerate_est2genome-gene-0.96 It looks like the Exonerate prediction is based only on ESTs, which are fragmented, and the full-length protein aligned to this locus is completely ignored. We have seen this type of priority for ESTs in other loci as well, resulting in split gene predictions (sometimes 3 to 4 fragments) in spite of the alignment of longer full-length proteins to the assembly. Our ESTs (Trinity-assembled RNAseq transcripts) were generated from the same individual whose genome was sequenced (and hence the identity is close to 100%). If we align only proteins, Exonerate still splits the gene based on shorter proteins aligned to the locus. I would really appreciate it if you could help us solve this splitting of genes despite the alignment of full-length proteins to the assembly. I will be glad to send all reference protein and transcript sequences used for annotation, if required. Thanks for your time and help.
Best regards, Prashant Shingate, PhD :: Research Fellow :: Comparative and Medical Genomics Lab :: Institute of Molecular and Cell Biology (IMCB) :: Agency for Science, Technology and Research (A*STAR) 61 Biopolis Drive :: #05-04 Proteos :: Singapore 138673 :: DID (+65) 6586 9570 :: Fax (+65) 6779 1117 :: http://www.imcb.a-star.edu.sg/ We advance science and develop innovative technology to further economic growth and improve lives. From: Carson Holt [mailto:carsonhh at gmail.com] Sent: Tuesday, 18 December, 2018 1:38 AM To: Prashant Narendra SHINGATE Cc: Byrappa VENKATESH; maker-devel at yandell-lab.org Subject: Re: About split genes in MAKER annotation It's best to look at these in a browser like Apollo where you can also manipulate the intron/exon structure. What you will often find is that there is something that breaks the ORF or breaks splicing, so the predictors can't build an end-to-end model even with the hints given. If you have a GFF3 just for the contig, I can also look at it in a browser to help point out the logic that led to the model. --Carson On Dec 12, 2018, at 3:29 AM, Prashant Narendra SHINGATE wrote: Hi Carson, I am Prashant, a bioinformatics postdoctoral fellow from Prof B Venkatesh's lab, IMCB, A*STAR. I am using MAKER to annotate an invertebrate genome (~2Gb). During the annotation process, we found several instances of split genes even though we have full-length reference protein sequences from very closely related species. Hence we decided to look at one of the loci to understand the reason behind it and to optimize the parameters. We looked at a gene that is ~110kb long and codes for a ~1200 amino acid protein. We have a highly identical reference protein (>90% identity and 100% coverage) from another species. In addition, we have a high-coverage Trinity transcript assembly from our species. Still, this gene is split into 4 fragments during the evidence-based MAKER run.
On a closer look, we found that the above-mentioned closely related protein is not aligned by exonerate (protein2genome) even though it is the closest protein to this gene in our dataset. It looks like the program is giving more weight to transcripts, which are typically fragments of the gene. So we are at a loss as to how to predict this gene in full. For your reference, I am enclosing the maker_opts.ctl and maker_bopts.ctl files. I will be glad to share the scaffold sequence and other input files if required. Can you please help me understand why MAKER is not able to use the full-length reference protein for gene prediction, and how we can overcome this problem? Thanks for your time and help. Best regards, Prashant Shingate, PhD :: Research Fellow :: Comparative and Medical Genomics Lab :: Institute of Molecular and Cell Biology (IMCB) :: Agency for Science, Technology and Research (A*STAR) 61 Biopolis Drive :: #05-04 Proteos :: Singapore 138673 :: DID (+65) 6586 9570 :: Fax (+65) 6779 1117 :: http://www.imcb.a-star.edu.sg/ We advance science and develop innovative technology to further economic growth and improve lives. Note: This message may contain confidential information. If this Email/Fax has been sent to you by mistake, please notify the sender and delete it immediately. Thank you. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: maker_bopts.ctl Type: application/octet-stream Size: 1416 bytes Desc: maker_bopts.ctl URL: -------------- next part -------------- A non-text attachment was scrubbed...
Name: maker_opts.ctl Type: application/octet-stream Size: 5150 bytes Desc: maker_opts.ctl URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: SCAffold61.zip Type: application/x-zip-compressed Size: 439697 bytes Desc: SCAffold61.zip URL: From prashantns at imcb.a-star.edu.sg Tue Dec 18 20:20:13 2018 From: prashantns at imcb.a-star.edu.sg (Prashant Narendra SHINGATE) Date: Wed, 19 Dec 2018 03:20:13 +0000 Subject: [maker-devel] About split genes in MAKER annotation In-Reply-To: <3E3C2D15-C733-448B-A317-23696AEF6809@gmail.com> References: <75508AB460A77C4798EC49425637E29264DC6854@PETREL-MA.imcb.a-star.edu.sg> <2C242CC7-B957-4E5D-974E-71D85E67D7B0@gmail.com> <75508AB460A77C4798EC49425637E29264DC97B7@PETREL-MA.imcb.a-star.edu.sg> <75508AB460A77C4798EC49425637E29264DC97F5@PETREL-MA.imcb.a-star.edu.sg> <3E3C2D15-C733-448B-A317-23696AEF6809@gmail.com> Message-ID: <75508AB460A77C4798EC49425637E29264DC9E36@PETREL-MA.imcb.a-star.edu.sg> Dear Carson, Thanks for clarifying that I should not base gene models on evidence-based prediction but should train AUGUSTUS instead. I will carry out AUGUSTUS training using the rough gene models predicted in the evidence-based run and follow the entire annotation protocol. I am assuming that even though Exonerate splits genes based on aligned ESTs/proteins, AUGUSTUS will be able to predict full-length genes. I need your suggestion on AUGUSTUS training. We have ~10,000 full-length transcripts and corresponding proteins (besides a large number of fragments) generated by assembling RNAseq transcripts from the same individual used for genome sequencing. However, I understand from the AUGUSTUS training tutorial that ~1000 gene models are good enough to train AUGUSTUS. So is it a good idea to choose the rough gene models predicted from the 10,000 full-length transcripts/proteins for training AUGUSTUS, or should I randomly pick 1000 gene models from the entire evidence-based run?
Also about hint files used as input for AUGUSTUS, should I include BLAST alignments of transcripts (BLASTN) and proteins (BLASTX) in hint files or only exonerate alignments are recommended in hint files? Please clarify. Thanks once again for your help and time. Best Regards, Prashant Shingate, PhD :: Research Fellow :: Comparative and Medical Genomics Lab :: Institute of Molecular and Cell Biology (IMCB) :: Agency for Science, Technology and Research (A*STAR) 61 Biopolis Drive :: #05-04 Proteos :: Singapore 138673 :: DID (+65) 6586 9570 :: Fax (+65) 6779 1117:: http://www.imcb.a-star.edu.sg/ We advance science and develop innovative technology to further economic growth and improve lives. From: Carson Holt [mailto:carsonhh at gmail.com] Sent: Tuesday, 18 December, 2018 11:15 PM To: Prashant Narendra SHINGATE Cc: Byrappa VENKATESH; maker-devel at yandell-lab.org Subject: Re: About split genes in MAKER annotation You are using est2genome and protein2genome. It is not doing gene prediction, rather it?s just tiling EST?s or trying to turn protein alignments directly into rough models to be used as training sets. That is why the gene is split, because there is no long transcript alignment, just two alignments that are cut and pasted directly from exonerate down onto the assembly. You should not use est2genome or protein2genome as final models. You need to train SNAP or Augustus. See Documentation ?> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_WGS_Assembly_and_Annotation_Winter_School_2018#Training_ab_initio_Gene_Predictors ?Carson On Dec 18, 2018, at 12:23 AM, Prashant Narendra SHINGATE > wrote: Hi Carson, Thank you for the reply. Please find attached the GFF file of Scaffold61 along with related ctl files. The coordinates of the gene I am referring to is from 698,453 to 840,581 bp. Several proteins, short and long are aligned to this locus including a 1200aa protein and a 2,069 aa (XP_022239675.1) protein. 
The latter is aligned completely to scaffold61 from 698,453 to 840,581 bp with >90% identity. Please see the related line to this alignment from GFF file. SCAffold61 blastx protein_match 698453 840581 730 - . ID=SCAffold61:hit:1861:3.10.0.0;Name=XP_022239675.1 However, in evidence-based run, this gene is split into two fragments. Please see the related lines from GFF file as follow: SCAffold61 maker gene 708052 717805 . - . ID=maker-SCAffold61-exonerate_est2genome-gene-0.95;Name=maker-SCAffold61-exonerate_est2genome-gene-0.95 SCAffold61 maker gene 748651 770415 . - . ID=maker-SCAffold61-exonerate_est2genome-gene-0.96;Name=maker-SCAffold61-exonerate_est2genome-gene-0.96 It looks like Exonerate prediction is based only on ESTs which are fragmented and the full-length protein aligned to this locus is completely ignored. We have seen this type of priority for ESTs in other loci also resulting in split gene prediction (sometime 3 to 4 fragments) in spite of alignment of longer full-length proteins to the assembly. Our ESTs (Trinity assembled RNAseq transcripts) were generated from the same individual whose genome was sequenced (and hence the identify is close to 100%). If we align only proteins, Exonerate still splits the gene based on shorter proteins aligned to the locus. I would really appreciate if you can help us to solve this splitting of genes despite alignment of full-length proteins to the assembly. I will be glad to send all reference protein and transcript sequences used for annotation, if required. Thanks for your time and help. Best regards, Prashant Shingate, PhD :: Research Fellow :: Comparative and Medical Genomics Lab :: Institute of Molecular and Cell Biology (IMCB) :: Agency for Science, Technology and Research (A*STAR) 61 Biopolis Drive :: #05-04 Proteos :: Singapore 138673 :: DID (+65) 6586 9570 :: Fax (+65) 6779 1117:: http://www.imcb.a-star.edu.sg/ We advance science and develop innovative technology to further economic growth and improve lives. 
From: Carson Holt [mailto:carsonhh at gmail.com] Sent: Tuesday, 18 December, 2018 1:38 AM To: Prashant Narendra SHINGATE Cc: Byrappa VENKATESH; maker-devel at yandell-lab.org Subject: Re: About split genes in MAKER annotation It?s best to look at these in a browser like Apollo where you can also manipulate the intron/exon structure. What you will often find is that there is something that breaks the ORF or breaks splicing, so the predictors can?t build an end to end model even with the hints given. If you have a GFF3 just for the contig, I can also look at it in a browser to help point out the logic that lead to the model. ?Carson On Dec 12, 2018, at 3:29 AM, Prashant Narendra SHINGATE > wrote: Hi Carson, I am Prashant a Bioinformatics postdoctoral fellow from Prof B Venkatesh?s lab, IMCB, A*STAR. I am using MAKER-tool to annotate an invertebrate genome (~2Gb). During annotation process, we found several instances of split geneseven though we have full-length reference protein sequences from very closely related species. Hence we decided to look at one of the loci to understand the reason behind it and to optimize the parameters. We looked at a gene ~110kb long and codes for a ~1200 amino acid protein. We have a highly identical reference protein (>90% identity and 100% coverage) from another species. In addition we also have a high coverage Trinitytranscript assembly from our species. Still, this gene is split into 4 fragments during evidence-based MAKER run. On closer a look, we found that the above mentioned closely related protein is not aligned by exonerate (protein2genome) even though it is the closest protein to this gene in our dataset. It looks like the program is giving more weightage to transcripts which are typically fragments of the gene. So we are at a loss as to how to predict this gene in full. For your reference, I am herewith enclosing maker_opts.ctl file and maker_bopts.ctl. 
I will be glad to share the scaffold sequence and other input files if required. Can you please help me to understand the reason behind MAKER not able to use the full-length reference protein for gene prediction and how we can overcome this problem. Thanks for your time and help. Best regards, Prashant Shingate, PhD :: Research Fellow :: Comparative and Medical Genomics Lab :: Institute of Molecular and Cell Biology (IMCB) :: Agency for Science, Technology and Research (A*STAR) 61 Biopolis Drive :: #05-04 Proteos :: Singapore 138673 :: DID (+65) 6586 9570 :: Fax (+65) 6779 1117:: http://www.imcb.a-star.edu.sg/ We advance science and develop innovative technology to further economic growth and improve lives. Note: This message may contain confidential information. If this Email/Fax has been sent to you by mistake, please notify the sender and delete it immediately. Thank you. Note: This message may contain confidential information. If this Email/Fax has been sent to you by mistake, please notify the sender and delete it immediately. Thank you. Note: This message may contain confidential information. If this Email/Fax has been sent to you by mistake, please notify the sender and delete it immediately. Thank you. -------------- next part -------------- An HTML attachment was scrubbed... 
URL:

From carsonhh at gmail.com Wed Dec 19 10:14:25 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 19 Dec 2018 10:14:25 -0700
Subject: [maker-devel] About split genes in MAKER annotation
In-Reply-To: <75508AB460A77C4798EC49425637E29264DC9E36@PETREL-MA.imcb.a-star.edu.sg>
References: <75508AB460A77C4798EC49425637E29264DC6854@PETREL-MA.imcb.a-star.edu.sg> <2C242CC7-B957-4E5D-974E-71D85E67D7B0@gmail.com> <75508AB460A77C4798EC49425637E29264DC97B7@PETREL-MA.imcb.a-star.edu.sg> <75508AB460A77C4798EC49425637E29264DC97F5@PETREL-MA.imcb.a-star.edu.sg> <3E3C2D15-C733-448B-A317-23696AEF6809@gmail.com> <75508AB460A77C4798EC49425637E29264DC9E36@PETREL-MA.imcb.a-star.edu.sg>
Message-ID:

Given that you have plenty of protein evidence aligning well, I would suggest you use protein2genome only, and not est2genome, to build your training set (your est2genome results are more fragmented). You can further filter for canonical start and stop codons as well as protein completeness, which will be in the score column in the GFF3. Once trained, you run MAKER again with augustus_species= set to your newly trained species. MAKER will run Augustus for you and generate the hints for Augustus using the est= and protein= files you provide.

--Carson
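
The two-pass setup suggested in the reply above can be sketched as edits to maker_opts.ctl. What follows is a minimal Python sketch and not a MAKER utility. It assumes the control file consists of "key=value #comment" lines and that the option names est2genome, protein2genome and augustus_species are spelled as in the messages. The helper name set_ctl_options and the species name my_species are hypothetical placeholders. Switching both est2genome and protein2genome off in the second pass is one reading of the earlier advice not to use them as final models.

```python
import re

def set_ctl_options(ctl_text, overrides):
    """Return ctl_text with the given key=value options replaced.

    Lines are assumed to look like 'key=value #comment'; the trailing
    comment is preserved. Keys that are not present in the file raise
    an error instead of being appended silently.
    """
    remaining = dict(overrides)
    out = []
    for line in ctl_text.splitlines():
        match = re.match(r"^(\w+)=([^#]*)(#.*)?$", line)
        if match and match.group(1) in remaining:
            key = match.group(1)
            comment = match.group(3) or ""
            value = str(remaining.pop(key))
            line = f"{key}={value} {comment}".rstrip()
        out.append(line)
    if remaining:
        raise KeyError(f"options not found in control file: {sorted(remaining)}")
    return "\n".join(out) + "\n"

# Pass 1: build rough training models from protein alignments only.
PASS1 = {"est2genome": 0, "protein2genome": 1}

# Pass 2: turn the evidence-only models off and use the trained predictor.
# 'my_species' is a placeholder, not a name taken from the thread.
PASS2 = {"est2genome": 0, "protein2genome": 0, "augustus_species": "my_species"}

if __name__ == "__main__":
    # Three illustrative lines in the assumed 'key=value #comment' layout.
    example = (
        "augustus_species= #Augustus gene prediction species model\n"
        "est2genome=1 #infer gene predictions directly from ESTs\n"
        "protein2genome=1 #infer predictions from protein homology\n"
    )
    print(set_ctl_options(example, PASS1), end="")
    print(set_ctl_options(example, PASS2), end="")
```

Raising an error for unknown keys is a deliberate choice in this sketch: a misspelled option name would otherwise leave the run silently unchanged.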
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From carsonhh at gmail.com Mon Dec 17 10:38:10 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 17 Dec 2018 10:38:10 -0700
Subject: [maker-devel] About split genes in MAKER annotation
In-Reply-To: <75508AB460A77C4798EC49425637E29264DC6854@PETREL-MA.imcb.a-star.edu.sg>
References: <75508AB460A77C4798EC49425637E29264DC6854@PETREL-MA.imcb.a-star.edu.sg>
Message-ID: <2C242CC7-B957-4E5D-974E-71D85E67D7B0@gmail.com>

It's best to look at these in a browser like Apollo, where you can also manipulate the intron/exon structure. What you will often find is that there is something that breaks the ORF or breaks splicing, so the predictors can't build an end-to-end model even with the hints given. If you have a GFF3 just for the contig, I can also look at it in a browser to help point out the logic that led to the model.

--Carson
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From prashantns at imcb.a-star.edu.sg Wed Dec 12 03:29:17 2018
From: prashantns at imcb.a-star.edu.sg (Prashant Narendra SHINGATE)
Date: Wed, 12 Dec 2018 10:29:17 +0000
Subject: [maker-devel] About split genes in MAKER annotation
In-Reply-To:
References:
Message-ID: <75508AB460A77C4798EC49425637E29264DC6854@PETREL-MA.imcb.a-star.edu.sg>

Hi Carson,

I am Prashant, a bioinformatics postdoctoral fellow from Prof B Venkatesh's lab, IMCB, A*STAR. I am using the MAKER tool to annotate an invertebrate genome (~2 Gb). During the annotation process, we found several instances of split genes even though we have full-length reference protein sequences from very closely related species. Hence we decided to look at one of the loci to understand the reason behind it and to optimize the parameters.

We looked at a gene that is ~110 kb long and codes for a ~1200 amino acid protein. We have a highly identical reference protein (>90% identity and 100% coverage) from another species. In addition, we also have a high-coverage Trinity transcript assembly from our species. Still, this gene is split into 4 fragments during the evidence-based MAKER run. On a closer look, we found that the above-mentioned closely related protein is not aligned by exonerate (protein2genome) even though it is the closest protein to this gene in our dataset. It looks like the program is giving more weight to transcripts, which are typically fragments of the gene. So we are at a loss as to how to predict this gene in full.

For your reference, I am enclosing the maker_opts.ctl and maker_bopts.ctl files. I will be glad to share the scaffold sequence and other input files if required.

Can you please help me understand why MAKER is not able to use the full-length reference protein for gene prediction, and how we can overcome this problem?

Thanks for your time and help.

Best regards,

Prashant Shingate, PhD :: Research Fellow :: Comparative and Medical Genomics Lab :: Institute of Molecular and Cell Biology (IMCB) :: Agency for Science, Technology and Research (A*STAR)
61 Biopolis Drive :: #05-04 Proteos :: Singapore 138673 :: DID (+65) 6586 9570 :: Fax (+65) 6779 1117 :: http://www.imcb.a-star.edu.sg/
We advance science and develop innovative technology to further economic growth and improve lives.

Note: This message may contain confidential information. If this Email/Fax has been sent to you by mistake, please notify the sender and delete it immediately. Thank you.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.ctl
Type: application/octet-stream
Size: 5147 bytes
Desc: maker_opts.ctl
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.ctl Type: application/octet-stream Size: 5147 bytes Desc: maker_opts.ctl URL: From carsonhh at gmail.com Tue Dec 18 08:14:40 2018 From: carsonhh at gmail.com (Carson Holt) Date: Tue, 18 Dec 2018 08:14:40 -0700 Subject: [maker-devel] About split genes in MAKER annotation In-Reply-To: <75508AB460A77C4798EC49425637E29264DC97F5@PETREL-MA.imcb.a-star.edu.sg> References: <75508AB460A77C4798EC49425637E29264DC6854@PETREL-MA.imcb.a-star.edu.sg> <2C242CC7-B957-4E5D-974E-71D85E67D7B0@gmail.com> <75508AB460A77C4798EC49425637E29264DC97B7@PETREL-MA.imcb.a-star.edu.sg> <75508AB460A77C4798EC49425637E29264DC97F5@PETREL-MA.imcb.a-star.edu.sg> Message-ID: <3E3C2D15-C733-448B-A317-23696AEF6809@gmail.com> You are using est2genome and protein2genome. It is not doing gene prediction, rather it?s just tiling EST?s or trying to turn protein alignments directly into rough models to be used as training sets. That is why the gene is split, because there is no long transcript alignment, just two alignments that are cut and pasted directly from exonerate down onto the assembly. You should not use est2genome or protein2genome as final models. You need to train SNAP or Augustus. See Documentation ?> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_WGS_Assembly_and_Annotation_Winter_School_2018#Training_ab_initio_Gene_Predictors ?Carson > On Dec 18, 2018, at 12:23 AM, Prashant Narendra SHINGATE wrote: > > Hi Carson, > > Thank you for the reply. > > Please find attached the GFF file of Scaffold61 along with related ctl files. The coordinates of the gene I am referring to is from 698,453 to 840,581 bp. Several proteins, short and long are aligned to this locus including a 1200aa protein and a 2,069 aa (XP_022239675.1) protein. The latter is aligned completely to scaffold61 from 698,453 to 840,581 bp with >90% identity. Please see the related line to this alignment from GFF file. > > SCAffold61 blastx protein_match 698453 840581 730 - . 
ID=SCAffold61:hit:1861:3.10.0.0;Name=XP_022239675.1 > > However, in evidence-based run, this gene is split into two fragments. Please see the related lines from GFF file as follow: > > SCAffold61 maker gene 708052 717805 . - . ID=maker-SCAffold61-exonerate_est2genome-gene-0.95;Name=maker-SCAffold61-exonerate_est2genome-gene-0.95 > SCAffold61 maker gene 748651 770415 . - . ID=maker-SCAffold61-exonerate_est2genome-gene-0.96;Name=maker-SCAffold61-exonerate_est2genome-gene-0.96 > > It looks like Exonerate prediction is based only on ESTs which are fragmented and the full-length protein aligned to this locus is completely ignored. We have seen this type of priority for ESTs in other loci also resulting in split gene prediction (sometime 3 to 4 fragments) in spite of alignment of longer full-length proteins to the assembly. Our ESTs (Trinity assembled RNAseq transcripts) were generated from the same individual whose genome was sequenced (and hence the identify is close to 100%). If we align only proteins, Exonerate still splits the gene based on shorter proteins aligned to the locus. I would really appreciate if you can help us to solve this splitting of genes despite alignment of full-length proteins to the assembly. > > I will be glad to send all reference protein and transcript sequences used for annotation, if required. > > Thanks for your time and help. > > Best regards, > > Prashant Shingate, PhD :: Research Fellow :: Comparative and Medical Genomics Lab :: Institute of Molecular and Cell Biology (IMCB) :: Agency for Science, Technology and Research (A*STAR) > 61 Biopolis Drive :: #05-04 Proteos :: Singapore 138673 :: DID (+65) 6586 9570 :: Fax (+65) 6779 1117 :: http://www.imcb.a-star.edu.sg/ > We advance science and develop innovative technology to further economic growth and improve lives. 
> > > From: Carson Holt [mailto:carsonhh at gmail.com ] > Sent: Tuesday, 18 December, 2018 1:38 AM > To: Prashant Narendra SHINGATE > Cc: Byrappa VENKATESH; maker-devel at yandell-lab.org > Subject: Re: About split genes in MAKER annotation > > It?s best to look at these in a browser like Apollo where you can also manipulate the intron/exon structure. What you will often find is that there is something that breaks the ORF or breaks splicing, so the predictors can?t build an end to end model even with the hints given. If you have a GFF3 just for the contig, I can also look at it in a browser to help point out the logic that lead to the model. > > ?Carson > > On Dec 12, 2018, at 3:29 AM, Prashant Narendra SHINGATE > wrote: > > Hi Carson, > > I am Prashant a Bioinformatics postdoctoral fellow from Prof B Venkatesh?s lab, IMCB, A*STAR. I am using MAKER-tool to annotate an invertebrate genome (~2Gb). During annotation process, we found several instances of split geneseven though we have full-length reference protein sequences from very closely related species. Hence we decided to look at one of the loci to understand the reason behind it and to optimize the parameters. > > We looked at a gene ~110kb long and codes for a ~1200 amino acid protein. We have a highly identical reference protein (>90% identity and 100% coverage) from another species. In addition we also have a high coverage Trinitytranscript assembly from our species. Still, this gene is split into 4 fragments during evidence-based MAKER run. On closer a look, we found that the above mentioned closely related protein is not aligned by exonerate (protein2genome) even though it is the closest protein to this gene in our dataset. It looks like the program is giving more weightage to transcripts which are typically fragments of the gene. So we are at a loss as to how to predict this gene in full. > > For your reference, I am herewith enclosing maker_opts.ctl file and maker_bopts.ctl. 
I will be glad to share the scaffold sequence and other input files if required. > > Can you please help me to understand the reason behind MAKER not able to use the full-length reference protein for gene prediction and how we can overcome this problem. > > Thanks for your time and help. > > Best regards, > > Prashant Shingate, PhD :: Research Fellow :: Comparative and Medical Genomics Lab :: Institute of Molecular and Cell Biology (IMCB) :: Agency for Science, Technology and Research (A*STAR) > 61 Biopolis Drive :: #05-04 Proteos :: Singapore 138673 :: DID (+65) 6586 9570 :: Fax (+65) 6779 1117 :: http://www.imcb.a-star.edu.sg/ > We advance science and develop innovative technology to further economic growth and improve lives. > > > > > Note: This message may contain confidential information. If this Email/Fax has been sent to you by mistake, please notify the sender and delete it immediately. Thank you. > > > > > Note: This message may contain confidential information. If this Email/Fax has been sent to you by mistake, please notify the sender and delete it immediately. Thank you. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From prashantns at imcb.a-star.edu.sg Tue Dec 18 00:23:39 2018 From: prashantns at imcb.a-star.edu.sg (Prashant Narendra SHINGATE) Date: Tue, 18 Dec 2018 07:23:39 +0000 Subject: [maker-devel] About split genes in MAKER annotation In-Reply-To: References: <75508AB460A77C4798EC49425637E29264DC6854@PETREL-MA.imcb.a-star.edu.sg> <2C242CC7-B957-4E5D-974E-71D85E67D7B0@gmail.com> <75508AB460A77C4798EC49425637E29264DC97B7@PETREL-MA.imcb.a-star.edu.sg> Message-ID: <75508AB460A77C4798EC49425637E29264DC97F5@PETREL-MA.imcb.a-star.edu.sg> Hi Carson, Thank you for the reply. Please find attached the GFF file of Scaffold61 along with related ctl files. The coordinates of the gene I am referring to is from 698,453 to 840,581 bp. 
Several proteins, short and long are aligned to this locus including a 1200aa protein and a 2,069 aa (XP_022239675.1) protein. The latter is aligned completely to scaffold61 from 698,453 to 840,581 bp with >90% identity. Please see the related line to this alignment from GFF file. SCAffold61 blastx protein_match 698453 840581 730 - . ID=SCAffold61:hit:1861:3.10.0.0;Name=XP_022239675.1 However, in evidence-based run, this gene is split into two fragments. Please see the related lines from GFF file as follow: SCAffold61 maker gene 708052 717805 . - . ID=maker-SCAffold61-exonerate_est2genome-gene-0.95;Name=maker-SCAffold61-exonerate_est2genome-gene-0.95 SCAffold61 maker gene 748651 770415 . - . ID=maker-SCAffold61-exonerate_est2genome-gene-0.96;Name=maker-SCAffold61-exonerate_est2genome-gene-0.96 It looks like Exonerate prediction is based only on ESTs which are fragmented and the full-length protein aligned to this locus is completely ignored. We have seen this type of priority for ESTs in other loci also resulting in split gene prediction (sometime 3 to 4 fragments) in spite of alignment of longer full-length proteins to the assembly. Our ESTs (Trinity assembled RNAseq transcripts) were generated from the same individual whose genome was sequenced (and hence the identify is close to 100%). If we align only proteins, Exonerate still splits the gene based on shorter proteins aligned to the locus. I would really appreciate if you can help us to solve this splitting of genes despite alignment of full-length proteins to the assembly. I will be glad to send all reference protein and transcript sequences used for annotation, if required. Thanks for your time and help. 
Best regards, Prashant Shingate, PhD :: Research Fellow :: Comparative and Medical Genomics Lab :: Institute of Molecular and Cell Biology (IMCB) :: Agency for Science, Technology and Research (A*STAR) 61 Biopolis Drive :: #05-04 Proteos :: Singapore 138673 :: DID (+65) 6586 9570 :: Fax (+65) 6779 1117:: http://www.imcb.a-star.edu.sg/ We advance science and develop innovative technology to further economic growth and improve lives. From: Carson Holt [mailto:carsonhh at gmail.com] Sent: Tuesday, 18 December, 2018 1:38 AM To: Prashant Narendra SHINGATE Cc: Byrappa VENKATESH; maker-devel at yandell-lab.org Subject: Re: About split genes in MAKER annotation It's best to look at these in a browser like Apollo where you can also manipulate the intron/exon structure. What you will often find is that there is something that breaks the ORF or breaks splicing, so the predictors can't build an end to end model even with the hints given. If you have a GFF3 just for the contig, I can also look at it in a browser to help point out the logic that lead to the model. --Carson On Dec 12, 2018, at 3:29 AM, Prashant Narendra SHINGATE > wrote: Hi Carson, I am Prashant a Bioinformatics postdoctoral fellow from Prof B Venkatesh's lab, IMCB, A*STAR. I am using MAKER-tool to annotate an invertebrate genome (~2Gb). During annotation process, we found several instances of split genes even though we have full-length reference protein sequences from very closely related species. Hence we decided to look at one of the loci to understand the reason behind it and to optimize the parameters. We looked at a gene ~110kb long and codes for a ~1200 amino acid protein. We have a highly identical reference protein (>90% identity and 100% coverage) from another species. In addition we also have a high coverage Trinity transcript assembly from our species. Still, this gene is split into 4 fragments during evidence-based MAKER run. 
On closer a look, we found that the above mentioned closely related protein is not aligned by exonerate (protein2genome) even though it is the closest protein to this gene in our dataset. It looks like the program is giving more weightage to transcripts which are typically fragments of the gene. So we are at a loss as to how to predict this gene in full. For your reference, I am herewith enclosing maker_opts.ctl file and maker_bopts.ctl. I will be glad to share the scaffold sequence and other input files if required. Can you please help me to understand the reason behind MAKER not able to use the full-length reference protein for gene prediction and how we can overcome this problem. Thanks for your time and help. Best regards, Prashant Shingate, PhD :: Research Fellow :: Comparative and Medical Genomics Lab :: Institute of Molecular and Cell Biology (IMCB) :: Agency for Science, Technology and Research (A*STAR) 61 Biopolis Drive :: #05-04 Proteos :: Singapore 138673 :: DID (+65) 6586 9570 :: Fax (+65) 6779 1117:: http://www.imcb.a-star.edu.sg/ We advance science and develop innovative technology to further economic growth and improve lives. Note: This message may contain confidential information. If this Email/Fax has been sent to you by mistake, please notify the sender and delete it immediately. Thank you. Note: This message may contain confidential information. If this Email/Fax has been sent to you by mistake, please notify the sender and delete it immediately. Thank you. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: maker_bopts.ctl Type: application/octet-stream Size: 1416 bytes Desc: maker_bopts.ctl URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: maker_opts.ctl Type: application/octet-stream Size: 5150 bytes Desc: maker_opts.ctl URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: SCAffold61.zip Type: application/x-zip-compressed Size: 439697 bytes Desc: SCAffold61.zip URL: From prashantns at imcb.a-star.edu.sg Tue Dec 18 20:20:13 2018 From: prashantns at imcb.a-star.edu.sg (Prashant Narendra SHINGATE) Date: Wed, 19 Dec 2018 03:20:13 +0000 Subject: [maker-devel] About split genes in MAKER annotation In-Reply-To: <3E3C2D15-C733-448B-A317-23696AEF6809@gmail.com> References: <75508AB460A77C4798EC49425637E29264DC6854@PETREL-MA.imcb.a-star.edu.sg> <2C242CC7-B957-4E5D-974E-71D85E67D7B0@gmail.com> <75508AB460A77C4798EC49425637E29264DC97B7@PETREL-MA.imcb.a-star.edu.sg> <75508AB460A77C4798EC49425637E29264DC97F5@PETREL-MA.imcb.a-star.edu.sg> <3E3C2D15-C733-448B-A317-23696AEF6809@gmail.com> Message-ID: <75508AB460A77C4798EC49425637E29264DC9E36@PETREL-MA.imcb.a-star.edu.sg> Dear Carson, Thanks for clarifying that I should not base gene models on evidence based prediction but train AUGUSTUS. I will carry out AUGUSTUS training using rough gene models predicted in evidence based run and also follow entire annotation protocol. I am assuming that even though Exonerate splits genes based on aligned ESTs/proteins, AUGUSTUS will be able to predict full-length genes. I need your suggestion on AUGUSTUS training. We have ~10,000 full-length transcripts and corresponding proteins (besides a large number of fragments) generated by assembling RNAseq transcripts from the same individual used for genome sequencing. However, I understand from AUGUSTUS training tutorial that ~1000 gene models are good enough to train AUGUSTUS. So is it good idea to choose rough gene models predicted based on 10,000 full-length transcripts/proteins for training AUGUSTUS or should I randomly pick 1000 gene models from the entire evidence based run? 
Also about hint files used as input for AUGUSTUS, should I include BLAST alignments of transcripts (BLASTN) and proteins (BLASTX) in hint files or only exonerate alignments are recommended in hint files? Please clarify. Thanks once again for your help and time. Best Regards, Prashant Shingate, PhD :: Research Fellow :: Comparative and Medical Genomics Lab :: Institute of Molecular and Cell Biology (IMCB) :: Agency for Science, Technology and Research (A*STAR) 61 Biopolis Drive :: #05-04 Proteos :: Singapore 138673 :: DID (+65) 6586 9570 :: Fax (+65) 6779 1117:: http://www.imcb.a-star.edu.sg/ We advance science and develop innovative technology to further economic growth and improve lives. From: Carson Holt [mailto:carsonhh at gmail.com] Sent: Tuesday, 18 December, 2018 11:15 PM To: Prashant Narendra SHINGATE Cc: Byrappa VENKATESH; maker-devel at yandell-lab.org Subject: Re: About split genes in MAKER annotation You are using est2genome and protein2genome. It is not doing gene prediction, rather it's just tiling EST's or trying to turn protein alignments directly into rough models to be used as training sets. That is why the gene is split, because there is no long transcript alignment, just two alignments that are cut and pasted directly from exonerate down onto the assembly. You should not use est2genome or protein2genome as final models. You need to train SNAP or Augustus. See Documentation --> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_WGS_Assembly_and_Annotation_Winter_School_2018#Training_ab_initio_Gene_Predictors --Carson On Dec 18, 2018, at 12:23 AM, Prashant Narendra SHINGATE > wrote: Hi Carson, Thank you for the reply. Please find attached the GFF file of Scaffold61 along with related ctl files. The coordinates of the gene I am referring to is from 698,453 to 840,581 bp. Several proteins, short and long are aligned to this locus including a 1200aa protein and a 2,069 aa (XP_022239675.1) protein. 
The latter is aligned completely to scaffold61 from 698,453 to 840,581 bp with >90% identity. Please see the related line to this alignment from GFF file. SCAffold61 blastx protein_match 698453 840581 730 - . ID=SCAffold61:hit:1861:3.10.0.0;Name=XP_022239675.1 However, in evidence-based run, this gene is split into two fragments. Please see the related lines from GFF file as follow: SCAffold61 maker gene 708052 717805 . - . ID=maker-SCAffold61-exonerate_est2genome-gene-0.95;Name=maker-SCAffold61-exonerate_est2genome-gene-0.95 SCAffold61 maker gene 748651 770415 . - . ID=maker-SCAffold61-exonerate_est2genome-gene-0.96;Name=maker-SCAffold61-exonerate_est2genome-gene-0.96 It looks like Exonerate prediction is based only on ESTs which are fragmented and the full-length protein aligned to this locus is completely ignored. We have seen this type of priority for ESTs in other loci also resulting in split gene prediction (sometime 3 to 4 fragments) in spite of alignment of longer full-length proteins to the assembly. Our ESTs (Trinity assembled RNAseq transcripts) were generated from the same individual whose genome was sequenced (and hence the identify is close to 100%). If we align only proteins, Exonerate still splits the gene based on shorter proteins aligned to the locus. I would really appreciate if you can help us to solve this splitting of genes despite alignment of full-length proteins to the assembly. I will be glad to send all reference protein and transcript sequences used for annotation, if required. Thanks for your time and help. Best regards, Prashant Shingate, PhD :: Research Fellow :: Comparative and Medical Genomics Lab :: Institute of Molecular and Cell Biology (IMCB) :: Agency for Science, Technology and Research (A*STAR) 61 Biopolis Drive :: #05-04 Proteos :: Singapore 138673 :: DID (+65) 6586 9570 :: Fax (+65) 6779 1117:: http://www.imcb.a-star.edu.sg/ We advance science and develop innovative technology to further economic growth and improve lives. 
From: Carson Holt [mailto:carsonhh at gmail.com] Sent: Tuesday, 18 December, 2018 1:38 AM To: Prashant Narendra SHINGATE Cc: Byrappa VENKATESH; maker-devel at yandell-lab.org Subject: Re: About split genes in MAKER annotation It's best to look at these in a browser like Apollo where you can also manipulate the intron/exon structure. What you will often find is that there is something that breaks the ORF or breaks splicing, so the predictors can't build an end to end model even with the hints given. If you have a GFF3 just for the contig, I can also look at it in a browser to help point out the logic that lead to the model. --Carson On Dec 12, 2018, at 3:29 AM, Prashant Narendra SHINGATE > wrote: Hi Carson, I am Prashant a Bioinformatics postdoctoral fellow from Prof B Venkatesh's lab, IMCB, A*STAR. I am using MAKER-tool to annotate an invertebrate genome (~2Gb). During annotation process, we found several instances of split genes even though we have full-length reference protein sequences from very closely related species. Hence we decided to look at one of the loci to understand the reason behind it and to optimize the parameters. We looked at a gene ~110kb long and codes for a ~1200 amino acid protein. We have a highly identical reference protein (>90% identity and 100% coverage) from another species. In addition we also have a high coverage Trinity transcript assembly from our species. Still, this gene is split into 4 fragments during evidence-based MAKER run. On closer a look, we found that the above mentioned closely related protein is not aligned by exonerate (protein2genome) even though it is the closest protein to this gene in our dataset. It looks like the program is giving more weightage to transcripts which are typically fragments of the gene. So we are at a loss as to how to predict this gene in full. For your reference, I am herewith enclosing maker_opts.ctl file and maker_bopts.ctl. 
I will be glad to share the scaffold sequence and other input files if required. Can you please help me to understand the reason behind MAKER not able to use the full-length reference protein for gene prediction and how we can overcome this problem. Thanks for your time and help. Best regards, Prashant Shingate, PhD :: Research Fellow :: Comparative and Medical Genomics Lab :: Institute of Molecular and Cell Biology (IMCB) :: Agency for Science, Technology and Research (A*STAR) 61 Biopolis Drive :: #05-04 Proteos :: Singapore 138673 :: DID (+65) 6586 9570 :: Fax (+65) 6779 1117:: http://www.imcb.a-star.edu.sg/ We advance science and develop innovative technology to further economic growth and improve lives. Note: This message may contain confidential information. If this Email/Fax has been sent to you by mistake, please notify the sender and delete it immediately. Thank you. Note: This message may contain confidential information. If this Email/Fax has been sent to you by mistake, please notify the sender and delete it immediately. Thank you. Note: This message may contain confidential information. If this Email/Fax has been sent to you by mistake, please notify the sender and delete it immediately. Thank you. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From carsonhh at gmail.com Wed Dec 19 10:14:25 2018 From: carsonhh at gmail.com (Carson Holt) Date: Wed, 19 Dec 2018 10:14:25 -0700 Subject: [maker-devel] About split genes in MAKER annotation In-Reply-To: <75508AB460A77C4798EC49425637E29264DC9E36@PETREL-MA.imcb.a-star.edu.sg> References: <75508AB460A77C4798EC49425637E29264DC6854@PETREL-MA.imcb.a-star.edu.sg> <2C242CC7-B957-4E5D-974E-71D85E67D7B0@gmail.com> <75508AB460A77C4798EC49425637E29264DC97B7@PETREL-MA.imcb.a-star.edu.sg> <75508AB460A77C4798EC49425637E29264DC97F5@PETREL-MA.imcb.a-star.edu.sg> <3E3C2D15-C733-448B-A317-23696AEF6809@gmail.com> <75508AB460A77C4798EC49425637E29264DC9E36@PETREL-MA.imcb.a-star.edu.sg> Message-ID: Given how you have plenty of protein evidence aligning well, I would suggest you do protein2genome only and not est2genome to build your training set (your est2genome results are more fragmented). You can further filter for canonical start and stop codons as well as protein completeness which will be in the score column in the GFF3. Once trained, you run maker again with augustus_species= set to your newly trained species. MAKER will run augustus for you and generate the hints for augustus using the est= and protein= files you provide. --Carson > On Dec 18, 2018, at 8:20 PM, Prashant Narendra SHINGATE wrote: > > Dear Carson, > > Thanks for clarifying that I should not base gene models on evidence based prediction but train AUGUSTUS. I will carry out AUGUSTUS training using rough gene models predicted in evidence based run and also follow entire annotation protocol. I am assuming that even though Exonerate splits genes based on aligned ESTs/proteins, AUGUSTUS will be able to predict full-length genes. > > I need your suggestion on AUGUSTUS training. We have ~10,000 full-length transcripts and corresponding proteins (besides a large number of fragments) generated by assembling RNAseq transcripts from the same individual used for genome sequencing. 
However, I understand from AUGUSTUS training tutorial that ~1000 gene models are good enough to train AUGUSTUS. So is it good idea to choose rough gene models predicted based on 10,000 full-length transcripts/proteins for training AUGUSTUS or should I randomly pick 1000 gene models from the entire evidence based run? > > Also about hint files used as input for AUGUSTUS, should I include BLAST alignments of transcripts (BLASTN) and proteins (BLASTX) in hint files or only exonerate alignments are recommended in hint files? Please clarify. > > Thanks once again for your help and time. > > Best Regards, > > Prashant Shingate, PhD :: Research Fellow :: Comparative and Medical Genomics Lab :: Institute of Molecular and Cell Biology (IMCB) :: Agency for Science, Technology and Research (A*STAR) > 61 Biopolis Drive :: #05-04 Proteos :: Singapore 138673 :: DID (+65) 6586 9570 :: Fax (+65) 6779 1117 :: http://www.imcb.a-star.edu.sg/ > We advance science and develop innovative technology to further economic growth and improve lives. > > > From: Carson Holt [mailto:carsonhh at gmail.com ] > Sent: Tuesday, 18 December, 2018 11:15 PM > To: Prashant Narendra SHINGATE > Cc: Byrappa VENKATESH; maker-devel at yandell-lab.org > Subject: Re: About split genes in MAKER annotation > > You are using est2genome and protein2genome. It is not doing gene prediction, rather it's just tiling EST's or trying to turn protein alignments directly into rough models to be used as training sets. That is why the gene is split, because there is no long transcript alignment, just two alignments that are cut and pasted directly from exonerate down onto the assembly. You should not use est2genome or protein2genome as final models. You need to train SNAP or Augustus. 
> > See Documentation --> > http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_WGS_Assembly_and_Annotation_Winter_School_2018#Training_ab_initio_Gene_Predictors > > --Carson > > > > On Dec 18, 2018, at 12:23 AM, Prashant Narendra SHINGATE > wrote: > > Hi Carson, > > Thank you for the reply. > > Please find attached the GFF file of Scaffold61 along with related ctl files. The coordinates of the gene I am referring to is from 698,453 to 840,581 bp. Several proteins, short and long are aligned to this locus including a 1200aa protein and a 2,069 aa (XP_022239675.1) protein. The latter is aligned completely to scaffold61 from 698,453 to 840,581 bp with >90% identity. Please see the related line to this alignment from GFF file. > > SCAffold61 blastx protein_match 698453 840581 730 - . ID=SCAffold61:hit:1861:3.10.0.0;Name=XP_022239675.1 > > However, in evidence-based run, this gene is split into two fragments. Please see the related lines from GFF file as follow: > > SCAffold61 maker gene 708052 717805 . - . ID=maker-SCAffold61-exonerate_est2genome-gene-0.95;Name=maker-SCAffold61-exonerate_est2genome-gene-0.95 > SCAffold61 maker gene 748651 770415 . - . ID=maker-SCAffold61-exonerate_est2genome-gene-0.96;Name=maker-SCAffold61-exonerate_est2genome-gene-0.96 > > It looks like Exonerate prediction is based only on ESTs which are fragmented and the full-length protein aligned to this locus is completely ignored. We have seen this type of priority for ESTs in other loci also resulting in split gene prediction (sometime 3 to 4 fragments) in spite of alignment of longer full-length proteins to the assembly. Our ESTs (Trinity assembled RNAseq transcripts) were generated from the same individual whose genome was sequenced (and hence the identify is close to 100%). If we align only proteins, Exonerate still splits the gene based on shorter proteins aligned to the locus. 
I would really appreciate if you can help us to solve this splitting of genes despite alignment of full-length proteins to the assembly. > > I will be glad to send all reference protein and transcript sequences used for annotation, if required. > > Thanks for your time and help. > > Best regards, > > Prashant Shingate, PhD :: Research Fellow :: Comparative and Medical Genomics Lab :: Institute of Molecular and Cell Biology (IMCB) :: Agency for Science, Technology and Research (A*STAR) > 61 Biopolis Drive :: #05-04 Proteos :: Singapore 138673 :: DID (+65) 6586 9570 :: Fax (+65) 6779 1117 :: http://www.imcb.a-star.edu.sg/ > We advance science and develop innovative technology to further economic growth and improve lives. > > > From: Carson Holt [mailto:carsonhh at gmail.com ] > Sent: Tuesday, 18 December, 2018 1:38 AM > To: Prashant Narendra SHINGATE > Cc: Byrappa VENKATESH; maker-devel at yandell-lab.org > Subject: Re: About split genes in MAKER annotation > > It's best to look at these in a browser like Apollo where you can also manipulate the intron/exon structure. What you will often find is that there is something that breaks the ORF or breaks splicing, so the predictors can't build an end to end model even with the hints given. If you have a GFF3 just for the contig, I can also look at it in a browser to help point out the logic that lead to the model. > > --Carson > > On Dec 12, 2018, at 3:29 AM, Prashant Narendra SHINGATE > wrote: > > Hi Carson, > > I am Prashant a Bioinformatics postdoctoral fellow from Prof B Venkatesh's lab, IMCB, A*STAR. I am using MAKER-tool to annotate an invertebrate genome (~2Gb). During annotation process, we found several instances of split genes even though we have full-length reference protein sequences from very closely related species. Hence we decided to look at one of the loci to understand the reason behind it and to optimize the parameters. > > We looked at a gene ~110kb long and codes for a ~1200 amino acid protein. 
We have a highly identical reference protein (>90% identity and 100% coverage) from another species. In addition we also have a high coverage Trinity transcript assembly from our species. Still, this gene is split into 4 fragments during evidence-based MAKER run. On closer a look, we found that the above mentioned closely related protein is not aligned by exonerate (protein2genome) even though it is the closest protein to this gene in our dataset. It looks like the program is giving more weightage to transcripts which are typically fragments of the gene. So we are at a loss as to how to predict this gene in full. > > For your reference, I am herewith enclosing maker_opts.ctl file and maker_bopts.ctl. I will be glad to share the scaffold sequence and other input files if required. > > Can you please help me to understand the reason behind MAKER not able to use the full-length reference protein for gene prediction and how we can overcome this problem. > > Thanks for your time and help. > > Best regards, > > Prashant Shingate, PhD :: Research Fellow :: Comparative and Medical Genomics Lab :: Institute of Molecular and Cell Biology (IMCB) :: Agency for Science, Technology and Research (A*STAR) > 61 Biopolis Drive :: #05-04 Proteos :: Singapore 138673 :: DID (+65) 6586 9570 :: Fax (+65) 6779 1117 :: http://www.imcb.a-star.edu.sg/ > We advance science and develop innovative technology to further economic growth and improve lives. > > > > > Note: This message may contain confidential information. If this Email/Fax has been sent to you by mistake, please notify the sender and delete it immediately. Thank you. > > > > > Note: This message may contain confidential information. If this Email/Fax has been sent to you by mistake, please notify the sender and delete it immediately. Thank you. > > > > > Note: This message may contain confidential information. If this Email/Fax has been sent to you by mistake, please notify the sender and delete it immediately. Thank you. 
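Note on the thread above (a sketch, not part of any poster's message): the split can be checked directly from the three GFF3 lines quoted in this thread by listing the MAKER gene models that lie inside the XP_022239675.1 protein_match span on the same strand. The helper names parse_gff3 and genes_within_hit are hypothetical and are not part of MAKER; the quoted lines are assumed to be tab-separated, as GFF3 requires.

```python
# Minimal sketch: which MAKER gene models fall inside one protein alignment?
# The records are the GFF3 lines quoted in the thread (tabs assumed).
# parse_gff3 and genes_within_hit are illustrative helpers, not MAKER functions.

GFF3_TEXT = "\n".join([
    "SCAffold61\tblastx\tprotein_match\t698453\t840581\t730\t-\t.\t"
    "ID=SCAffold61:hit:1861:3.10.0.0;Name=XP_022239675.1",
    "SCAffold61\tmaker\tgene\t708052\t717805\t.\t-\t.\t"
    "ID=maker-SCAffold61-exonerate_est2genome-gene-0.95;"
    "Name=maker-SCAffold61-exonerate_est2genome-gene-0.95",
    "SCAffold61\tmaker\tgene\t748651\t770415\t.\t-\t.\t"
    "ID=maker-SCAffold61-exonerate_est2genome-gene-0.96;"
    "Name=maker-SCAffold61-exonerate_est2genome-gene-0.96",
])


def parse_gff3(text):
    """Parse 9-column GFF3 feature lines into dictionaries."""
    records = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/pragma lines
        seqid, source, ftype, start, end, score, strand, phase, attrs = line.split("\t")
        attributes = dict(item.split("=", 1) for item in attrs.split(";") if item)
        records.append({
            "seqid": seqid, "source": source, "type": ftype,
            "start": int(start), "end": int(end),
            "strand": strand, "attributes": attributes,
        })
    return records


def genes_within_hit(records, hit_name):
    """Return IDs of gene features contained in the named protein_match."""
    gene_ids = []
    for hit in records:
        if hit["type"] != "protein_match" or hit["attributes"].get("Name") != hit_name:
            continue
        for rec in records:
            if (rec["type"] == "gene"
                    and rec["seqid"] == hit["seqid"]
                    and rec["strand"] == hit["strand"]
                    and rec["start"] >= hit["start"]
                    and rec["end"] <= hit["end"]):
                gene_ids.append(rec["attributes"]["ID"])
    return gene_ids


records = parse_gff3(GFF3_TEXT)
fragments = genes_within_hit(records, "XP_022239675.1")
for gene_id in fragments:
    print(gene_id)
```

More than one gene ID printed for a single full-length protein hit marks a candidate split locus. It says nothing about why the models are split; as Carson notes, est2genome and protein2genome models are rough training material rather than final predictions.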
-------------- next part -------------- An HTML attachment was scrubbed... 
URL: From carsonhh at gmail.com Mon Dec 17 10:38:10 2018 From: carsonhh at gmail.com (Carson Holt) Date: Mon, 17 Dec 2018 10:38:10 -0700 Subject: [maker-devel] About split genes in MAKER annotation In-Reply-To: <75508AB460A77C4798EC49425637E29264DC6854@PETREL-MA.imcb.a-star.edu.sg> References: <75508AB460A77C4798EC49425637E29264DC6854@PETREL-MA.imcb.a-star.edu.sg> Message-ID: <2C242CC7-B957-4E5D-974E-71D85E67D7B0@gmail.com> It's best to look at these in a browser like Apollo where you can also manipulate the intron/exon structure. What you will often find is that there is something that breaks the ORF or breaks splicing, so the predictors can't build an end to end model even with the hints given. If you have a GFF3 just for the contig, I can also look at it in a browser to help point out the logic that lead to the model. --Carson > On Dec 12, 2018, at 3:29 AM, Prashant Narendra SHINGATE wrote: > > Hi Carson, > > I am Prashant a Bioinformatics postdoctoral fellow from Prof B Venkatesh's lab, IMCB, A*STAR. I am using MAKER-tool to annotate an invertebrate genome (~2Gb). During annotation process, we found several instances of split genes even though we have full-length reference protein sequences from very closely related species. Hence we decided to look at one of the loci to understand the reason behind it and to optimize the parameters. > > We looked at a gene ~110kb long and codes for a ~1200 amino acid protein. We have a highly identical reference protein (>90% identity and 100% coverage) from another species. In addition we also have a high coverage Trinity transcript assembly from our species. Still, this gene is split into 4 fragments during evidence-based MAKER run. On closer a look, we found that the above mentioned closely related protein is not aligned by exonerate (protein2genome) even though it is the closest protein to this gene in our dataset. It looks like the program is giving more weightage to transcripts which are typically fragments of the gene. 
So we are at a loss as to how to predict this gene in full. > > For your reference, I am herewith enclosing maker_opts.ctl file and maker_bopts.ctl. I will be glad to share the scaffold sequence and other input files if required. > > Can you please help me to understand the reason behind MAKER not able to use the full-length reference protein for gene prediction and how we can overcome this problem. > > Thanks for your time and help. > > Best regards, > > Prashant Shingate, PhD :: Research Fellow :: Comparative and Medical Genomics Lab :: Institute of Molecular and Cell Biology (IMCB) :: Agency for Science, Technology and Research (A*STAR) > 61 Biopolis Drive :: #05-04 Proteos :: Singapore 138673 :: DID (+65) 6586 9570 :: Fax (+65) 6779 1117 :: http://www.imcb.a-star.edu.sg/ > We advance science and develop innovative technology to further economic growth and improve lives. > > > > > Note: This message may contain confidential information. If this Email/Fax has been sent to you by mistake, please notify the sender and delete it immediately. Thank you. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From prashantns at imcb.a-star.edu.sg Wed Dec 12 03:29:17 2018 From: prashantns at imcb.a-star.edu.sg (Prashant Narendra SHINGATE) Date: Wed, 12 Dec 2018 10:29:17 +0000 Subject: [maker-devel] About split genes in MAKER annotation In-Reply-To: References: Message-ID: <75508AB460A77C4798EC49425637E29264DC6854@PETREL-MA.imcb.a-star.edu.sg> Hi Carson, I am Prashant a Bioinformatics postdoctoral fellow from Prof B Venkatesh's lab, IMCB, A*STAR. I am using MAKER-tool to annotate an invertebrate genome (~2Gb). During annotation process, we found several instances of split genes even though we have full-length reference protein sequences from very closely related species. Hence we decided to look at one of the loci to understand the reason behind it and to optimize the parameters. 
We looked at a gene ~110kb long and codes for a ~1200 amino acid protein. We have a highly identical reference protein (>90% identity and 100% coverage) from another species. In addition we also have a high coverage Trinity transcript assembly from our species. Still, this gene is split into 4 fragments during evidence-based MAKER run. On closer a look, we found that the above mentioned closely related protein is not aligned by exonerate (protein2genome) even though it is the closest protein to this gene in our dataset. It looks like the program is giving more weightage to transcripts which are typically fragments of the gene. So we are at a loss as to how to predict this gene in full. For your reference, I am herewith enclosing maker_opts.ctl file and maker_bopts.ctl. I will be glad to share the scaffold sequence and other input files if required. Can you please help me to understand the reason behind MAKER not able to use the full-length reference protein for gene prediction and how we can overcome this problem. Thanks for your time and help. Best regards, Prashant Shingate, PhD :: Research Fellow :: Comparative and Medical Genomics Lab :: Institute of Molecular and Cell Biology (IMCB) :: Agency for Science, Technology and Research (A*STAR) 61 Biopolis Drive :: #05-04 Proteos :: Singapore 138673 :: DID (+65) 6586 9570 :: Fax (+65) 6779 1117:: http://www.imcb.a-star.edu.sg/ We advance science and develop innovative technology to further economic growth and improve lives. Note: This message may contain confidential information. If this Email/Fax has been sent to you by mistake, please notify the sender and delete it immediately. Thank you. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: maker_opts.ctl Type: application/octet-stream Size: 5147 bytes Desc: maker_opts.ctl URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: maker_opts.ctl
Type: application/octet-stream
Size: 5147 bytes
Desc: maker_opts.ctl
URL:

From carsonhh at gmail.com Tue Dec 18 08:14:40 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 18 Dec 2018 08:14:40 -0700
Subject: [maker-devel] About split genes in MAKER annotation
In-Reply-To: <75508AB460A77C4798EC49425637E29264DC97F5@PETREL-MA.imcb.a-star.edu.sg>
References: <75508AB460A77C4798EC49425637E29264DC6854@PETREL-MA.imcb.a-star.edu.sg> <2C242CC7-B957-4E5D-974E-71D85E67D7B0@gmail.com> <75508AB460A77C4798EC49425637E29264DC97B7@PETREL-MA.imcb.a-star.edu.sg> <75508AB460A77C4798EC49425637E29264DC97F5@PETREL-MA.imcb.a-star.edu.sg>
Message-ID: <3E3C2D15-C733-448B-A317-23696AEF6809@gmail.com>

You are using est2genome and protein2genome. With those options MAKER is not doing gene prediction; rather, it's just tiling ESTs or trying to turn protein alignments directly into rough models to be used as training sets. That is why the gene is split: there is no long transcript alignment, just two alignments that are cut and pasted directly from exonerate down onto the assembly. You should not use est2genome or protein2genome results as final models. You need to train SNAP or Augustus.

See Documentation -->
http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_WGS_Assembly_and_Annotation_Winter_School_2018#Training_ab_initio_Gene_Predictors

--Carson

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From prashantns at imcb.a-star.edu.sg Tue Dec 18 00:23:39 2018
From: prashantns at imcb.a-star.edu.sg (Prashant Narendra SHINGATE)
Date: Tue, 18 Dec 2018 07:23:39 +0000
Subject: [maker-devel] About split genes in MAKER annotation
In-Reply-To:
References: <75508AB460A77C4798EC49425637E29264DC6854@PETREL-MA.imcb.a-star.edu.sg> <2C242CC7-B957-4E5D-974E-71D85E67D7B0@gmail.com> <75508AB460A77C4798EC49425637E29264DC97B7@PETREL-MA.imcb.a-star.edu.sg>
Message-ID: <75508AB460A77C4798EC49425637E29264DC97F5@PETREL-MA.imcb.a-star.edu.sg>

Hi Carson,

Thank you for the reply.

Please find attached the GFF file of Scaffold61 along with the related ctl files. The gene I am referring to spans 698,453 to 840,581 bp. Several proteins, short and long, are aligned to this locus, including a 1200 aa protein and a 2,069 aa protein (XP_022239675.1). The latter is aligned completely to scaffold61 from 698,453 to 840,581 bp with >90% identity. Here is the line for this alignment from the GFF file:

SCAffold61 blastx protein_match 698453 840581 730 - . ID=SCAffold61:hit:1861:3.10.0.0;Name=XP_022239675.1

However, in the evidence-based run, this gene is split into two fragments. Here are the related lines from the GFF file:

SCAffold61 maker gene 708052 717805 . - . ID=maker-SCAffold61-exonerate_est2genome-gene-0.95;Name=maker-SCAffold61-exonerate_est2genome-gene-0.95
SCAffold61 maker gene 748651 770415 . - . ID=maker-SCAffold61-exonerate_est2genome-gene-0.96;Name=maker-SCAffold61-exonerate_est2genome-gene-0.96

It looks like the Exonerate prediction is based only on ESTs, which are fragmented, and the full-length protein aligned to this locus is completely ignored. We have seen this type of priority for ESTs at other loci too, resulting in split gene predictions (sometimes 3 to 4 fragments) in spite of longer full-length proteins aligning to the assembly. Our ESTs (Trinity-assembled RNA-seq transcripts) were generated from the same individual whose genome was sequenced (and hence the identity is close to 100%). If we align only proteins, Exonerate still splits the gene based on shorter proteins aligned to the locus. I would really appreciate it if you could help us solve this splitting of genes despite the alignment of full-length proteins to the assembly.

I will be glad to send all reference protein and transcript sequences used for annotation, if required.

Thanks for your time and help.

Best regards,

Prashant Shingate, PhD :: Research Fellow :: Comparative and Medical Genomics Lab :: Institute of Molecular and Cell Biology (IMCB) :: Agency for Science, Technology and Research (A*STAR)
61 Biopolis Drive :: #05-04 Proteos :: Singapore 138673 :: DID (+65) 6586 9570 :: Fax (+65) 6779 1117 :: http://www.imcb.a-star.edu.sg/
We advance science and develop innovative technology to further economic growth and improve lives.

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_bopts.ctl
Type: application/octet-stream
Size: 1416 bytes
Desc: maker_bopts.ctl
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.ctl
Type: application/octet-stream
Size: 5150 bytes
Desc: maker_opts.ctl
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: SCAffold61.zip
Type: application/x-zip-compressed
Size: 439697 bytes
Desc: SCAffold61.zip
URL:

From prashantns at imcb.a-star.edu.sg Tue Dec 18 20:20:13 2018
From: prashantns at imcb.a-star.edu.sg (Prashant Narendra SHINGATE)
Date: Wed, 19 Dec 2018 03:20:13 +0000
Subject: [maker-devel] About split genes in MAKER annotation
In-Reply-To: <3E3C2D15-C733-448B-A317-23696AEF6809@gmail.com>
References: <75508AB460A77C4798EC49425637E29264DC6854@PETREL-MA.imcb.a-star.edu.sg> <2C242CC7-B957-4E5D-974E-71D85E67D7B0@gmail.com> <75508AB460A77C4798EC49425637E29264DC97B7@PETREL-MA.imcb.a-star.edu.sg> <75508AB460A77C4798EC49425637E29264DC97F5@PETREL-MA.imcb.a-star.edu.sg> <3E3C2D15-C733-448B-A317-23696AEF6809@gmail.com>
Message-ID: <75508AB460A77C4798EC49425637E29264DC9E36@PETREL-MA.imcb.a-star.edu.sg>

Dear Carson,

Thanks for clarifying that I should not base gene models on the evidence-based prediction but should train AUGUSTUS instead. I will carry out AUGUSTUS training using the rough gene models predicted in the evidence-based run and also follow the entire annotation protocol. I am assuming that even though Exonerate splits genes based on aligned ESTs/proteins, AUGUSTUS will be able to predict full-length genes.

I need your suggestion on AUGUSTUS training. We have ~10,000 full-length transcripts and corresponding proteins (besides a large number of fragments) generated by assembling RNA-seq transcripts from the same individual used for genome sequencing. However, I understand from the AUGUSTUS training tutorial that ~1000 gene models are good enough to train AUGUSTUS. So is it a good idea to choose the rough gene models predicted from the 10,000 full-length transcripts/proteins for training AUGUSTUS, or should I randomly pick 1000 gene models from the entire evidence-based run?

Also, regarding the hint files used as input for AUGUSTUS: should I include BLAST alignments of transcripts (BLASTN) and proteins (BLASTX) in the hint files, or are only exonerate alignments recommended? Please clarify.

Thanks once again for your help and time.

Best Regards,

Prashant Shingate, PhD :: Research Fellow :: Comparative and Medical Genomics Lab :: Institute of Molecular and Cell Biology (IMCB) :: Agency for Science, Technology and Research (A*STAR)
61 Biopolis Drive :: #05-04 Proteos :: Singapore 138673 :: DID (+65) 6586 9570 :: Fax (+65) 6779 1117 :: http://www.imcb.a-star.edu.sg/
We advance science and develop innovative technology to further economic growth and improve lives.

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From carsonhh at gmail.com Wed Dec 19 10:14:25 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 19 Dec 2018 10:14:25 -0700
Subject: [maker-devel] About split genes in MAKER annotation
In-Reply-To: <75508AB460A77C4798EC49425637E29264DC9E36@PETREL-MA.imcb.a-star.edu.sg>
References: <75508AB460A77C4798EC49425637E29264DC6854@PETREL-MA.imcb.a-star.edu.sg> <2C242CC7-B957-4E5D-974E-71D85E67D7B0@gmail.com> <75508AB460A77C4798EC49425637E29264DC97B7@PETREL-MA.imcb.a-star.edu.sg> <75508AB460A77C4798EC49425637E29264DC97F5@PETREL-MA.imcb.a-star.edu.sg> <3E3C2D15-C733-448B-A317-23696AEF6809@gmail.com> <75508AB460A77C4798EC49425637E29264DC9E36@PETREL-MA.imcb.a-star.edu.sg>
Message-ID:

Given that you have plenty of protein evidence aligning well, I would suggest you use protein2genome only, and not est2genome, to build your training set (your est2genome results are more fragmented). You can further filter for canonical start and stop codons as well as protein completeness, which will be in the score column in the GFF3. Once trained, you run MAKER again with augustus_species= set to your newly trained species. MAKER will run Augustus for you and generate the hints for Augustus using the est= and protein= files you provide.

--Carson
-------------- next part -------------- An HTML attachment was scrubbed... URL:
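
The score-based filtering mentioned in the reply above could be sketched roughly as follows. This is only an illustration, not MAKER's own tooling: the function names, the feature type passed in, and the cutoff of 500 are all hypothetical, and the rows are the GFF lines quoted earlier in this thread, rebuilt with tabs on the assumption that the original file was standard tab-delimited GFF3. It does not check start or stop codons, which would need the genome sequence as well.

```python
from typing import Dict, Iterable, Iterator, NamedTuple, Optional


class Feature(NamedTuple):
    seqid: str
    source: str
    type: str
    start: int
    end: int
    score: Optional[float]  # None when the score column holds "."
    strand: str
    attributes: Dict[str, str]


def parse_gff3(lines: Iterable[str]) -> Iterator[Feature]:
    """Yield one Feature per nine-column, tab-delimited GFF3 row."""
    for raw in lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a feature row (for example embedded FASTA)
        attrs = dict(
            pair.split("=", 1) for pair in cols[8].split(";") if "=" in pair
        )
        score = None if cols[5] == "." else float(cols[5])
        yield Feature(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                      score, cols[6], attrs)


def keep_high_scoring(features: Iterable[Feature], feature_type: str,
                      min_score: float) -> Iterator[Feature]:
    """Keep features of one type whose score column is at least min_score."""
    for feat in features:
        if (feat.type == feature_type and feat.score is not None
                and feat.score >= min_score):
            yield feat


# Rows quoted in this thread, re-joined with tabs (an assumption about the
# original file; the archive flattened the whitespace).
EXAMPLE_ROWS = [
    "\t".join(["SCAffold61", "blastx", "protein_match", "698453", "840581",
               "730", "-", ".",
               "ID=SCAffold61:hit:1861:3.10.0.0;Name=XP_022239675.1"]),
    "\t".join(["SCAffold61", "maker", "gene", "708052", "717805",
               ".", "-", ".",
               "ID=maker-SCAffold61-exonerate_est2genome-gene-0.95"]),
    "\t".join(["SCAffold61", "maker", "gene", "748651", "770415",
               ".", "-", ".",
               "ID=maker-SCAffold61-exonerate_est2genome-gene-0.96"]),
]

if __name__ == "__main__":
    # 500.0 is an arbitrary illustrative cutoff, not a recommended value.
    for feat in keep_high_scoring(parse_gff3(EXAMPLE_ROWS),
                                  "protein_match", 500.0):
        print(feat.attributes["Name"], feat.start, feat.end, feat.score)
```

Which feature type and which cutoff make sense depends on how the score is defined for that row type in your MAKER output, so it is worth inspecting a few rows by hand before filtering a whole training set.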