From carsonhh at gmail.com Mon Feb 2 10:35:59 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 2 Feb 2015 09:35:59 -0700
Subject: [maker-devel] genome duplication?
In-Reply-To:
References:
Message-ID: <54B3C36E-5537-4C53-9E2C-F9D60914E786@gmail.com>

MAKER requires every gene to have at least some evidence support. This is very important for most eukaryotes, as false positive predictions will dominate what is called by snap/augustus. However, it is not such a large problem in fungi because of their high gene density and less frequent introns. Setting keep_preds=1 will maximize sensitivity at the cost of specificity (a bad idea in most eukaryotes, but not so much in fungi). I would not be surprised if a bias toward sensitivity is used by most fungal annotation projects, with every gene that can be annotated being annotated (even if it does increase false positives). It is a tactic that can work, at least in fungi.

Also, if the assembly is fragmented, you will be less likely to have evidence support for all genes, as the evidence alignments will not meet the % coverage thresholds in the maker_bopts.ctl file. You may want to separate out your shorter contigs and annotate them separately with more relaxed thresholds for pcov_blast=, pid_blast=, ep_score_limit=, and en_score_limit=.
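The advice to separate out the shorter contigs can be sketched as a small script. This is only a minimal sketch and not part of MAKER: `read_fasta` and `split_by_length` are hypothetical helper names, and the 10 kb default cutoff is taken from the "anything less than 10kb" suggestion given later in this thread. Each output file could then be annotated in its own MAKER run with its own maker_bopts.ctl.

```python
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_by_length(fasta_in: TextIO, long_out: TextIO, short_out: TextIO,
                    cutoff: int = 10_000) -> Tuple[int, int]:
    """Write contigs >= cutoff bp to long_out and shorter ones to short_out.

    Returns (number of long contigs, number of short contigs).
    The cutoff is an assumption; adjust it for your own assembly.
    """
    n_long = n_short = 0
    for header, seq in read_fasta(fasta_in):
        if len(seq) >= cutoff:
            out, n_long = long_out, n_long + 1
        else:
            out, n_short = short_out, n_short + 1
        out.write(f">{header}\n")
        # Wrap sequence lines at 60 columns
        for i in range(0, len(seq), 60):
            out.write(seq[i:i + 60] + "\n")
    return n_long, n_short


# Example use (file names are placeholders):
# with open("assembly.fa") as src, \
#         open("long_contigs.fa", "w") as lo, \
#         open("short_contigs.fa", "w") as sh:
#     print(split_by_length(src, lo, sh, cutoff=10_000))
```

Keeping the two sets in separate files means the relaxed thresholds only ever apply to the short contigs, so the longer contigs keep the stricter defaults.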
--Carson

> On Jan 31, 2015, at 4:21 PM, Jason Stajich wrote:
>
> Xabier -
> FYI - though you probably already compared, those stats are on par with the Hortaea v1 assembly (we do have an improved Hortaea assembly now, and the genome size is still in the same range, supporting the duplication hypothesis).
>
> Hw version 1 assembly:
>   N50 9623; Max 71563
> CEGMA for Hw1:
>               #Prots  %Completeness  -  #Total  Average  %Ortho
>   Complete       196          79.03  -     498     2.54   81.12
>   Partial        228          91.94  -     673     2.95   95.18
>
> Mikael - yes - we should compare notes on the models JGI is calling which have little support in MAKER. I am not sure if their pipeline runs augustus/snap using informant hints, though usually they are bringing RNAseq into the mix. I don't know if your approach for reannotation assembled the RNAseq and used it as evidence?
>
> We'll be trying to assess some of this with comparisons of the proportion of shared genes in the first 1KFG paper, so we may be able to say with more certainty whether these extra predictions are shared more widely, and get a handle on singleton/false positive rates.
>
> Jason
>
> Jason Stajich
> jason.stajich at gmail.com
>
> On Sat, Jan 31, 2015 at 12:51 AM, Xabier Vázquez Campos wrote:
> Thanks Mikael,
>
> These are the assembly stats as taken from abyss-fac; indeed it isn't a great N50, but it isn't that bad either.
>
>   n      n:500  n:N50  min  N80   N50    N20    E-size  max     sum
>   14277  7099   1185   500  4698  10771  20438  14530   154519  42.68e6
>
> 2015-01-31 19:42 GMT+11:00 Mikael Brandström Durling:
> Hi Xabier,
>
>> On 31 Jan 2015, at 05:48, Xabier Vázquez Campos wrote:
>>
>> Hi all,
>>
>> One of the fungal genomes I'm annotating is relatively shattered (?), with many contigs/scaffolds, and, based on CEGMA analysis only, it may indicate a potential widespread duplication of the genome.
>>
>> # Statistics of the completeness of the genome based on 248 CEGs #
>>               #Prots  %Completeness  -  #Total  Average  %Ortho
>>   Complete       181          72.98  -     365     2.02   67.40
>>   Partial        230          92.74  -     528     2.30   77.83
>
> Judging from these figures, you seem to have a very fragmented assembly? What N50 have you reached? In my experience, assemblies with an N50 below 5-10 times the average gene length tend to give problems in producing good gene sets. Not to say that the gene sets are unusable, but for comparing e.g. gene complements to other species, it will be hard to draw any conclusions when a high proportion of the genes are incomplete.
>
>> The expected genome size is relatively low (~42 Mb by abyss-fac) in comparison with Hortaea werneckii (51.6 Mb, 23333 genes), a related fungus with nearly 90% of its genes present in at least two copies.
>> Paper: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0071328
>>
>> Now to the Maker part... As part of the Maker annotation, I trained SNAP and Augustus, and I generated a specific RepeatModeler library. I recorded the predicted outputs from each Maker run (AED, number of predicted proteins and transcripts...). Both Augustus and SNAP gave quite high numbers (~19000 and ~23000 respectively) in comparison with the xxx.all.maker.proteins.fasta (about 13600). So, my first question is, how does maker deal with gene duplications? Or is this just a phenomenon given that there is no support from the protein files provided initially to Maker? I've used 4 different protein files for the annotation; could it be that they weren't the best choices? I picked them from the closest relatives and similar environments.
>
> Unless you by mistake filter out duplicated gene families as repeats with repeat modeler, maker should not care about duplicated genes. However, maker, without keep_preds=1, reports only genes with some kind of support (be it EST or protein homology). This is rather conservative, but if you enable keep_preds, you will get more genes, as you have noted. Just for the sake of comparison, I have reannotated more than ten genomes downloaded from JGI, providing MAKER with similar evidence as JGI, and MAKER consistently reports fewer gene models. I have yet to do a more thorough comparison to tell what genes JGI are reporting that don't appear in the MAKER annotations.
>
>> So, in my last run I turn the keep_preds=1 and the proteins in the xxx.all.maker.proteins.fasta reached to
>>
>> Last question regarding the protein files. I download the annotated genomes from the JGI, and most of them have two annotation folders, "All_models,_Filtered_and_Not" and "Filtered_Models___best__". I've been using the protein files found in the latter, as I expected them to have real evidence and a lower chance of predicting false genes. Am I right?
>
> Yes, I would say so. The FilteredModels have passed through their model selection pipeline, while all_models contains models from all predictors, as well as combinations of predictors and EST evidence.
>
> Just some 2 cents of observations of mine,
> cheers,
> Mikael
>
>> Thank you in advance,
>>
>> Xabier
>>
>> --
>> Xabier Vázquez Campos
>> PhD Candidate
>> Water Research Centre
>> School of Civil and Environmental Engineering
>> The University of New South Wales
>> Sydney NSW 2052 AUSTRALIA
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From carsonhh at gmail.com Mon Feb 2 10:40:06 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 2 Feb 2015 09:40:06 -0700
Subject: [maker-devel] How to improve the result of Maker
In-Reply-To:
References: <492A6635-67E9-4700-B544-E137C4248E55@gmail.com>
Message-ID: <1F69B446-8899-41BE-BFB8-5DA61BB359A8@gmail.com>

When you add a new exon, apollo will always recalculate the reading frame to take the longest ORF, so even though the first exon might not be the same, the other exons don't allow for a longer ORF either. The ORF you got was therefore the longest possible given any combination of all exons (even if the first exon would have been made UTR). That confirms my suspicion that this particular exon was ignored because it breaks any possible reading frame. It likely contains an assembly error.
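The longest-ORF reasoning above can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not Apollo's actual implementation: `splice` and `longest_orf` are hypothetical helpers, only the forward strand is considered, and an ORF is taken to be an ATG followed in frame by the first stop codon. The toy sequence and exon coordinates are made up for illustration.

```python
from typing import List, Optional, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def splice(genome: str, exons: List[Tuple[int, int]]) -> str:
    """Join exon sequences; exons are 0-based, half-open (start, end) pairs."""
    return "".join(genome[start:end] for start, end in exons)


def longest_orf(mrna: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the longest ATG..stop ORF over the three
    forward frames (end includes the stop codon), or None if there is none."""
    best = None
    for frame in range(3):
        orf_start = None
        for i in range(frame, len(mrna) - 2, 3):
            codon = mrna[i:i + 3]
            if orf_start is None and codon == "ATG":
                orf_start = i
            elif orf_start is not None and codon in STOP_CODONS:
                if best is None or (i + 3 - orf_start) > (best[1] - best[0]):
                    best = (orf_start, i + 3)
                orf_start = None
    return best


# Toy locus: exon1, intron, exon2 (4 bp, carries an in-frame stop), intron, exon3
genome = "ATGGCC" + "NNNN" + "TAGC" + "NNNN" + "AAAGGGTTTTGA"
model_without = [(0, 6), (18, 30)]
model_with = [(0, 6), (10, 14), (18, 30)]

orf_without = longest_orf(splice(genome, model_without))
orf_with = longest_orf(splice(genome, model_with))
```

In this toy case the model without the middle exon keeps an 18 bp ORF, while adding the exon truncates the ORF to 9 bp and leaves the rest of the transcript as UTR, which mirrors the behaviour described above.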
--Carson

> On Jan 31, 2015, at 8:54 AM, ??? wrote:
>
>> There are two possibilities. Given how different the snap and augustus models are from one another, this would suggest they have not been trained appropriately (for example, if you are picking another related organism's parameter file rather than training these programs, there are several assumptions being made that can actually make such an approach almost worse than just picking a parameter file at random). But more likely the evidence-supported exon breaks the reading frame of the model. This usually indicates that you have an assembly error (possibly issues with homopolymers). No amount of evidence support will allow you to call an exon that generates a mis-sense causing frameshift, so the predictors do the next most reasonable thing: they drop the exon if another model is tenable. More concerning would be the mRNA-seq alignments near the 3' end of the gene call. The structure suggests significant capture of background transcription with the mRNA-seq reads (long UTRs with weird mini-introns). I would suggest not using cufflinks in this case. You should probably go with an assembly based approach for the mRNA-seq reads instead. I would suggest using trinity. It will reduce sensitivity but greatly increase evidence specificity, which is where you need the most improvement based on these images. I would also suggest using the jaccard_clip option with trinity.
>>
>> I would further suggest looking at the model in question using apollo, and manually adding the exon (click and drag it into the model). You can examine the reading frame after adding the exon and see if it is in fact a frameshift assembly error. If it's a homopolymer derived frameshift, then you can expect a lot more of these throughout your assembly.
>
> I dragged the exon into the model; there is a stop codon in it, which causes the region behind it to become UTR, here: [image not preserved]
>
> The exon in question is marked by the red arrow. But the uppermost evidence is the complete EST from NCBI, and it contains a start and a stop codon. Then I noticed that the 5' boundary of the 2nd exon in the model is not the same as in the EST, so it makes a frameshift and causes the stop codon in the exon marked by the red arrow. The first exon should not be CDS, as there would be a start codon in the 2nd exon if its 5' boundary were predicted correctly. Would "always_complete=1" fix it?
>
> I will try to use trinity.
>
>> Also I do not see any protein alignments here? MAKER cannot work on transcript evidence alone. You need to provide the full proteome of at least two other species (they don't have to be that closely related, but closer is better). Protein alignments will also help you better interpret the coding status of exons supported by mRNA-seq. For example, in the second image, you would expect protein evidence to support all the coding exons but not the UTR exons, which would remove any doubt as to whether an exon is really UTR or not.
>
> I did use 3 sources of protein evidence: one is the proteome from a related species, one is the proteome from fruitfly, and the last one is Swiss-Prot.
>
> Thank you very much!
>
> Best regards,
> Wenbo

From xvazquezc at gmail.com Mon Feb 2 15:49:02 2015
From: xvazquezc at gmail.com (Xabier Vázquez Campos)
Date: Tue, 3 Feb 2015 08:49:02 +1100
Subject: [maker-devel] genome duplication?
In-Reply-To: <54B3C36E-5537-4C53-9E2C-F9D60914E786@gmail.com>
References: <54B3C36E-5537-4C53-9E2C-F9D60914E786@gmail.com>
Message-ID:

Thanks Carson. Any suggestion on the size limit to separate the short contigs, e.g. <500 bp?

From carsonhh at gmail.com Mon Feb 2 15:50:02 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 2 Feb 2015 14:50:02 -0700
Subject: [maker-devel] genome duplication?
In-Reply-To:
References: <54B3C36E-5537-4C53-9E2C-F9D60914E786@gmail.com>
Message-ID: <441998AE-D660-485F-BAFD-44BD50765156@gmail.com>

Anything less than 10 kb.

--Carson

From jgallant at msu.edu Tue Feb 3 12:13:13 2015
From: jgallant at msu.edu (Jason Gallant)
Date: Tue, 03 Feb 2015 10:13:13 -0800 (PST)
Subject: [maker-devel] Est2Genome Problems
Message-ID: <1422987193321.4df3c9d5@Nodemailer>

Hi Folks,

I've nearly succeeded at getting MAKER to run on AWS. I've been checking the output files, and have noticed that none of my RNAseq data was incorporated in the run. I used Cufflinks to perform alignments of libraries from several tissues, ran the accessory script cufflinks2gff3 for each tissue, then concatenated the resulting gff3 files. I even ran the accessory script gff3merge to check that the resulting file was properly formatted.

For options, I set est2genome=1 and est_gff=cufflinks.gff.
I only get protein2genome and repeatmasker evidence in my resulting maker gff3 file, and the genes predicted by these. Is there another option that I need to enable in order to use my est_gff file? I'm trying to get a set of genes to train the predictors for my next step.

Any help would (as always) be greatly appreciated!

Best,
Jason Gallant

--
Dr. Jason R. Gallant
Assistant Professor
Room 38 Natural Sciences
Department of Zoology
Michigan State University
East Lansing, MI 48824
jgallant at msu.edu
office: 517-884-7756

From avhoeck at SCKCEN.BE Thu Feb 5 08:37:27 2015
From: avhoeck at SCKCEN.BE (Van Hoeck Arne)
Date: Thu, 5 Feb 2015 14:37:27 +0000
Subject: [maker-devel] Maker-P with RNAseq (denovo assembly or mapping to reference?)
Message-ID: <9BCA01D5BDC2AF46822CA182B4FBD0DF117763B0@MAILSRV3.sck.be>

Dear,

I have read the manuscript on the MAKER-P tool. I'm finishing the last part of my plant genome assembly, and MAKER-P will be used to determine gene annotations. I have read the wiki tutorials and your manuscript, but one thing comes to my mind on evidence sources when RNAseq data is available for gene discovery (like we have):

Why do you perform a de novo RNA assembly with individual RNAseq runs? As I understand from your paper (a tool kit for the rapid creation, ...), the RNAseq reads were short single reads. This implies that it is not easy for de novo RNA assemblers (like trinity) to find high quality genes, since a lot of reads will be lost because of the low coverage. If 2*200 PE reads were used for RNAseq, this would yield more genes.

Therefore, as an alternative: why not map the RNAseq reads to your reference genome? Extract this as a fasta file and cut off all the reads shorter than 150 bp. If you concatenate all the RNAseq data into one file before the mapping, all your expressed RNA is mapped and included in the fasta file that you need for gene discovery.

In my opinion, the low-expressed reads will be lost during the assembly. Or am I overlooking something?
We have sequenced our genome with PE illumina data. For DE analyses, we ran 48 single-read (50 bp) RNAseq runs at lower coverage.
Therefore, is assembling the RNAseq data still better than mapping the data to our reference genome?

PS: can I add a question on the google group? I couldn't start a new topic.

Thanks in advance,
Arne Van Hoeck

From carsonhh at gmail.com Thu Feb 5 10:27:41 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 5 Feb 2015 09:27:41 -0700
Subject: [maker-devel] Maker-P with RNAseq (denovo assembly or mapping to reference?)
In-Reply-To: <9BCA01D5BDC2AF46822CA182B4FBD0DF117763B0@MAILSRV3.sck.be>
References: <9BCA01D5BDC2AF46822CA182B4FBD0DF117763B0@MAILSRV3.sck.be>
Message-ID: <3A5B068C-C6F2-479C-8874-0C0C0F008FBD@gmail.com>

There is no requirement or even specification in MAKER that transcript data be sequenced or processed in any specific way. How this is done is completely dependent on the user and the project (different strategies work better in different organisms). The only requirement for MAKER is that you have some form of transcript evidence. The utility of transcript evidence is primarily for identification of introns and splice sites. If evidence is supplied as fasta sequence, then it will be aligned around splice sites using exonerate (a splice aware aligner). If the evidence is processed via another tool, you are expected to supply it in GFF3 format with the evidence already aligned around splice sites (examples of this include cufflinks and BLAT). How you generate these FASTA files or GFF3 files is up to you.
Remember that final models are not strictly based on the transcript data (It?s just of piece of the puzzle). Rather final models will be generated from the combination of transcript alignments, protein alignments, and the HMMs from algorithms such as SNAP and Augustus that must have been trained on the species being annotated. With respect to most effective methods from processing mRNA-seq data, that depends on many factors. But in general read mapping to the genome will result in greater sensitivity but lower specificity. There will be a lot of false alignments and alignments of background transcription (>98% of the genome is expressed at detectable levels, so it is not just the genes). Assembly based methods on the other hand result in very high specificity, but lower sensitivity. Assembly based methods can also better resolve issues of overlapping UTR and genes overlapping on opposite strand than the alignment based methods (trinity?s jaccard_clip option for example). A loss in sensitivity can also be compensated for with matched protein data and well trained HMMs. What analysis methods are are more effective will be determined by factors such as gene density, average intron lengths, repeat content, etc. You can try both methods and see which appears to work better for your organism. Both have their trade offs. For sequencing, because of the physics involved in RNA folding, fragmentation of the transcriptome improves sensitivity but can also precludes the use of long paired end reads (this is because fragments end up being too short). Short reads also reduce specificity and can align more randomly to the genome (but the gain in sensitivity from fragmentation usually far outweighs the loss of specificity from the short reads). Larger fragments of RNA or no fragmentation of the RNA allows for longer reads and paired end sequencing but will result in lower sensitivity and recovery of the transcriptome. 
This is because RNA folding, which occurs on larger fragments, physically blocks reverse transcriptase. You will also end up recovering mostly the 5' end of genes because the inhibition of reverse transcriptase results in a 5' bias to the sequencing reaction. The read coverage graphs end up looking like ski slopes, and longer genes will not get coverage on the 3' end.

There are different strategies for transcriptome sequencing and sample preparation to deal with strandedness and sequencing bias. Each has its trade-off, and what strategy you use depends on what you plan on using the sequenced reads for as well as the structure of that organism's genome and transcriptome.

--Carson

> On Feb 5, 2015, at 7:37 AM, Van Hoeck Arne wrote:
>
> Dear,
>
> I have read the manuscript on the MAKER-P tool. I'm finishing the last part of my plant genome assembly and MAKER-P will be used to determine gene annotations. I have read the wiki tutorials and your manuscript but one thing comes to my mind on evidence sources when RNAseq data is available for gene discovery (like we have):
>
> Why do you perform an RNA de novo assembly with individual RNAseq runs? As I understand from your paper (a tool kit for the rapid creation, ...), the RNAseq reads were short single reads. This implies that it is not easy for de novo RNA assemblers (like Trinity) to find high quality genes since a lot of reads will be lost because of the low coverage. If 2*200 PE reads were used for RNAseq, this would create more genes.
>
> Therefore, as an alternative: why not map the RNAseq reads to your reference genome? Extract this as a fasta file and cut off all the reads shorter than 150 bp. If you concatenate all the RNAseq data into one file before the mapping, all your expressed RNA is mapped and included in the fasta file that you need for gene discovery.
>
> In my opinion, the low-expressed reads will be lost during the assembly. Or am I overlooking something?
> We have sequenced our genome with PE Illumina data. For DE analyses, we ran 48 single-read (50 bp) runs with lower coverage.
> Therefore, is assembling the RNAseq data still better than mapping the data to our reference genome?
>
> PS: can I add a question on the google group? I couldn't start a new topic
>
> Thanks in advance,
> Arne Van Hoeck
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From carsonhh at gmail.com Thu Feb 5 11:22:12 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 5 Feb 2015 10:22:12 -0700
Subject: [maker-devel] Maker-P with RNAseq (denovo assembly or mapping to reference?)
In-Reply-To: <9BCA01D5BDC2AF46822CA182B4FBD0DF11776439@MAILSRV3.sck.be>
References: <9BCA01D5BDC2AF46822CA182B4FBD0DF117763B0@MAILSRV3.sck.be> <3A5B068C-C6F2-479C-8874-0C0C0F008FBD@gmail.com> <9BCA01D5BDC2AF46822CA182B4FBD0DF11776439@MAILSRV3.sck.be>
Message-ID: <19D25327-4D46-44B4-854B-1BEEFBD23C06@gmail.com>

I find that erring on the side of specificity works better for most annotation projects. But this is not always true, and you can try a few large contigs using an alignment approach like cufflinks and compare it to an assembly approach like Trinity to decide which appears to perform better. Also, you need to take into account the ultimate goal of the project. Some projects want to annotate absolutely everything and don't care about false positives, while others want to maximize specificity and care more about avoiding bad models.
Oftentimes this has to do with some planned downstream experiment that would be adversely affected by one or the other. I tend to prefer high specificity because MAKER's automated approach to re-annotation means that if evidence ever presents itself later on that a real gene is missing, then that evidence automatically supports inclusion of the gene in the next automated release of the genome. But false models tend to persist and are harder to get rid of even though they lack any evidence support. These false models produced by sensitivity-focused approaches then tend to poison downstream experiments and lead to more time being wasted by researchers. This is seen a lot in plant genomes, where transposons and pseudogenes tend to pollute genome releases for historical reasons. Basically, once they are in a genome release, the burden of proof for removing them becomes higher than if they had never been included in the first place. Researchers unaware of this may find they have been studying a transposon for weeks or months because some expression or variant analysis early on listed it as a candidate gene for some desired phenotype.

MAKER can handle several hundred thousand contigs in the assembly, but in general contigs smaller than 10 kb will not be annotatable (although smaller contigs can be used for gene-dense organisms with short introns). It is better to exclude these short contigs from the analysis for processing efficiency.

--Carson

> On Feb 5, 2015, at 9:52 AM, Van Hoeck Arne wrote:
>
> Thanks for this comprehensive and clear answer, Carson.
>
> So in conclusion, it's better to make a concise file with very accurate transcripts (assembly method) instead of a large file of possible transcripts (mapping RNAseq data to the reference), which contains more false positives.
>
> Another small question: can MAKER handle a lot of contigs (around 10,000), or is it better to make artificial chromosomes by pasting contigs to each other with a certain number of N's (let's say 1000 > exon length)?
>
> Thanks a lot for your quick response
> Arne

From chenwenbo1020 at gmail.com Mon Feb 9 17:20:34 2015
From: chenwenbo1020 at gmail.com (=?UTF-8?B?6ZmI5paH5Y2a?=)
Date: Mon, 9 Feb 2015 18:20:34 -0500
Subject: [maker-devel] How to set "calculate longest ORF" in Maker
Message-ID:

Greetings,

I noticed some cases in the output of Maker where the ORF is not the longest one, e.g. the one below

[image 1]

If I manually correct it in Apollo with "calculate longest ORF", then it becomes

[image 2]

I think the updated one makes more sense. So how can I let Maker output the longest ORF automatically?

Thanks very much!

Wenbo

From dence at genetics.utah.edu Mon Feb 9 18:14:45 2015
From: dence at genetics.utah.edu (Daniel Ence)
Date: Tue, 10 Feb 2015 00:14:45 +0000
Subject: [maker-devel] How to set "calculate longest ORF" in Maker
In-Reply-To:
References:
Message-ID:

Hi, In the images that you sent, it looks like the ab-initio predictor had predicted two ORFs, while the evidence supported a single model. MAKER doesn't have an option to prefer longer models; its metric is to choose the prediction that is best supported by the alignment evidence.

How many ab-initio predictors did you use in generating the results that you sent us?
It looks like you only used one, which won't give good results.

~Daniel

From chenwenbo1020 at gmail.com Mon Feb 9 20:06:31 2015
From: chenwenbo1020 at gmail.com (=?UTF-8?B?6ZmI5paH5Y2a?=)
Date: Mon, 9 Feb 2015 21:06:31 -0500
Subject: [maker-devel] How to set "calculate longest ORF" in Maker
In-Reply-To:
References:
Message-ID:

Hi Daniel,

Thank you very much for the suggestion.

I used three predictors: SNAP, Augustus and pred_gff.
From the image, I thought the prediction of Augustus (shown in pink) matched the evidence better. Why did MAKER choose SNAP's?

[image 1]

Thanks again
Best regards,
Wenbo

From carsonhh at gmail.com Mon Feb 9 20:22:46 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 9 Feb 2015 19:22:46 -0700
Subject: [maker-devel] How to set "calculate longest ORF" in Maker
In-Reply-To:
References:
Message-ID: <669148EB-B537-4004-B323-D119B1056269@gmail.com>

The gene model from MAKER is restricted to use the reading frame of the ab initio predictor. The better model would use a different reading frame. The Augustus model has a missing exon, so it gets a lower score. SNAP in general just looks bad. I'd say it needs to be retrained, or maybe just drop SNAP from the analysis.

--Carson

From carsonhh at gmail.com Tue Feb 10 10:56:53 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 10 Feb 2015 09:56:53 -0700
Subject: [maker-devel] Est2Genome Problems
In-Reply-To: <1422987193321.4df3c9d5@Nodemailer>
References: <1422987193321.4df3c9d5@Nodemailer>
Message-ID: <119684F8-8071-4318-A129-3D90EC54242A@gmail.com>

I ran a few est2genome runs with a cufflinks file I just generated and did not get any issues for EST-based gene models. I'd like to at least have your test set to see if I can duplicate what you are seeing. Use this to upload the job files, then I can just run it from my server here --> http://weatherby.genetics.utah.edu/cgi-bin/mwas/bug.cgi

--Carson

> On Feb 3, 2015, at 11:13 AM, Jason Gallant wrote:
>
> Hi Folks,
>
> I've nearly succeeded at getting MAKER to run on AWS... I've been checking the output files, and have noticed that none of my RNAseq data was incorporated on the run.
> I used Cufflinks to perform alignments of libraries from several tissues, ran the accessory script cufflinks2gff3 for each tissue, then concatenated the resulting gff3 files. I even ran the accessory script gff3merge to check that the resulting file was properly formatted.
>
> For options, I set est2genome=1 and est_gff=cufflinks.gff. I only get protein2genome and repeatmasker evidence in my resulting maker gff3 file, and the genes predicted by these. Is there another option that I need to enable in order to use my est_gff file? I'm trying to get a set of genes to train the predictors for my next step.
>
> Any help would (as always) be greatly appreciated!
>
> Best,
> Jason Gallant
>
> --
> Dr. Jason R. Gallant
> Assistant Professor
> Room 38 Natural Sciences
> Department of Zoology
> Michigan State University
> East Lansing, MI 48824
> jgallant at msu.edu
> office: 517-884-7756

From jason.r.gallant at gmail.com Tue Feb 10 13:54:46 2015
From: jason.r.gallant at gmail.com (Jason Gallant)
Date: Tue, 10 Feb 2015 11:54:46 -0800 (PST)
Subject: [maker-devel] Using MAKER MPI on Amazon Cloud
Message-ID: <1423598085704.ad38b0a2@Nodemailer>

Hi Folks,

My experiments with AWS and MAKER have proven successful. I've been working on documenting the experiment on my blog, which is now published here:

http://efish.zoology.msu.edu/running-maker-genome-annotation-on-starcluster/

Please let me know if you find this useful, or have any questions or feedback!

Best Regards,
Jason Gallant

--
Dr. Jason R. Gallant
Assistant Professor
Room 38 Natural Sciences
Department of Zoology
Michigan State University
East Lansing, MI 48824
jgallant at msu.edu
office: 517-884-7756

From carsonhh at gmail.com Tue Feb 10 14:04:15 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 10 Feb 2015 13:04:15 -0700
Subject: [maker-devel] Using MAKER MPI on Amazon Cloud
In-Reply-To: <1423598085704.ad38b0a2@Nodemailer>
References: <1423598085704.ad38b0a2@Nodemailer>
Message-ID: <8D30117C-88DF-4170-9CD8-590AAB79594D@gmail.com>

This is awesome. Thanks for going through all the pain of figuring that out. I am definitely going to have to try annotating something through AWS now just to see how it compares to running on a local cluster.

--Carson

From bmoore at genetics.utah.edu Tue Feb 10 21:37:51 2015
From: bmoore at genetics.utah.edu (Barry Moore)
Date: Wed, 11 Feb 2015 03:37:51 +0000
Subject: [maker-devel] Using MAKER MPI on Amazon Cloud
In-Reply-To: <1423598085704.ad38b0a2@Nodemailer>
References: <1423598085704.ad38b0a2@Nodemailer>
Message-ID: <695E9440-08A7-4A68-8414-3AD140929635@genetics.utah.edu>

This is very cool Jason - thanks so much for this outstanding contribution to the annotation community! I've added a prominent link to your tutorial off of the MAKER wiki front page.

B

From myandell at genetics.utah.edu Tue Feb 10 22:08:25 2015
From: myandell at genetics.utah.edu (Mark Yandell)
Date: Wed, 11 Feb 2015 04:08:25 +0000
Subject: [maker-devel] Using MAKER MPI on Amazon Cloud
In-Reply-To: <695E9440-08A7-4A68-8414-3AD140929635@genetics.utah.edu>
References: <1423598085704.ad38b0a2@Nodemailer>, <695E9440-08A7-4A68-8414-3AD140929635@genetics.utah.edu>
Message-ID: <7A60AB257EFF2B48B1F4C814817EA053E372B1C5@mxb2.hg.genetics.utah.edu>

Thanks so much Jason. Very informative and helpful for everyone. Cheers!

--mark

Mark Yandell
Professor of Human Genetics
H.A. & Edna Benning Presidential Endowed Chair
Co-director USTAR Center for Genetic Discovery
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330
ph:801-587-7707

From scott at scottcain.net Fri Feb 13 13:53:09 2015
From: scott at scottcain.net (Scott Cain)
Date: Fri, 13 Feb 2015 14:53:09 -0500
Subject: [maker-devel] Maker tutorial
In-Reply-To:
References:
Message-ID:

Hi Won,

I'm cc'ing your email to the MAKER mailing list; I'm sure they'll be able to help you find the file. If you want the AMI, I'm reasonably sure the one you're looking for is ami-907e97f8 in US-east.

Scott

On Fri, Feb 13, 2015 at 2:43 PM, Won C Yim wrote:
>
> To whom it may concern,
>
> Hello!
>
> My name is Won Cheol Yim, postdoc at the University of Nevada, Reno.
>
> I tried to find the maker_tutorial files but I can't.
>
> Here is the online web site.
> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014
>
> I just want to get the maker_tutorial folder.
>
> I tried to connect to Amazon EC2 but there's no AMI.
>
> Thank you for your help.
>
> Won
> --
> Yim, Won Cheol
> MS330/Department of Biochemistry & Molecular Biology
> 1664 N. Virginia Street
> University of Nevada, Reno
> email: wyim at unr.edu

--
------------------------------------------------------------------------
Scott Cain, Ph. D. scott at scottcain dot net
GMOD Coordinator (http://gmod.org/) 216-392-3087
Ontario Institute for Cancer Research

From carsonhh at gmail.com Fri Feb 13 17:38:15 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 13 Feb 2015 16:38:15 -0700
Subject: [maker-devel] Maker tutorial
In-Reply-To:
References:
Message-ID:

Yes. You have to go to the EC2 management console (US East) and search for the AMI --> https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#Images:

Change the search options from "Owned by me" to "Public images" before you do the search. Then search for ami-907e97f8

You can see this on the GMOD MAKER course video where I do this at about the 58 minute timepoint --> http://youtu.be/uA96tSSaqLk

Make sure to increase the resolution to 1080p on the video.

--Carson
From kai.kamm at ecolevol.de Wed Feb 18 09:30:23 2015
From: kai.kamm at ecolevol.de (Kai Kamm)
Date: Wed, 18 Feb 2015 16:30:23 +0100
Subject: [maker-devel] Improving gene prediction with Augustus and SNAP
Message-ID:

An HTML attachment was scrubbed...

From carsonhh at gmail.com Thu Feb 19 13:28:22 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 19 Feb 2015 12:28:22 -0700
Subject: [maker-devel] Improving gene prediction with Augustus and SNAP
In-Reply-To:
References:
Message-ID:

I would recommend just using the Trinity assembly. The cufflinks results tend to be messy. You shouldn't need the est2genome or protein2genome results if you already trained using CEGMA results. You can then do one MAKER run (can be on just part of the genome) where you use both SNAP and Augustus as the predictors (est2genome and protein2genome should be turned off), and then give these results back to SNAP to train with again. This second round of bootstrap training is usually beneficial to SNAP (beyond two rounds doesn't really help).

Also, don't concatenate with previous training sets for the second round of bootstrap training. The idea is that the second round of training genes will be more correct than the first round, so you want to use them instead.

When you are done, look at one of the larger contigs in a viewer like Apollo and compare the raw Augustus calls, raw SNAP calls, and the evidence-aware Augustus and SNAP calls produced by MAKER. If SNAP and Augustus are properly trained, then they will produce similar calls, and they will also be similar to the evidence-aware calls from MAKER (this convergence is the result of the training). If one predictor seems to produce calls that are still very divergent, then just drop that predictor from the analysis. A bad predictor will make all results worse.
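
The visual comparison of predictor calls described above can also be approximated numerically. A minimal sketch, assuming the raw ab initio predictions appear in the MAKER GFF3 output as `match` features whose source column is `snap_masked` or `augustus_masked`; those source names, the 0.8 reciprocal-overlap threshold, and the helper functions are assumptions for illustration, not MAKER features:

```python
def parse_matches(gff3_text, source):
    """Collect (seqid, strand, start, end) for 'match' features from one source."""
    spans = []
    for line in gff3_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 9:
            continue  # skip malformed or non-feature lines
        if cols[1] == source and cols[2] == "match":
            spans.append((cols[0], cols[6], int(cols[3]), int(cols[4])))
    return spans


def reciprocal_overlap(a, b):
    """Smaller of the two overlap fractions; 0.0 if on different contigs or strands."""
    if a[0] != b[0] or a[1] != b[1]:
        return 0.0
    shared = min(a[3], b[3]) - max(a[2], b[2]) + 1  # GFF3 coordinates are inclusive
    if shared <= 0:
        return 0.0
    return min(shared / (a[3] - a[2] + 1), shared / (b[3] - b[2] + 1))


def agreement(spans_a, spans_b, min_overlap=0.8):
    """Fraction of calls in spans_a with a reciprocal-overlap partner in spans_b."""
    if not spans_a:
        return 0.0
    hits = sum(
        1 for a in spans_a
        if any(reciprocal_overlap(a, b) >= min_overlap for b in spans_b)
    )
    return hits / len(spans_a)


if __name__ == "__main__":
    # Tiny hand-made example; real input would be the merged MAKER GFF3 for one contig.
    rows = [
        ("chr1", "snap_masked", "match", "100", "1000", ".", "+", ".", "ID=s1"),
        ("chr1", "snap_masked", "match", "5000", "6000", ".", "+", ".", "ID=s2"),
        ("chr1", "augustus_masked", "match", "120", "1000", ".", "+", ".", "ID=a1"),
        ("chr1", "augustus_masked", "match", "5000", "6000", ".", "-", ".", "ID=a2"),
    ]
    text = "\n".join("\t".join(r) for r in rows)
    snap = parse_matches(text, "snap_masked")
    aug = parse_matches(text, "augustus_masked")
    print(agreement(snap, aug))  # 0.5: one of the two SNAP calls has an Augustus partner
```

A low agreement value for one predictor against both the other predictor and the evidence-aware models would be a hint that it needs retraining or should be dropped, but a viewer remains the better way to judge exon-level differences.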
--Carson

> On Feb 18, 2015, at 8:30 AM, Kai Kamm wrote:
>
> Hello
> I have just started in this field of research and I want to annotate my assembled non-bilaterian invertebrate genome with Maker (100Mb in 7000 scaffolds).
>
> I have read the Maker tutorials but I am still a little uncertain about the iterative procedure. What I have already done is:
>
> - trained Augustus (using the web service) on the reference genome of a closely related species and its published dataset of "best transcripts" which are mainly based on gene prediction and some EST evidence. The published ESTs themselves were rejected from Augustus as being not sufficient for training (too few long transcripts).
> - trained SNAP with the CEGMA-output of my genome
> - assembled RNA-seq data with tophat/cufflinks and generated gff-file with cufflinks2gff
> - de novo assembled RNA-seq data with Trinity
>
> I have already done some preliminary Maker runs with initially trained Augustus, SNAP and some protein evidence which had good results.
>
> Now my strategy is:
>
> running maker with
> - the est2genome option using the cufflinks gff and the Trinity transcripts as EST evidence
>
> - the protein2genome option using a protein file including all proteins of the closely related species, a less related non-bilaterian species and a collection of reviewed Swiss-Prot entries from one representative mammal and all protostomes
>
> - Augustus and SNAP for gene prediction
>
> When this is done I want to:
>
> - create 2nd training set for SNAP from the merged gffs with maker2zff
> - train Augustus again with the Maker transcripts using the Augustus web service
>
> And run Maker again
>
> Is this a reasonable procedure? Or am I missing some important aspects here?
> Thanks in advance
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From marc.hoeppner at imbim.uu.se Fri Feb 20 08:53:38 2015
From: marc.hoeppner at imbim.uu.se (Marc Höppner)
Date: Fri, 20 Feb 2015 14:53:38 +0000
Subject: [maker-devel] Single exon EST and UTR annotation?
Message-ID: <6F6BEE43-BA1D-4849-9DC7-7DC74984939E@imbim.uu.se>

Hi,

we are currently annotating a fungus with essentially no introns. Training of augustus was performed on an "evidence" build and resulting performance of the abinit profile was very good overall. But it seems that maker never makes UTRs for these single exon genes, despite setting:

single_exon=1
single_length=100

I can see in the resulting gff files that the cufflinks transcripts were processed through and visual inspection of the entire data set in WebApollo suggests that we should see UTRs for most genes. Yet there are none.

(we have used Maker before, so basic usage is familiar, and we have never seen this issue until this project).

Maker version is 2.31-8

Is it verified that the single_exon option works? Are there any other not-so-obvious reasons for the behaviour we see here?

Cheers,

Marc

Marc P. Hoeppner, PhD
Team Leader
Department for Medical Biochemistry and Microbiology
Uppsala University, Sweden
marc.hoeppner at imbim.uu.se

From dence at genetics.utah.edu Fri Feb 20 12:01:06 2015
From: dence at genetics.utah.edu (Daniel Ence)
Date: Fri, 20 Feb 2015 18:01:06 +0000
Subject: [maker-devel] Single exon EST and UTR annotation?
In-Reply-To: <6F6BEE43-BA1D-4849-9DC7-7DC74984939E@imbim.uu.se>
References: <6F6BEE43-BA1D-4849-9DC7-7DC74984939E@imbim.uu.se>
Message-ID:

Hi Marc,

Does Maker annotate the single-exon genes and miss the UTRs or is it missing the single-exon genes entirely?

There's another option in the control file called "correct_est_fusion". This option was added because of difficulties we encountered in a fungal genome annotation project where the genome had lots of overlapping UTRs in genes. The overlapping UTRs resulted in lots of fused genes, so the solution was to add this option to limit the use of ESTs in annotating genes. Basically, if "correct_est_fusion" is turned on, Maker won't annotate UTRs that would result in fused gene models.

Are these single-exon genes really close together? That and the setting that I discussed above could explain the lack of UTRs.

~Daniel

> On Feb 20, 2015, at 7:53 AM, Marc Höppner wrote:
>
> Hi,
>
> we are currently annotating a fungus with essentially no introns. Training of augustus was performed on an "evidence" build and resulting performance of the abinit profile was very good overall. But it seems that maker never makes UTRs for these single exon genes, despite setting:
>
> single_exon=1
> single_length=100
>
> I can see in the resulting gff files that the cufflinks transcripts were processed through and visual inspection of the entire data set in WebApollo suggests that we should see UTRs for most genes. Yet there are none.
>
> (we have used Maker before, so basic usage is familiar, and we have never seen this issue until this project).
>
> Maker version is 2.31-8
>
> Is it verified that the single_exon option works? Are there any other not-so-obvious reasons for the behaviour we see here?
>
> Cheers,
>
> Marc
>
> Marc P. Hoeppner, PhD
> Team Leader
> Department for Medical Biochemistry and Microbiology
> Uppsala University, Sweden
> marc.hoeppner at imbim.uu.se

From carsonhh at gmail.com Fri Feb 20 12:07:28 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 20 Feb 2015 11:07:28 -0700
Subject: [maker-devel] Single exon EST and UTR annotation?
In-Reply-To:
References: <6F6BEE43-BA1D-4849-9DC7-7DC74984939E@imbim.uu.se>
Message-ID: <1DB3F36E-CAAE-4394-B6B6-53009E9566B4@gmail.com>

Actually, the issue is that single-exon genes have a higher threshold to meet to get UTR. Also, MAKER will never add spliced UTR to a single-exon gene, so an EST would also have to be single exon and encompass the entire single-exon gene to get UTR. This is done because EST and mRNA-seq data in general are noisy enough that you will get mostly false UTR annotations otherwise. So it is an overly conservative approach because it's the best of all the bad options. For spliced genes, the splice site can be used to confirm concordance of the UTR with the gene, but that can't be done with single-exon calls.

--Carson

> On Feb 20, 2015, at 11:01 AM, Daniel Ence wrote:
>
> Hi Marc,
>
> Does Maker annotate the single-exon genes and miss the UTRs or is it missing the single-exon genes entirely?
>
> There's another option in the control file called "correct_est_fusion". This option was added because of difficulties we encountered in a fungal genome annotation project where the genome had lots of overlapping UTRs in genes. The overlapping UTRs resulted in lots of fused genes, so the solution was to add this option to limit the use of ESTs in annotating genes. Basically, if "correct_est_fusion" is turned on, Maker won't annotate UTRs that would result in fused gene models.
>
> Are these single-exon genes really close together?
> That and the setting that I discussed above could explain the lack of UTRs.
>
> ~Daniel
>
>> On Feb 20, 2015, at 7:53 AM, Marc Höppner wrote:
>>
>> Hi,
>>
>> we are currently annotating a fungus with essentially no introns. Training of augustus was performed on an "evidence" build and resulting performance of the abinit profile was very good overall. But it seems that maker never makes UTRs for these single exon genes, despite setting:
>>
>> single_exon=1
>> single_length=100
>>
>> I can see in the resulting gff files that the cufflinks transcripts were processed through and visual inspection of the entire data set in WebApollo suggests that we should see UTRs for most genes. Yet there are none.
>>
>> (we have used Maker before, so basic usage is familiar, and we have never seen this issue until this project).
>>
>> Maker version is 2.31-8
>>
>> Is it verified that the single_exon option works? Are there any other not-so-obvious reasons for the behaviour we see here?
>>
>> Cheers,
>>
>> Marc
>>
>> Marc P. Hoeppner, PhD
>> Team Leader
>> Department for Medical Biochemistry and Microbiology
>> Uppsala University, Sweden
>> marc.hoeppner at imbim.uu.se

From jgallant at msu.edu Wed Feb 25 10:43:21 2015
From: jgallant at msu.edu (Jason Gallant)
Date: Wed, 25 Feb 2015 08:43:21 -0800 (PST)
Subject: [maker-devel] Evaluating Genome Annotation
Message-ID: <1424882600861.a6109243@Nodemailer>

Hi Folks,

I'm in the process of evaluating the genome annotation that I produced using AWS (see earlier message). This is a de novo genome assembly, for which there is no closely related species.
As such, I followed the standard procedure using only SNAP (for starters).

genome=4,668 scaffolds with N50 > 1.7 Mb
custom repeatmasker database

Round 1:
est2genome=1
protein2genome=1
Concatenated refseq proteins from 3 vertebrates
Cufflinks assembly of 12 tissues (~253,000,000 reads)

Using this entire set of genes, I created a SNAP HMM, following the online tutorial, and ran a second round of Maker:

Round 2:
est2genome=0
protein2genome=0
snaphmm=round1.hmm
Concatenated refseq proteins from 3 vertebrates
Cufflinks assembly of 12 tissues (~253,000,000 reads)

I used the resulting genes to train a second SNAP HMM, as suggested by the tutorial, and ran a third round of Maker:

Round 3:
est2genome=0
protein2genome=0
snaphmm=round2.hmm
Concatenated refseq proteins from 3 vertebrates
Cufflinks assembly of 12 tissues (~253,000,000 reads)

I'm concerned that the multiple iterations did not really improve my annotation. Here are some of the metrics that I've been able to calculate thus far:

Using Fathom:
Round 1 contains 44,883 genes (43,364 multi-exon) over 1410 sequences
Round 2 contains 15,946 genes (15,812 multi-exon) over 1166 sequences
Round 3 contains 15,514 genes (15,389 multi-exon) over 1147 sequences

Using the AED_cdf_generator.pl script, I was able to calculate the cumulative AED scores for each round. I suspect on the first round, this distribution is meaningless since the gene models are calculated directly from evidence. Interestingly, rounds 2 and 3 had remarkably similar AED scores throughout the table: in round 2, 92% of my genes had an AED score of 0.5 or lower, whereas in round 3, 91.9% had an AED score of 0.5 or lower.
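
For readers who want to reproduce the "percent of genes at or below AED 0.5" figure without the bundled script, here is a minimal sketch of the underlying calculation. It assumes that each mRNA feature in the MAKER GFF3 carries an `_AED=` attribute in column 9; the toy lines and identifiers are made up for illustration, and this is not the AED_cdf_generator.pl implementation.

```python
import re

# Match "_AED=<number>" at the start of the attribute column or after a ';'.
AED_RE = re.compile(r"(?:^|;)_AED=([0-9.]+)")


def aed_values(gff_lines):
    """Collect the _AED value from every mRNA feature line."""
    values = []
    for line in gff_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "mRNA":
            continue
        match = AED_RE.search(cols[8])
        if match:
            values.append(float(match.group(1)))
    return values


def fraction_at_or_below(values, threshold):
    """Cumulative fraction of transcripts with AED <= threshold."""
    if not values:
        return 0.0
    return sum(v <= threshold for v in values) / len(values)


# Hypothetical toy annotation with four transcripts.
toy_gff = [
    "##gff-version 3",
    "ctg1\tmaker\tgene\t1\t900\t.\t+\t.\tID=g1",
    "ctg1\tmaker\tmRNA\t1\t900\t.\t+\t.\tID=g1-RA;Parent=g1;_AED=0.10;_eAED=0.10",
    "ctg1\tmaker\tmRNA\t1500\t2500\t.\t-\t.\tID=g2-RA;Parent=g2;_AED=0.50;_eAED=0.52",
    "ctg2\tmaker\tmRNA\t10\t700\t.\t+\t.\tID=g3-RA;Parent=g3;_AED=0.75;_eAED=0.80",
    "ctg2\tmaker\tmRNA\t900\t1800\t.\t+\t.\tID=g4-RA;Parent=g4;_AED=0.00;_eAED=0.00",
]

values = aed_values(toy_gff)
for threshold in (0.25, 0.5, 1.0):
    print(threshold, fraction_at_or_below(values, threshold))
```

Running the same function over the GFF3 from each round gives directly comparable cumulative fractions at whatever thresholds are of interest.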

Here are my questions:

1) I suppose I would have expected more genes predicted; instead each iteration seems to produce fewer genes (although the difference between rounds 2 & 3 is very slight).
2) AED distributions are identical between rounds 2 & 3, though from the tutorials and my readings it sounds like there should be a vast improvement.
3) It strikes me that roughly 75% of the genome is not being used (my suspicion is that scaffolds are smaller than the maker threshold for consideration). Should I lower this threshold so that more of the genome is considered?
4) I realize that after round 1 I generated the HMM based on all of the genes predicted by the evidence, and it seems some have taken the approach of restricting this list to the "best" genes, or using something like CEGMA. Is my method of HMM construction to blame?
5) Am I worried about nothing here? Is this a pretty decent annotation?

Thanks for any input you folks are able to provide! Happy annotating!

Jason Gallant

Dr. Jason R. Gallant
Assistant Professor
Room 38 Natural Sciences
Department of Zoology
Michigan State University
East Lansing, MI 48824
jgallant at msu.edu
office: 517-884-7756
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From carsonhh at gmail.com Wed Feb 25 11:25:30 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 25 Feb 2015 10:25:30 -0700
Subject: [maker-devel] Evaluating Genome Annotation
In-Reply-To: <1424882600861.a6109243@Nodemailer>
References: <1424882600861.a6109243@Nodemailer>
Message-ID:

> Here are my questions:
> 1) I suppose I would have expected more genes predicted, instead each iteration seems to produce fewer genes (although very slight difference between rounds 2 & 3

Your first round should over-predict, especially if it is based off of cufflinks results (very noisy).
Your second and third rounds look about right for many organisms (both should be similar in gene count), but if you believe it is low for yours then run CEGMA to estimate your genome's completeness (i.e. if your genome is 85% complete then you expect your final number from MAKER to represent about 85% of the true number of genes). Also you may want to increase your protein database. If the refseq genes you are using represent just a subset of the 3 vertebrate genomes rather than the whole genomes of those organisms, then you will want to get a couple of full genomes to work with. Also, not having a high completion level for a vertebrate genome is not out of the ordinary. In lamprey (an extreme case) the low completion level actually led to the discovery that its cells undergo programmed somatic deletion of about 25% of the genome, and since its genome was sequenced from somatic tissue, that portion was obviously missing from the assembly.

> 2) AED distributions are identical between rounds 2 & 3, though it sounds like there should be a vast improvement from the tutorials and my readings

That's what you expect. The third round should show just minor improvement (AED is not a highly precise number, so a difference of 1% basically means the second and third round results are identical for evidence support). The real improvement from second round to third round is the quality of the unaided SNAP models (you really only get a sense of this by using Apollo to view a few contigs). Because the MAKER models are derived from evidence-based hints, they will always be similar between runs, but the raw SNAP models in round 3 will be much more like the MAKER models than the unaided SNAP models from round 2. This convergence helps you know that your gene predictor is trained. You may also want to train Augustus and add that to your set of predictors (look for convergence between MAKER, SNAP, and Augustus models to indicate training has worked).
Augustus generally performs better on vertebrates than SNAP. On some vertebrates you actually have to just drop SNAP completely (SNAP runs very poorly on the human genome, for example). On genomes where you drop SNAP you would just use Augustus (look at evidence alignments and convergence between MAKER/SNAP/Augustus models to make that decision).

> 3) It strikes me Roughly 75% of the genome is not being used (my suspicion is that scaffolds are smaller than the maker threshold for consideration). Should I lower this threshold so that more of the genome is considered?

The default threshold for consideration is 1 bp. But when you actually run the predictors you will realize that they cannot physically put a multi-exon gene in contigs below about 10 kb in length. So MAKER will run them, but you just won't get any results.

> 4) I realize that after round 1 I generated the HMM based on all of the genes predicted by the evidence, and it seems some have taken the approach of restricting this list to the "best" genes, or using something like CEGMA. Is my method of HMM construction to blame?

Your HMMs are probably fine (look for a convergence between SNAP raw and MAKER evidence-based models to see if SNAP is behaving well). I think you probably need a better protein database, and perhaps need to improve repeat masking as well (try running RepeatModeler; I can't overstate the importance of this since repeats can essentially break a gene predictor). Try adding Augustus to the analysis. Also, in general, I've found that cufflinks-processed evidence is far too noisy and it adversely affects results of annotation. Try processing the transcript data with Trinity instead (you will get better gene models). I doubt additional training of SNAP is necessary.

> 5) Am I worried about nothing here? Is this a pretty decent annotation?

A reasonable expectation of accuracy for a first draft genome is probably in the upper 70s to high 80s.
Extremely high quality assemblies with lots of good transcript data might break into the 90s. For example, more than 40% of the genes from the original draft of the mouse genome have since been thrown out over time (http://www.biomedcentral.com/1471-2105/10/67 ). The total gene count has remained similar, but those counts are actually based off of new genes in new locations in the genome. Also the honeybee genome recently got major improvements in its annotations (50% increase in gene count) after fixing problems with the original assembly and annotation process (http://www.biomedcentral.com/1471-2164/15/86 ).

--Carson
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From carsonhh at gmail.com Mon Feb 2 09:40:06 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 2 Feb 2015 09:40:06 -0700
Subject: [maker-devel] How to improve the result of Maker
In-Reply-To:
References: <492A6635-67E9-4700-B544-E137C4248E55@gmail.com>
Message-ID: <1F69B446-8899-41BE-BFB8-5DA61BB359A8@gmail.com>

When you add a new exon, Apollo will always recalculate the reading frame to take the longest ORF, so even though the first exon might not be the same, the other exons don't allow for a longer ORF either. So the ORF you got was the longest possible given any combination of all exons (even if the first exon would have been made as UTR). So that confirms my suspicion that that particular exon was ignored because it breaks any possible reading frame. It likely contains an assembly error.
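
To make the reading-frame argument above concrete, here is a toy sketch of a longest-ORF search over spliced exons. It is a simplification under stated assumptions (forward strand only, ATG start, standard stop codons) and is not how Apollo or MAKER implement it; the exon sequences are invented for illustration. It shows that an exon carrying a single-base deletion can leave no complete ORF at all, while the model without that exon still has one, which is why a predictor may drop a well-supported exon.

```python
STOPS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return (start, end) of the longest ATG..stop ORF on the forward
    strand, end exclusive, or (0, 0) if no complete ORF exists."""
    seq = seq.upper()
    best = (0, 0)
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if i + 3 - start > best[1] - best[0]:
                    best = (start, i + 3)
                start = None
    return best


# Hypothetical three-exon model; the middle exon is the one in question.
exon1, exon2, exon3 = "ATGGCC", "GATTGG", "CCCTAA"
exon2_frameshift = "GATGG"  # same exon with one base missing (assembly error)

intact = longest_orf(exon1 + exon2 + exon3)
broken = longest_orf(exon1 + exon2_frameshift + exon3)
skipped = longest_orf(exon1 + exon3)

print("intact exon included:     ", intact)
print("frameshifted exon included:", broken)
print("exon left out:            ", skipped)
```

In this toy case, leaving the frameshifted exon out is the only way to keep a complete ORF, which mirrors the behaviour described in the message.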

--Carson

> On Jan 31, 2015, at 8:54 AM, ??? wrote:
>
>
> There are two possibilities. Given how different the snap and augustus models are from one another, this would suggest they have not been trained appropriately (for example if you are picking another related organism's parameter file rather than training these programs, there are several assumptions that are being made that can actually make such an approach almost worse than just picking a parameter file at random). But more likely the evidence supported exon breaks the reading frame of the model. This usually indicates that you have an assembly error (possibly issues with homopolymers). No amount of evidence support will allow you to call an exon that generates a mis-sense causing frameshift, so the predictors do the next most reasonable thing - they drop the exon if another model is tenable. More concerning would be the mRNA-seq alignments near the 3' end of the gene call. The structure suggests significant capture of background transcription with the mRNA-seq reads (long UTRs with weird mini-introns). I would suggest not using cufflinks in this case. You should probably go with an assembly based approach of mRNA-seq reads instead. I would suggest using trinity. It will reduce sensitivity but greatly increase evidence specificity which is where you need the most improvement based on these images. I would also suggest using the jaccard_clip option with trinity.
>
> I would further suggest looking at the model in question using apollo, and manually adding the exon (click and drag it into the model). You can examine the reading frame after adding the exon and see if it is in fact a frameshift assembly error. If it's a homopolymer derived frameshift, then you can expect a lot more of these throughout your assembly.
>
> I drag the exon into the model, there is a stop codon in it, it causes the region behind it become UTR, here:
>
> the question exon was pointed by red arrow.
> But the uppermost evidence is the completed EST from NCBI, and it contains start and stop codon. Then I noticed the 5' boundary of the 2nd codon in model is not the same as EST, so it makes frameshift, and cause the stop codon in the exon pointed by red arrow. The first exon should not be CDS, as there would be a start codon in 2nd exon if its 5' boundary is predicted correctly. Would "always_complete=1" fix it?
>
> I will try to use trinity.
>
> Also I do not see any protein alignments here? MAKER cannot work on transcript evidence alone. You need to provide the full proteome of at least two other species (they don't have to be that closely related, but closer is better). Protein alignments will also help you better interpret the coding status of exons supported by mRNA-seq. For example in the second image, you would expect protein evidence to support all the coding exons but not the UTR exons which would remove any doubt as to whether an exon is really UTR or not.
>
> I did use 3 sources of protein evidence, one is proteome from related species, and one is proteome from fruitfly, and the last one is Swiss-prot.
>
> Thank you very much!
>
> Best regards,
> Wenbo
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From xvazquezc at gmail.com Mon Feb 2 14:49:02 2015
From: xvazquezc at gmail.com (Xabier Vázquez Campos)
Date: Tue, 3 Feb 2015 08:49:02 +1100
Subject: [maker-devel] genome duplication?
In-Reply-To: <54B3C36E-5537-4C53-9E2C-F9D60914E786@gmail.com>
References: <54B3C36E-5537-4C53-9E2C-F9D60914E786@gmail.com>
Message-ID:

Thanks Carson.
Any suggestion on the size limit to separate the short contigs, e.g. <500 bp?

On 03/02/2015 3:36 AM, "Carson Holt" wrote:

> MAKER requires every gene to have at least some evidence support. This is
> very important for most eukaryotes as false positive predictions will
> dominate what is called by snap/augustus.
However, it is not such a large > problem in fungi because of their high gene density and less frequent > introns. Setting keep_preds=1 will maximize sensitivity at the cost of > specificity (bad idea in most eukaryotes, but not so much in fungi). I > would not be surprised if a bias toward sensitivity is used by most fungi > annotation projects with every gene that can be annotated being annotated > (even if it does increase false positives). It is a tactic that can work > at least in fungi. > > Also if the assembly is fragmented, you will be less likely to have > evidence support for all genes as the evidence alignments will not meet the > % coverage thresholds in the maker_bopts.ctl file. You may want to > separate out your shorter contigs, and annotate them separately with more > relaxed thresholds for pcov_blast=, pid_blast=, ep_score_limit=, and > en_score_limit=. > > ?Carson > > > On Jan 31, 2015, at 4:21 PM, Jason Stajich > wrote: > > Xabier - > FYI - though you probably already compared, those stats are on par with > the Hortaea v1 assembly, (we do have an improved Hortaea assembly now and > genome size is still same range supporting the duplication hypothesis) > Hw version 1 asmbly - > N50 9623; Max 71563 > CEGMA for Hw1 > #Prots %Completeness - #Total Average %Ortho > > Complete 196 79.03 - 498 2.54 81.12 > Partial 228 91.94 - 673 2.95 95.18 > > > Mikael - yes - we should compare notes on the models JGI is calling which > have little support in MAKER - I am not sure if their pipeline runs with > augustus/snap using informant hints though usually they are bringing RNAseq > into the mix - I don't know if your approach for reannotation assembled the > RNAseq and used it as evidence? 
> > We'll be trying to assess some of this when comparisons of proportion of > shared genes in the first 1KFG paper so we may be able to say with more > certainty of these extra predictions whether they are shared more widely > and get a handle on singleton/false positives rates. > > Jason > > Jason Stajich > jason.stajich at gmail.com > > On Sat, Jan 31, 2015 at 12:51 AM, Xabier V?zquez Campos < > xvazquezc at gmail.com> wrote: > >> Thanks Mikael, >> >> This are the assembly stats as taken from abyss-fac, indeed it isn't a >> great N50, but it isn't that bad either >> >> n n:500 n:N50 min N80 N50 N20 E-size >> max sum >> 14277 7099 1185 500 4698 10771 20438 14530 154519 >> 42.68e6 >> >> >> >> 2015-01-31 19:42 GMT+11:00 Mikael Brandstr?m Durling < >> mikael.durling at slu.se>: >> >>> Hi Xabier, >>> >>> 31 jan 2015 kl. 05:48 skrev Xabier V?zquez Campos >> >: >>> >>> Hi all, >>> >>> One of the fungal genomes I'm annotating is relatively shattered (?), >>> with many contigs/scaffolds and based on CEGMA analysis only may indicate a >>> potential widespread duplication of the genome >>> >>> # Statistics of the completeness of the genome based on 248 CEGs >>>> # >>>> #Prots %Completeness - #Total Average %Ortho >>>> >>>> Complete 181 72.98 - 365 2.02 67.40 >>>> Partial 230 92.74 - 528 2.30 77.83 >>>> >>> >>> >>> Judging from these figure, you seem to have a very fragmented >>> assembly? What N50 have you reached? According to my experience, assemblies >>> with an N50 below 5-10 times the average gene length tend to give problems >>> in producing good gene sets. Not to say that the gene sets are unusable, >>> but for comparing e.g. gene complements to other species, it will be hard >>> to draw any conclusions when a high proportion of the genes are incomplete. 
>>> >>> The expected genome size is relatively low (~42 Mb by abyss-fac) in >>> comparison with *Hortaea werneckii* (51.6Mb, 23333 genes), a related >>> fungi with nearly 90% of its genes present in at least two copies. >>> Paper: >>> http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0071328 >>> >>> Now to the Maker part... So, as part of the Maker annotation, I >>> trained SNAP and Augustus, and I generated a specific RepeatModeler >>> library. I recorded the predicted outputs from each Maker run (AED, number >>> of predicted proteins and transcripts...). Both Augustus and SNAP used to >>> give quite high number (~19000 and ~23000 respectively) in comparison with >>> the xxx.all.maker.proteins.fasta (about 13600). So, my first question is, >>> how does maker deal with gene duplications? Or is this just a phenomenon >>> given that there is no support from the protein files provided initially to >>> Maker? I've used 4 different protein files for the annotation, could it be >>> that they weren't the best choices? I picked them from the closest >>> relatives and similar environments >>> >>> >>> Unless you by mistake filter out duplicated gene families as repeats >>> with repeat modeler, maker should not care about duplicated genes. However, >>> maker, without keep_preds=1, reports only genes with some kind of support >>> (be it EST or protein homology). This is rather conservative, but if you >>> enable keep_preds, you will get more genes as you have noted. Just for the >>> sake of comparison, I have reannotad more than ten genomes downloaded from >>> JGI, providing MAKER with similar evidence as JGI, and consistently, MAKER >>> is reporting fewer gene models. I have yet to do a more thorough comparison >>> to tell what genes JGI are reporting that don?t appear in the MAKER >>> annotations. 
>>> >>> >>> So, in my last run I turn the keep_preds=1 and the proteins in the >>> xxx.all.maker.proteins.fasta reached to >>> >>> Last question regarding the protein files. I download the annotated >>> genomes from the JGI and most of them have two annotation folders >>> "All_models,_Filtered_and_Not" and "Filtered_Models___best__". I've been >>> using the protein files found in the later as I expected to have real >>> evidence and a lower chance of being predicting false genes. Am I right? >>> >>> >>> Yes, I would say so. The FilteredModels have passed through their >>> model selection pipeline, while all_models contains models from all >>> predictors, as well as combinations of predictors and EST evidence. >>> >>> Just some 2 cents of observations of mine, >>> cheers, >>> Mikael >>> >>> >>> Thank you in advance, >>> >>> Xabier >>> >>> >>> -- >>> Xabier V?zquez Campos >>> PhD Candidate >>> Water Research Centre >>> School of Civil and Environmental Engineering >>> The University of New South Wales >>> Sydney NSW 2052 AUSTRALIA >>> _______________________________________________ >>> maker-devel mailing list >>> maker-devel at box290.bluehost.com >>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org >>> >>> >>> >> >> >> -- >> Xabier V?zquez Campos >> *PhD Candidate* >> Water Research Centre >> School of Civil and Environmental Engineering >> The University of New South Wales >> Sydney NSW 2052 AUSTRALIA >> >> _______________________________________________ >> maker-devel mailing list >> maker-devel at box290.bluehost.com >> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org >> >> > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > > -------------- next part -------------- An HTML attachment was scrubbed... 
From carsonhh at gmail.com Mon Feb 2 14:50:02 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 2 Feb 2015 14:50:02 -0700
Subject: [maker-devel] genome duplication?
In-Reply-To:
References: <54B3C36E-5537-4C53-9E2C-F9D60914E786@gmail.com>
Message-ID: <441998AE-D660-485F-BAFD-44BD50765156@gmail.com>

Anything less than 10 kb.

--Carson

> On Feb 2, 2015, at 2:49 PM, Xabier Vázquez Campos wrote:
>
> Thanks Carson. Any suggestion on the size limit to separate the short contigs, e.g. <500 bp
>
> On 03/02/2015 3:36 AM, "Carson Holt" wrote:
> MAKER requires every gene to have at least some evidence support. This is very important for most eukaryotes, as false positive predictions will dominate what is called by snap/augustus. However, it is not such a large problem in fungi because of their high gene density and less frequent introns. Setting keep_preds=1 will maximize sensitivity at the cost of specificity (bad idea in most eukaryotes, but not so much in fungi). I would not be surprised if a bias toward sensitivity is used by most fungi annotation projects, with every gene that can be annotated being annotated (even if it does increase false positives). It is a tactic that can work at least in fungi.
>
> Also, if the assembly is fragmented, you will be less likely to have evidence support for all genes, as the evidence alignments will not meet the % coverage thresholds in the maker_bopts.ctl file. You may want to separate out your shorter contigs, and annotate them separately with more relaxed thresholds for pcov_blast=, pid_blast=, ep_score_limit=, and en_score_limit=.
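
The advice above (separate contigs shorter than 10 kb and annotate them in their own run) could be sketched roughly as follows. This is only an illustration in plain Python, not part of MAKER: the function names and the output file names are hypothetical, and the relaxed values for pcov_blast, pid_blast, ep_score_limit and en_score_limit are not given in the thread, so they are not shown here.

```python
"""Split an assembly FASTA into long and short contigs at a length cutoff.

Sketch only: function names and file names are hypothetical, and the
10 kb default simply follows the cutoff suggested in the thread.
"""


def read_fasta(handle):
    """Yield (header, sequence) tuples from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_by_length(records, cutoff=10_000):
    """Return (long, short) record lists; 'short' means length < cutoff."""
    long_recs, short_recs = [], []
    for header, seq in records:
        target = long_recs if len(seq) >= cutoff else short_recs
        target.append((header, seq))
    return long_recs, short_recs


def write_fasta(records, path, width=60):
    """Write records to path, wrapping sequence lines at `width` bases."""
    with open(path, "w") as out:
        for header, seq in records:
            out.write(f">{header}\n")
            for start in range(0, len(seq), width):
                out.write(seq[start:start + width] + "\n")


def split_assembly(genome_path, long_path, short_path, cutoff=10_000):
    """Split one assembly file into two; holds all records in memory."""
    with open(genome_path) as handle:
        long_recs, short_recs = split_by_length(read_fasta(handle), cutoff)
    write_fasta(long_recs, long_path)
    write_fasta(short_recs, short_path)
    return len(long_recs), len(short_recs)
```

A call such as split_assembly("genome.fasta", "contigs_long.fasta", "contigs_short.fasta") would then give two inputs for two separate MAKER runs, with the relaxed thresholds set only in the control files of the short-contig run.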

From jgallant at msu.edu Tue Feb 3 11:13:13 2015
From: jgallant at msu.edu (Jason Gallant)
Date: Tue, 03 Feb 2015 10:13:13 -0800 (PST)
Subject: [maker-devel] Est2Genome Problems
Message-ID: <1422987193321.4df3c9d5@Nodemailer>

Hi Folks,

I've nearly succeeded at getting MAKER to run on AWS. I've been checking the output files, and have noticed that none of my RNAseq data was incorporated in the run. I used Cufflinks to perform alignments of libraries from several tissues, ran the accessory script cufflinks2gff3 for each tissue, then concatenated the resulting gff3 files. I even ran the accessory script gff3merge to check that the resulting file was properly formatted. For options, I set est2genome=1 and est_gff=cufflinks.gff.
I only get protein2genome and repeatmasker evidence in my resulting maker gff3 file, and the genes predicted by these. Is there another option that I need to enable in order to use my est_gff file? I'm trying to get a set of genes to train the predictors for my next step.

Any help would (as always) be greatly appreciated!

Best,
Jason Gallant

--
Dr. Jason R. Gallant
Assistant Professor
Room 38 Natural Sciences
Department of Zoology
Michigan State University
East Lansing, MI 48824
jgallant at msu.edu
office: 517-884-7756

From avhoeck at SCKCEN.BE Thu Feb 5 07:37:27 2015
From: avhoeck at SCKCEN.BE (Van Hoeck Arne)
Date: Thu, 5 Feb 2015 14:37:27 +0000
Subject: [maker-devel] Maker-P with RNAseq (denovo assembly or mapping to reference?)
Message-ID: <9BCA01D5BDC2AF46822CA182B4FBD0DF117763B0@MAILSRV3.sck.be>

Dear,

I have read the manuscript on the MAKER-P tool. I'm finishing the last part of my plant genome assembly, and MAKER-P will be used to determine gene annotations. I have read the wiki tutorials and your manuscript, but one question comes to mind about evidence sources when RNAseq data is available for gene discovery (as we have):

Why do you perform a de novo RNA assembly with individual RNAseq runs? As I understand from your paper ("a tool kit for the rapid creation, ..."), the RNAseq reads were short single reads. This means that it is not easy for de novo RNA assemblers (like trinity) to find high-quality genes, since a lot of reads will be lost because of the low coverage. If 2*200 PE reads were used for RNAseq, this would create more genes.

Therefore, as an alternative: why not map the RNAseq reads to your reference genome? Extract this as a fasta file and cut off all the reads shorter than 150 bp. If you concatenate all the RNAseq data into one file before the mapping, all your expressed RNA is mapped and included in the fasta file that you need for gene discovery.
In my opinion, the low-expressed reads will be lost during the assembly. Or am I overlooking something?
We have sequenced our genome with PE Illumina data. For DE analyses, we ran 48 single-read runs of 50 bp, at lower coverage.
Therefore, is assembling the RNAseq data still better than mapping the data to our reference genome?

PS: can I add a question on the google group? I couldn't start a new topic.

Thanks in advance,
Arne Van Hoeck

Consider the environment before you print
SCK•CEN Disclaimer: http://www.sckcen.be/en/e-mail_disclaimer

From carsonhh at gmail.com Thu Feb 5 09:27:41 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 5 Feb 2015 09:27:41 -0700
Subject: [maker-devel] Maker-P with RNAseq (denovo assembly or mapping to reference?)
In-Reply-To: <9BCA01D5BDC2AF46822CA182B4FBD0DF117763B0@MAILSRV3.sck.be>
References: <9BCA01D5BDC2AF46822CA182B4FBD0DF117763B0@MAILSRV3.sck.be>
Message-ID: <3A5B068C-C6F2-479C-8874-0C0C0F008FBD@gmail.com>

There is no requirement or even specification in MAKER that transcript data be sequenced or processed in any specific way. How this is done is completely dependent on the user and the project (different strategies work better in different organisms). The only requirement for MAKER is that you have some form of transcript evidence. The utility of transcript evidence is primarily for identification of introns and splice sites. If evidence is supplied as fasta sequence, then it will be aligned around splice sites using exonerate (a splice-aware aligner). If the evidence is processed via another tool, you are expected to supply it in GFF3 format with the evidence already aligned around splice sites (examples of this include cufflinks and BLAT). How you generate these FASTA files or GFF3 files is up to you.
Remember that final models are not strictly based on the transcript data (it's just a piece of the puzzle). Rather, final models will be generated from the combination of transcript alignments, protein alignments, and the HMMs from algorithms such as SNAP and Augustus that must have been trained on the species being annotated.

With respect to the most effective methods for processing mRNA-seq data, that depends on many factors. But in general read mapping to the genome will result in greater sensitivity but lower specificity. There will be a lot of false alignments and alignments of background transcription (>98% of the genome is expressed at detectable levels, so it is not just the genes). Assembly-based methods, on the other hand, result in very high specificity but lower sensitivity. Assembly-based methods can also better resolve issues of overlapping UTRs and genes overlapping on opposite strands than the alignment-based methods (trinity's jaccard_clip option, for example). A loss in sensitivity can also be compensated for with matched protein data and well-trained HMMs.

Which analysis methods are more effective will be determined by factors such as gene density, average intron lengths, repeat content, etc. You can try both methods and see which appears to work better for your organism. Both have their trade-offs.

For sequencing, because of the physics involved in RNA folding, fragmentation of the transcriptome improves sensitivity but can also preclude the use of long paired-end reads (this is because fragments end up being too short). Short reads also reduce specificity and can align more randomly to the genome (but the gain in sensitivity from fragmentation usually far outweighs the loss of specificity from the short reads). Larger fragments of RNA, or no fragmentation of the RNA, allow for longer reads and paired-end sequencing but will result in lower sensitivity and recovery of the transcriptome.
This is because RNA folding, which occurs on larger fragments, physically blocks reverse transcriptase. You will also end up recovering mostly the 5' end of genes, because the inhibition of reverse transcriptase results in a 5' bias to the sequencing reaction. The read coverage graphs end up looking like ski slopes, and longer genes will not get coverage on the 3' end.

There are different strategies for transcriptome sequencing and sample preparation to deal with strandedness and sequencing bias. Each has its trade-offs, and what strategy you use depends on what you plan on using the sequenced reads for, as well as the structure of that organism's genome and transcriptome.

--Carson

From carsonhh at gmail.com Thu Feb 5 10:22:12 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 5 Feb 2015 10:22:12 -0700
Subject: [maker-devel] Maker-P with RNAseq (denovo assembly or mapping to reference?)
In-Reply-To: <9BCA01D5BDC2AF46822CA182B4FBD0DF11776439@MAILSRV3.sck.be>
References: <9BCA01D5BDC2AF46822CA182B4FBD0DF117763B0@MAILSRV3.sck.be> <3A5B068C-C6F2-479C-8874-0C0C0F008FBD@gmail.com> <9BCA01D5BDC2AF46822CA182B4FBD0DF11776439@MAILSRV3.sck.be>
Message-ID: <19D25327-4D46-44B4-854B-1BEEFBD23C06@gmail.com>

I find that erring on the side of specificity works better for most annotation projects. But this is not always true, and you can try a few large contigs using an alignment approach like cufflinks and compare it to an assembly approach like trinity to decide which appears to perform better. Also, you need to take into account the ultimate goal of the project. Some projects want to annotate absolutely everything and don't care about false positives, while others want to maximize specificity and care more about not having bad models.
Often this has to do with some planned downstream experiment that would be adversely affected by one or the other. I tend to prefer high specificity because MAKER's automated approach to re-annotation means that if evidence ever presents itself later on that a real gene is missing, then that evidence automatically supports inclusion of the gene in the next automated release of the genome. But false models tend to persist and are harder to get rid of, even though they lack any evidence support. These false models produced by sensitivity-focused approaches then tend to poison downstream experiments and lead to more time being wasted by researchers. This is seen a lot in plant genomes, where transposons and pseudogenes tend to pollute genome releases for historical reasons. Basically, once they are in the genome release, the burden of proof for removing them becomes higher than if they were never included in the first place. Researchers unaware of this may find they have been studying a transposon for weeks or months because some expression or variant analysis early on listed it as a candidate gene for some desired phenotype.

MAKER can handle several hundred thousand contigs in the assembly, but in general contigs smaller than 10 kb will not be annotatable (although smaller contigs can be used for gene-dense organisms with short introns). It is better to exclude these short contigs from the analysis for processing efficiency.

--Carson

> On Feb 5, 2015, at 9:52 AM, Van Hoeck Arne wrote:
>
> Thanks for this comprehensive and clear answer, Carson.
>
> So in conclusion, it's better to make a concise file with very accurate transcripts (assembly method) instead of a large file of possible transcripts (mapping RNAseq data to the reference), which contains more false positives.
> > Another small question, can MAKER handle a lot of contigs (around 10.000) or is it better to make artificial chromosomes by pasting contigs to each other with an certain number N?s (let s say 1000 > exon length). > > Thanks a lot for your quick response > Arne > From: Carson Holt [mailto:carsonhh at gmail.com ] > Sent: donderdag 5 februari 2015 17:28 > To: Van Hoeck Arne > Cc: maker-devel at yandell-lab.org > Subject: Re: [maker-devel] Maker-P with RNAseq (denovo assembly or mapping to reference?) > > There is no requirement or even specification in MAKER that transcript data be sequenced or processed in any specific way. How this is done is completely dependent on the user and the project (different strategies work better in different organisms). The only requirement for MAKER is that you have some form of transcript evidence. The utility of transcript evidence is primarily for identification of introns and splice sites. If evidence is supplied as fasta sequence, then it will be aligned around splice sites using exonerate (a splice aware aligner). If the evidence is processed via another tool, you are expected to supply it in GFF3 format with the evidence already aligned around splice sites (examples of this include cufflinks and BLAT). How you generate these FASTA files or GFF3 files is up to you. Remember that final models are not strictly based on the transcript data (It?s just of piece of the puzzle). Rather final models will be generated from the combination of transcript alignments, protein alignments, and the HMMs from algorithms such as SNAP and Augustus that must have been trained on the species being annotated. > > With respect to most effective methods from processing mRNA-seq data, that depends on many factors. But in general read mapping to the genome will result in greater sensitivity but lower specificity. 
There will be a lot of false alignments and alignments of background transcription (>98% of the genome is expressed at detectable levels, so it is not just the genes). Assembly based methods on the other hand result in very high specificity, but lower sensitivity. Assembly based methods can also better resolve issues of overlapping UTR and genes overlapping on opposite strand than the alignment based methods (trinity?s jaccard_clip option for example). A loss in sensitivity can also be compensated for with matched protein data and well trained HMMs. > > What analysis methods are are more effective will be determined by factors such as gene density, average intron lengths, repeat content, etc. You can try both methods and see which appears to work better for your organism. Both have their trade offs. > > For sequencing, because of the physics involved in RNA folding, fragmentation of the transcriptome improves sensitivity but can also precludes the use of long paired end reads (this is because fragments end up being too short). Short reads also reduce specificity and can align more randomly to the genome (but the gain in sensitivity from fragmentation usually far outweighs the loss of specificity from the short reads). Larger fragments of RNA or no fragmentation of the RNA allows for longer reads and paired end sequencing but will result in lower sensitivity and recovery of the transcriptome. This is because RNA folding which occurs on larger fragments physically blocks reverse transcriptase. You will also end up recovering mostly the 5? end of genes because the inhibition of reverse transcriptase results in a 5? bias to the sequencing reaction. The read coverage graphs end up looking like ski slopes, and longer genes will not get coverage on the 3? end. > > There are different strategies for transcriptome sequencing and sample preparation to deal with strandedness and sequencing bias. 
Each has it?s trade off, and what strategy you use depends on what you plan on using the sequenced reads for as well as the structure of that organisms genome and trascriptome. > > ?Carson > > > On Feb 5, 2015, at 7:37 AM, Van Hoeck Arne > wrote: > > Dear, > > I have read the manuscript on the MAKER-P tool. I?m finishing the last part of my plant genome assembly and MAKER-P will be used for determine gene annotations. I have read the wiki tutorials and your manuscript but one thing comes to my mind on evidence sources when RNAseq data is available for gene discovery (like we have). : > > Why do you perform a RNA denovo assembly with individual RNAseq runs. As I could understand from your paper (a tool kit for the raped creation,?), the RNAseq reads were short single reads. This includes that it is not easy for denovo RNA assemblers (like trinity) to find high quality genes since a lot of reads will be lost because of the low coverage. If 2*200 PE reads were used for RNAseq, this would create more genes. > > Therefore, as an alternative: why not mapping the RNAseq reads to your reference genome. Extract this as a fasta file and cutoff all the reads shorter than 150 bp long. If you concentate all the RNAseq data to one file before the mapping, all your expressed RNA is mapped and included in the fasta file that you need for gene discovery. > > From my opinion, the low-expressed reads will be lost during the assembly. Or am I overlooking something? > We have sequenced our genome with PE illumina data. For DE analyses, we did run 48 single reads 50 bp , lower coverage. > Therefore, is assembling the RNAseq data still better than mapping the data to our reference genome? > > PS: can I add a question on the google group? I couldn?t start a new topic > > Thanks in advance, > Arne Van Hoeck > > > > > Consider the environment before you print > Denk aan het milieu voor u deze e-mail print > Pensez ? 
> l'environnement avant d'imprimer
>
>
> SCK•CEN Disclaimer: http://www.sckcen.be/en/e-mail_disclaimer
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From chenwenbo1020 at gmail.com Mon Feb 9 16:20:34 2015
From: chenwenbo1020 at gmail.com (陈文博)
Date: Mon, 9 Feb 2015 18:20:34 -0500
Subject: [maker-devel] How to set "calculate longest ORF" in Maker
Message-ID:

Greetings,

I noticed some cases in the output of MAKER where the ORF is not the longest one, e.g. the one below:

[image: inline image 1]

If I manually correct it in Apollo with "calculate longest ORF", then it becomes:

[image: inline image 2]

I thought the updated one should make more sense. So how can I get MAKER to output the longest ORF automatically?

Thanks very much!

Wenbo

From dence at genetics.utah.edu Mon Feb 9 17:14:45 2015
From: dence at genetics.utah.edu (Daniel Ence)
Date: Tue, 10 Feb 2015 00:14:45 +0000
Subject: [maker-devel] How to set "calculate longest ORF" in Maker
In-Reply-To:
References:
Message-ID:

Hi,

In the images that you sent, it looks like the ab-initio predictor had predicted two ORFs, while the evidence supported a single model. MAKER doesn't have an option to prefer longer models; its metric is to choose the prediction that is best supported by the alignment evidence.

How many ab-initio predictors did you use in generating the results that you sent us?
It looks like you only used one, which won't give good results.

~Daniel

From chenwenbo1020 at gmail.com Mon Feb 9 19:06:31 2015
From: chenwenbo1020 at gmail.com (陈文博)
Date: Mon, 9 Feb 2015 21:06:31 -0500
Subject: [maker-devel] How to set "calculate longest ORF" in Maker
In-Reply-To:
References:
Message-ID:

Hi Daniel,

Thank you very much for the suggestion.

I used three predictors: SNAP, Augustus, and pred_gff. From the image, I thought the Augustus prediction (shown in pink) matched the evidence better. Why did MAKER choose SNAP's?

[image: inline image 1]

Thanks again
Best regards,
Wenbo

2015-02-09 19:14 GMT-05:00 Daniel Ence:

> Hi, In the images that you sent, it looks like the ab-initio predictor had
> predicted two ORFs, while the evidence supported a single model. MAKER
> doesn't have an option to prefer longer models; its metric is to choose
> the prediction that is best supported by the alignment evidence.
>
> How many ab-initio predictors did you use in generating the results that
> you sent us? It looks like you only used one, which won't give good results.
>
> ~Daniel
>
> > On Feb 9, 2015, at 4:20 PM, 陈文博 wrote:
> >
> > Greetings,
> >
> > I noticed some cases in the output of MAKER where the ORF is not the longest one,
> > e.g.
> > the one below
> >
> >
> > If I manually correct it in Apollo with "calculate longest ORF", then it becomes
> >
> > I thought the updated one should make more sense. So how can I get MAKER to output the longest ORF automatically?
> >
> > Thanks very much!
> >
> > Wenbo

From carsonhh at gmail.com Mon Feb 9 19:22:46 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 9 Feb 2015 19:22:46 -0700
Subject: [maker-devel] How to set "calculate longest ORF" in Maker
In-Reply-To:
References:
Message-ID: <669148EB-B537-4004-B323-D119B1056269@gmail.com>

The gene model from MAKER is restricted to the reading frame of the ab initio predictor. The better model would use a different reading frame. The Augustus model has a missing exon, so it gets a lower score. SNAP in general just looks bad. I'd say it needs to be retrained, or maybe just drop SNAP from the analysis.

--Carson

Sent from my iPhone

> On Feb 9, 2015, at 7:06 PM, 陈文博 wrote:
>
> Hi Daniel,
>
> Thank you very much for the suggestion.
>
> I used three predictors: SNAP, Augustus, and pred_gff. From the image, I thought the Augustus prediction (shown in pink) matched the evidence better. Why did MAKER choose SNAP's?
>
>
>
> Thanks again
> Best regards,
> Wenbo
>
> 2015-02-09 19:14 GMT-05:00 Daniel Ence:
>> Hi, In the images that you sent, it looks like the ab-initio predictor had predicted two ORFs, while the evidence supported a single model. MAKER doesn't have an option to prefer longer models; its metric is to choose the prediction that is best supported by the alignment evidence.
From carsonhh at gmail.com Tue Feb 10 09:56:53 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 10 Feb 2015 09:56:53 -0700
Subject: [maker-devel] Est2Genome Problems
In-Reply-To: <1422987193321.4df3c9d5@Nodemailer>
References: <1422987193321.4df3c9d5@Nodemailer>
Message-ID: <119684F8-8071-4318-A129-3D90EC54242A@gmail.com>

I ran a few est2genome runs with a Cufflinks file I just generated and did not get any issues for EST-based gene models. I'd like to at least have your test set to see if I can duplicate what you are seeing. Use this to upload the job files, then I can just run it from my server here --> http://weatherby.genetics.utah.edu/cgi-bin/mwas/bug.cgi

--Carson

> On Feb 3, 2015, at 11:13 AM, Jason Gallant wrote:
>
> Hi Folks,
>
> I've nearly succeeded at getting MAKER to run on AWS... I've been checking the output files, and have noticed that none of my RNAseq data was incorporated in the run.
> I used Cufflinks to perform alignments of libraries from several tissues, ran the accessory script cufflinks2gff3 for each tissue, then concatenated the resulting GFF3 files. I even ran the accessory script gff3_merge to check that the resulting file was properly formatted.
>
> For options, I set est2genome=1 and est_gff=cufflinks.gff. I only get protein2genome and repeatmasker evidence in my resulting MAKER GFF3 file, and the genes predicted by these. Is there another option that I need to enable in order to use my est_gff file? I'm trying to get a set of genes to train the predictors for my next step.
>
> Any help would (as always) be greatly appreciated!
>
> Best,
> Jason Gallant
>
> --
> Dr. Jason R. Gallant
> Assistant Professor
> Room 38 Natural Sciences
> Department of Zoology
> Michigan State University
> East Lansing, MI 48824
> jgallant at msu.edu
> office: 517-884-7756

From jason.r.gallant at gmail.com Tue Feb 10 12:54:46 2015
From: jason.r.gallant at gmail.com (Jason Gallant)
Date: Tue, 10 Feb 2015 11:54:46 -0800 (PST)
Subject: [maker-devel] Using MAKER MPI on Amazon Cloud
Message-ID: <1423598085704.ad38b0a2@Nodemailer>

Hi Folks,

My experiments with AWS and MAKER have proven successful. I've been working on documenting the experiment on my blog, which is now published here:

http://efish.zoology.msu.edu/running-maker-genome-annotation-on-starcluster/

Please let me know if you find this useful or have any questions or feedback!

Best Regards,
Jason Gallant

--
Dr. Jason R.
Gallant
Assistant Professor
Room 38 Natural Sciences
Department of Zoology
Michigan State University
East Lansing, MI 48824
jgallant at msu.edu
office: 517-884-7756

From jgallant at msu.edu Tue Feb 10 13:03:40 2015
From: jgallant at msu.edu (Jason Gallant)
Date: Tue, 10 Feb 2015 12:03:40 -0800 (PST)
Subject: [maker-devel] Using MAKER MPI on Amazon Cloud
Message-ID: <1423598620212.6519c2e@Nodemailer>

Hi Folks,

My experiments with AWS and MAKER have proven successful. I've been working on documenting the experiment on my blog, which is now published here:

http://efish.zoology.msu.edu/running-maker-genome-annotation-on-starcluster/

Please let me know if you find this useful or have any questions or feedback!

Best Regards,
Jason Gallant

--
Dr. Jason R. Gallant
Assistant Professor
Room 38 Natural Sciences
Department of Zoology
Michigan State University
East Lansing, MI 48824
jgallant at msu.edu
office: 517-884-7756

From carsonhh at gmail.com Tue Feb 10 13:04:15 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 10 Feb 2015 13:04:15 -0700
Subject: [maker-devel] Using MAKER MPI on Amazon Cloud
In-Reply-To: <1423598085704.ad38b0a2@Nodemailer>
References: <1423598085704.ad38b0a2@Nodemailer>
Message-ID: <8D30117C-88DF-4170-9CD8-590AAB79594D@gmail.com>

This is awesome. Thanks for going through all the pain of figuring that out. I am definitely going to have to try annotating something through AWS now, just to see how it compares to running on a local cluster.

--Carson

> On Feb 10, 2015, at 12:54 PM, Jason Gallant wrote:
>
> Hi Folks,
>
> My experiments with AWS and MAKER have proven successful.
From bmoore at genetics.utah.edu Tue Feb 10 20:37:51 2015
From: bmoore at genetics.utah.edu (Barry Moore)
Date: Wed, 11 Feb 2015 03:37:51 +0000
Subject: [maker-devel] Using MAKER MPI on Amazon Cloud
In-Reply-To: <1423598085704.ad38b0a2@Nodemailer>
References: <1423598085704.ad38b0a2@Nodemailer>
Message-ID: <695E9440-08A7-4A68-8414-3AD140929635@genetics.utah.edu>

This is very cool, Jason - thanks so much for this outstanding contribution to the annotation community! I've added a prominent link to your tutorial on the MAKER wiki front page.

B

> On Feb 10, 2015, at 12:54 PM, Jason Gallant wrote:
>
> Hi Folks,
>
> My experiments with AWS and MAKER have proven successful. I've been working on documenting the experiment on my blog, which is now published here:
>
> http://efish.zoology.msu.edu/running-maker-genome-annotation-on-starcluster/
>
> Please let me know if you find this useful or have any questions or feedback!
>
> Best Regards,
> Jason Gallant
>
> --
> Dr. Jason R.
> Gallant
> Assistant Professor
> Room 38 Natural Sciences
> Department of Zoology
> Michigan State University
> East Lansing, MI 48824
> jgallant at msu.edu
> office: 517-884-7756

From myandell at genetics.utah.edu Tue Feb 10 21:08:25 2015
From: myandell at genetics.utah.edu (Mark Yandell)
Date: Wed, 11 Feb 2015 04:08:25 +0000
Subject: [maker-devel] Using MAKER MPI on Amazon Cloud
In-Reply-To: <695E9440-08A7-4A68-8414-3AD140929635@genetics.utah.edu>
References: <1423598085704.ad38b0a2@Nodemailer>, <695E9440-08A7-4A68-8414-3AD140929635@genetics.utah.edu>
Message-ID: <7A60AB257EFF2B48B1F4C814817EA053E372B1C5@mxb2.hg.genetics.utah.edu>

Thanks so much Jason. Very informative and helpful for everyone.

Cheers!

--mark

Mark Yandell
Professor of Human Genetics
H.A. & Edna Benning Presidential Endowed Chair
Co-director USTAR Center for Genetic Discovery
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330
ph:801-587-7707
________________________________________
From: Barry Moore
Sent: Tuesday, February 10, 2015 8:37 PM
To: Jason Gallant; Mark Yandell
Cc: maker-devel at yandell-lab.org
Subject: Re: [maker-devel] Using MAKER MPI on Amazon Cloud

This is very cool, Jason - thanks so much for this outstanding contribution to the annotation community! I've added a prominent link to your tutorial on the MAKER wiki front page.

B

> On Feb 10, 2015, at 12:54 PM, Jason Gallant wrote:
>
> Hi Folks,
>
> My experiments with AWS and MAKER have proven successful. I've been working on documenting the experiment on my blog, which is now published here:
>
> http://efish.zoology.msu.edu/running-maker-genome-annotation-on-starcluster/
>
> Please let me know if you find this useful or have any questions or feedback!
From scott at scottcain.net Fri Feb 13 12:53:09 2015
From: scott at scottcain.net (Scott Cain)
Date: Fri, 13 Feb 2015 14:53:09 -0500
Subject: [maker-devel] Maker tutorial
In-Reply-To:
References:
Message-ID:

Hi Won,

I'm cc'ing your email to the MAKER mailing list; I'm sure they'll be able to help you find the file. If you want the AMI, I'm reasonably sure the one you're looking for is ami-907e97f8 in US-east.

Scott

On Fri, Feb 13, 2015 at 2:43 PM, Won C Yim wrote:
>
> To whom it may concern,
>
> Hello!
>
> My name is Won Cheol Yim, a postdoc at the University of Nevada, Reno.
>
> I tried to find the maker_tutorial files, but I couldn't.
>
> Here is the web site:
> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014
>
> I just want to get the maker_tutorial folder.
>
> I tried to connect to Amazon EC2, but there's no AMI.
>
> Thank you for your help.
>
> Won
> --
> Yim, Won Cheol
>
> MS330/Department of Biochemistry & Molecular Biology
>
> 1664 N. Virginia Street
>
> University of Nevada, Reno
>
> email: wyim at unr.edu

--
------------------------------------------------------------------------
Scott Cain, Ph. D. scott at scottcain dot net
GMOD Coordinator (http://gmod.org/) 216-392-3087
Ontario Institute for Cancer Research

From carsonhh at gmail.com Fri Feb 13 16:38:15 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 13 Feb 2015 16:38:15 -0700
Subject: [maker-devel] Maker tutorial
In-Reply-To:
References:
Message-ID:

Yes.
You have to go to the EC2 management console (US East) and search for the AMI --> https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#Images:

Change the search options from "Owned by me" to "Public images" before you do the search. Then search for ami-907e97f8

You can see this on the GMOD MAKER course video, where I do this at about the 58-minute mark --> http://youtu.be/uA96tSSaqLk

Make sure to increase the resolution to 1080p on the video.

--Carson
From kai.kamm at ecolevol.de Wed Feb 18 08:30:23 2015
From: kai.kamm at ecolevol.de (Kai Kamm)
Date: Wed, 18 Feb 2015 16:30:23 +0100
Subject: [maker-devel] Improving gene prediction with Augustus and SNAP
Message-ID:

An HTML attachment was scrubbed...
URL:

From carsonhh at gmail.com Thu Feb 19 12:28:22 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 19 Feb 2015 12:28:22 -0700
Subject: [maker-devel] Improving gene prediction with Augustus and SNAP
In-Reply-To:
References:
Message-ID:

I would recommend just using the Trinity assembly. The Cufflinks results tend to be messy.

You shouldn't need the est2genome or protein2genome results if you already trained using CEGMA results. You can then do one MAKER run (it can be on just part of the genome) where you use both SNAP and Augustus as the predictors (est2genome and protein2genome should be turned off), and then give these results back to SNAP to train with again. This second round of bootstrap training is usually beneficial to SNAP (going beyond two rounds doesn't really help). Also, don't concatenate with previous training sets for the second round of bootstrap training. The idea is that the second round of training genes will be more correct than the first round, so you want to use them instead.

When you are done, look at one of the larger contigs in a viewer like Apollo and compare the raw Augustus calls, the raw SNAP calls, and the evidence-aware Augustus and SNAP calls produced by MAKER. If SNAP and Augustus are properly trained, then they will produce similar calls, and they will also be similar to the evidence-aware calls from MAKER (this convergence is the result of the training). If one predictor seems to produce calls that are still very divergent, then just drop that predictor from the analysis. A bad predictor will make all results worse.
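A rough way to put a number on the convergence described above, as a complement to (not a replacement for) looking at the models in Apollo, is the base-level overlap between the exons called by two sources on the same contig and strand. The following is a minimal Python sketch under stated assumptions: it is not part of MAKER, the function names are made up for illustration, and the exon coordinates are assumed to have already been extracted from the GFF3 as 1-based inclusive (start, end) pairs.

```python
def merge(intervals):
    """Merge overlapping or directly adjacent 1-based inclusive intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            # Extends (or touches) the previous interval: grow it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def covered(intervals):
    """Number of distinct bases covered by a set of intervals."""
    return sum(end - start + 1 for start, end in merge(intervals))


def exon_jaccard(exons_a, exons_b):
    """Base-level Jaccard index between two exon sets (1.0 = identical)."""
    union = covered(list(exons_a) + list(exons_b))
    if union == 0:
        return 1.0  # both sets empty: trivially identical
    # |A and B| = |A| + |B| - |A or B|
    shared = covered(exons_a) + covered(exons_b) - union
    return shared / union


if __name__ == "__main__":
    # Hypothetical exon calls for one gene from two predictors.
    snap = [(1, 100), (201, 300)]
    augustus = [(1, 100), (251, 350)]
    print(round(exon_jaccard(snap, augustus), 3))
```

Values near 1 for raw SNAP versus raw Augustus versus the MAKER models would be consistent with the predictors having converged; a predictor that stays far below the others on most contigs is a candidate for dropping. The threshold for "far below" is a judgment call and is not defined by this sketch.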
--Carson

> On Feb 18, 2015, at 8:30 AM, Kai Kamm wrote:
>
> Hello
> I have just started in this field of research and I want to annotate my assembled non-bilaterian invertebrate genome with MAKER (100 Mb in 7000 scaffolds).
>
> I have read the MAKER tutorials, but I am still a little uncertain about the iterative procedure. What I have already done is:
>
> - trained Augustus (using the web service) on the reference genome of a closely related species and its published dataset of "best transcripts", which are mainly based on gene prediction and some EST evidence. The published ESTs themselves were rejected by Augustus as not sufficient for training (too few long transcripts).
> - trained SNAP with the CEGMA output of my genome
> - assembled RNA-seq data with tophat/cufflinks and generated a GFF file with cufflinks2gff
> - de novo assembled RNA-seq data with Trinity
>
> I have already done some preliminary MAKER runs with the initially trained Augustus and SNAP and some protein evidence, which had good results.
>
> Now my strategy is:
>
> running MAKER with
> - the est2genome option, using the Cufflinks GFF and the Trinity transcripts as EST evidence
>
> - the protein2genome option, using a protein file including all proteins of the closely related species, a less related non-bilaterian species, and a collection of reviewed Swiss-Prot entries from one representative mammal and all protostomes
>
> - Augustus and SNAP for gene prediction
>
> When this is done I want to:
>
> - create a 2nd training set for SNAP from the merged GFFs with maker2zff
> - train Augustus again with the MAKER transcripts using the Augustus web service
>
> And run MAKER again
>
> Is this a reasonable procedure? Or am I missing some important aspects here?
> Thanks in advance
From marc.hoeppner at imbim.uu.se Fri Feb 20 07:53:38 2015
From: marc.hoeppner at imbim.uu.se (Marc Höppner)
Date: Fri, 20 Feb 2015 14:53:38 +0000
Subject: [maker-devel] Single exon EST and UTR annotation?
Message-ID: <6F6BEE43-BA1D-4849-9DC7-7DC74984939E@imbim.uu.se>

Hi,

we are currently annotating a fungus with essentially no introns. Training of Augustus was performed on an "evidence" build, and the resulting performance of the abinit profile was very good overall. But it seems that MAKER never makes UTRs for these single-exon genes, despite setting:

single_exon=1
single_length=100

I can see in the resulting GFF files that the Cufflinks transcripts were processed, and visual inspection of the entire data set in WebApollo suggests that we should see UTRs for most genes. Yet there are none.

(We have used MAKER before, so basic usage is familiar, and we have never seen this issue until this project.)

MAKER version is 2.31-8

Is it verified that the single_exon option works? Are there any other not-so-obvious reasons for the behaviour we see here?

Cheers,

Marc

Marc P. Hoeppner, PhD
Team Leader
Department for Medical Biochemistry and Microbiology
Uppsala University, Sweden
marc.hoeppner at imbim.uu.se

From dence at genetics.utah.edu Fri Feb 20 11:01:06 2015
From: dence at genetics.utah.edu (Daniel Ence)
Date: Fri, 20 Feb 2015 18:01:06 +0000
Subject: [maker-devel] Single exon EST and UTR annotation?
In-Reply-To: <6F6BEE43-BA1D-4849-9DC7-7DC74984939E@imbim.uu.se>
References: <6F6BEE43-BA1D-4849-9DC7-7DC74984939E@imbim.uu.se>
Message-ID:

Hi Marc,

Does MAKER annotate the single-exon genes and miss the UTRs, or is it missing the single-exon genes entirely?
There's another option in the control file called "correct_est_fusion". This option was added because of difficulties we encountered in a fungal genome annotation project where the genome had lots of overlapping UTRs in genes. The overlapping UTRs resulted in lots of fused genes, so the solution was to add this option to limit the use of ESTs in annotating genes. Basically, if "correct_est_fusion" is turned on, MAKER won't annotate UTRs that would result in fused gene models.

Are these single-exon genes really close together? That and the setting that I discussed above could explain the lack of UTRs.

~Daniel

> On Feb 20, 2015, at 7:53 AM, Marc Höppner wrote:
>
> Hi,
>
> we are currently annotating a fungus with essentially no introns. Training of Augustus was performed on an "evidence" build, and the resulting performance of the abinit profile was very good overall. But it seems that MAKER never makes UTRs for these single-exon genes, despite setting:
>
> single_exon=1
> single_length=100
>
> I can see in the resulting GFF files that the Cufflinks transcripts were processed, and visual inspection of the entire data set in WebApollo suggests that we should see UTRs for most genes. Yet there are none.
>
> (We have used MAKER before, so basic usage is familiar, and we have never seen this issue until this project.)
>
> MAKER version is 2.31-8
>
> Is it verified that the single_exon option works? Are there any other not-so-obvious reasons for the behaviour we see here?
>
> Cheers,
>
> Marc
>
> Marc P.
> Hoeppner, PhD
> Team Leader
> Department for Medical Biochemistry and Microbiology
> Uppsala University, Sweden
> marc.hoeppner at imbim.uu.se

From carsonhh at gmail.com Fri Feb 20 11:07:28 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 20 Feb 2015 11:07:28 -0700
Subject: [maker-devel] Single exon EST and UTR annotation?
In-Reply-To:
References: <6F6BEE43-BA1D-4849-9DC7-7DC74984939E@imbim.uu.se>
Message-ID: <1DB3F36E-CAAE-4394-B6B6-53009E9566B4@gmail.com>

Actually, the issue is that single-exon genes have a higher threshold to meet to get UTR. Also, MAKER will never add spliced UTR to a single-exon gene, so an EST would also have to be single-exon and encompass the entire single-exon gene to get UTR. This is done because EST and mRNA-seq data in general are noisy enough that you would get mostly false UTR annotations otherwise. So it is an overly conservative approach because it's the best of all the bad options. For spliced genes, the splice site can be used to confirm concordance of the UTR with the gene, but that can't be done with single-exon calls.

--Carson

> On Feb 20, 2015, at 11:01 AM, Daniel Ence wrote:
>
> Hi Marc,
>
> Does MAKER annotate the single-exon genes and miss the UTRs, or is it missing the single-exon genes entirely?
>
> There's another option in the control file called "correct_est_fusion". This option was added because of difficulties we encountered in a fungal genome annotation project where the genome had lots of overlapping UTRs in genes. The overlapping UTRs resulted in lots of fused genes, so the solution was to add this option to limit the use of ESTs in annotating genes. Basically, if "correct_est_fusion" is turned on, MAKER won't annotate UTRs that would result in fused gene models.
>
> Are these single-exon genes really close together?
From jgallant at msu.edu Wed Feb 25 09:43:21 2015
From: jgallant at msu.edu (Jason Gallant)
Date: Wed, 25 Feb 2015 08:43:21 -0800 (PST)
Subject: [maker-devel] Evaluating Genome Annotation
Message-ID: <1424882600861.a6109243@Nodemailer>

Hi Folks,

I'm in the process of evaluating the genome annotation that I produced using AWS (see earlier message). This is a de novo genome assembly, for which there is no closely related species.
As such, I followed the standard procedure using only SNAP (for starters).

genome = 4,668 scaffolds with N50 > 1.7 Mb
custom RepeatMasker database

Round 1:
est2genome=1
protein2genome=1
Concatenated RefSeq proteins from 3 vertebrates
Cufflinks assembly of 12 tissues (~253,000,000 reads)

Using this entire set of genes, I created a SNAP HMM, following the online tutorial, and ran a second round of MAKER:

Round 2:
est2genome=0
protein2genome=0
snaphmm=round1.hmm
Concatenated RefSeq proteins from 3 vertebrates
Cufflinks assembly of 12 tissues (~253,000,000 reads)

I used the resulting genes to train a second SNAP HMM, as suggested by the tutorial, and ran a third round of MAKER:

Round 3:
est2genome=0
protein2genome=0
snaphmm=round2.hmm
Concatenated RefSeq proteins from 3 vertebrates
Cufflinks assembly of 12 tissues (~253,000,000 reads)

I'm concerned that the multiple iterations did not really improve my annotation. Here are some of the metrics that I've been able to calculate thus far:

Using Fathom:
Round 1 contains 44,883 genes (43,364 multi-exon) over 1410 sequences
Round 2 contains 15,946 genes (15,812 multi-exon) over 1166 sequences
Round 3 contains 15,514 genes (15,389 multi-exon) over 1147 sequences

Using the AED_cdf_generator.pl script, I was able to calculate the cumulative AED scores for each round. I suspect that for the first round this distribution is meaningless, since the gene models are calculated directly from evidence. Interestingly, rounds 2 and 3 had remarkably similar AED scores throughout the table: in round 2, 92% of my genes had an AED score of 0.5 or lower, whereas in round 3, 91.9% had an AED score of 0.5 or lower.
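For readers who want to check figures like the ones above without AED_cdf_generator.pl, the cumulative fraction at a given cutoff is straightforward to compute. The following is a minimal Python sketch, not the script used in this message. It assumes that each mRNA feature in the MAKER GFF3 output carries an `_AED=` attribute in column 9; the function names and the example records are hypothetical.

```python
import re

# Match "_AED=<number>" as its own attribute, so that "_eAED=" is not picked up.
AED_RE = re.compile(r"(?:^|;)_AED=([0-9.]+)")


def aed_values(gff_lines):
    """Collect the _AED value of every mRNA feature in GFF3 text lines."""
    values = []
    for line in gff_lines:
        if line.startswith("#"):
            continue  # skip directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue  # not a feature line, or not a transcript
        match = AED_RE.search(cols[8])
        if match:
            values.append(float(match.group(1)))
    return values


def fraction_at_or_below(values, cutoff):
    """Fraction of transcripts whose AED is <= cutoff (0.0 if none found)."""
    if not values:
        return 0.0
    return sum(v <= cutoff for v in values) / len(values)


if __name__ == "__main__":
    # Two made-up transcripts; a real run would read the merged GFF3 file.
    demo = [
        "##gff-version 3",
        "s1\tmaker\tmRNA\t1\t900\t.\t+\t.\tID=t1;Parent=g1;_AED=0.10;_eAED=0.12",
        "s1\tmaker\tmRNA\t2000\t2900\t.\t-\t.\tID=t2;Parent=g2;_AED=0.60;_eAED=0.61",
    ]
    print(fraction_at_or_below(aed_values(demo), 0.5))
```

Running the same function over the round 2 and round 3 GFF3 files at several cutoffs (0.25, 0.5, 0.75) gives a small table that can be compared row by row.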
Here are my questions:
1) I suppose I would have expected more genes predicted; instead, each iteration seems to produce fewer genes (although there is only a very slight difference between rounds 2 & 3).
2) AED distributions are identical between rounds 2 & 3, though from the tutorials and my readings it sounds like there should be a vast improvement.
3) It strikes me that roughly 75% of the genome is not being used (my suspicion is that scaffolds are smaller than the MAKER threshold for consideration). Should I lower this threshold so that more of the genome is considered?
4) I realize that after round 1 I generated the HMM based on all of the genes predicted by the evidence, and it seems some have taken the approach of restricting this list to the "best" genes, or using something like CEGMA. Is my method of HMM construction to blame?
5) Am I worried about nothing here? Is this a pretty decent annotation?

Thanks for any input you folks are able to provide! Happy annotating!

Jason Gallant

--
Dr. Jason R. Gallant
Assistant Professor
Room 38 Natural Sciences
Department of Zoology
Michigan State University
East Lansing, MI 48824
jgallant at msu.edu
office: 517-884-7756

From carsonhh at gmail.com Wed Feb 25 10:25:30 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 25 Feb 2015 10:25:30 -0700
Subject: [maker-devel] Evaluating Genome Annotation
In-Reply-To: <1424882600861.a6109243@Nodemailer>
References: <1424882600861.a6109243@Nodemailer>
Message-ID:

> Here are my questions:
> 1) I suppose I would have expected more genes predicted; instead, each iteration seems to produce fewer genes (although there is only a very slight difference between rounds 2 & 3).

Your first round should over-predict, especially if it is based on Cufflinks results (very noisy).
Your second and third rounds look about right for many organisms (both should be similar in gene count), but if you believe the count is low for yours, then run CEGMA to estimate your genome's completeness (i.e. if your genome is 85% complete, then you expect your final number from MAKER to represent about 85% of the true number of genes). Also, you may want to increase your protein database. If the RefSeq genes you are using represent just a subset of the 3 vertebrate genomes rather than the whole genomes of those organisms, then you will want to get a couple of full genomes to work with. Also, not having a high completion level genome in vertebrates is not out of the ordinary. In lamprey (an extreme case) the low completion level actually led to the discovery that its cells undergo programmed somatic deletion of about 25% of the genome, and since its genome was sequenced from somatic tissue, that portion was obviously missing from the assembly.

> 2) AED distributions are identical between rounds 2 & 3, though it sounds like there should be a vast improvement from the tutorials and my readings

That's what you expect. The third round should show just minor improvement (AED is not a highly precise number, so a difference of 1% basically means the second and third round results are identical for evidence support). The real improvement from the second round to the third is in the quality of the unaided SNAP models (you really only get a sense of this by using Apollo to view a few contigs). Because the MAKER models are derived from evidence-based hints, they will always be similar between runs, but the raw SNAP models in round 3 will be much more like the MAKER models than the unaided SNAP models from round 2 were. This convergence helps you know that your gene predictor is trained. You may also want to train Augustus and add that to your set of predictors (look for convergence between MAKER, SNAP, and Augustus models to indicate training has worked).
Augustus generally performs better on vertebrates than SNAP. On some vertebrates you actually have to drop SNAP completely (SNAP runs very poorly on the human genome, for example). On genomes where you drop SNAP, you would just use Augustus (look at evidence alignments and convergence between MAKER/SNAP/Augustus models to make that decision).

> 3) It strikes me Roughly 75% of the genome is not being used (my suspicion is that scaffolds are smaller than the maker threshold for consideration). Should I lower this threshold so that more of the genome is considered?

The default threshold for consideration is 1 bp. But when you actually run the predictors you will realize that they cannot physically put a multi-exon gene in contigs below about 10 kb in length. So MAKER will run them, but you just won't get any results.

> 4) I realize that after round 1 I generated the HMM based on all of the genes predicted by the evidence, and it seems some have taken the approach of restricting this list to the "best" genes, or using something like CEGMA. Is my method of HMM construction to blame?

Your HMMs are probably fine (look for convergence between the raw SNAP models and the MAKER evidence-based models to see if SNAP is behaving well). I think you probably need a better protein database, and perhaps need to improve repeat masking as well (try running RepeatModeler; I can't overstate the importance of this, since repeats can essentially break a gene predictor). Try adding Augustus to the analysis. Also, in general I've found that Cufflinks-processed evidence is far too noisy and adversely affects annotation results. Try processing the transcript data with Trinity instead (you will get better gene models). I doubt additional training of SNAP is necessary.

> 5) Am I worried about nothing here? Is this a pretty decent annotation?

A reasonable expectation of accuracy for a first draft genome is probably in the upper 70s to high 80s.
Extremely high quality assemblies with lots of good transcript data might break into the 90s. For example, more than 40% of the genes from the original draft of the mouse genome have since been thrown out (http://www.biomedcentral.com/1471-2105/10/67). The total gene count has remained similar, but those counts are actually based on new genes in new locations in the genome. Also, the honeybee genome recently got major improvements in its annotations (50% increase in gene count) after problems with the original assembly and annotation process were fixed (http://www.biomedcentral.com/1471-2164/15/86).

--Carson
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From jason.r.gallant at gmail.com Wed Feb 25 09:40:40 2015
From: jason.r.gallant at gmail.com (Jason Gallant)
Date: Wed, 25 Feb 2015 11:40:40 -0500
Subject: [maker-devel] Evaluating Genome Annotation
Message-ID:

Hi Folks,

I'm in the process of evaluating the genome annotation that I produced using AWS (see earlier message). This is a de novo genome assembly, for which there is no closely related species. As such, I followed the standard procedure using only SNAP (for starters).

genome = 4,668 scaffolds with N50 > 1.7 Mb
custom RepeatMasker database

Round 1:
est2genome=1
protein2genome=1
Concatenated RefSeq proteins from 3 vertebrates
Cufflinks assembly of 12 tissues (~253,000,000 reads)

Using this entire set of genes, I created a SNAP HMM, following the online tutorial, and ran a second round of MAKER:

Round 2:
est2genome=0
protein2genome=0
snaphmm=round1.hmm
Concatenated RefSeq proteins from 3 vertebrates
Cufflinks assembly of 12 tissues (~253,000,000 reads)

I used the resulting genes to train a second SNAP HMM, as suggested by the tutorial, and ran a third round of MAKER:

Round 3:
est2genome=0
protein2genome=0
snaphmm=round2.hmm
Concatenated RefSeq proteins from 3 vertebrates
Cufflinks assembly of 12 tissues (~253,000,000 reads)

I'm concerned that the multiple iterations did not really improve my annotation. Here are some of the metrics that I've been able to calculate thus far:

Using Fathom:
Round 1 contains 44,883 genes (43,364 multi-exon) over 1,410 sequences
Round 2 contains 15,946 genes (15,812 multi-exon) over 1,166 sequences
Round 3 contains 15,514 genes (15,389 multi-exon) over 1,147 sequences

Using the AED_cdf_generator.pl script, I was able to calculate the cumulative AED scores for each round. I suspect that for the first round this distribution is meaningless, since the gene models are calculated directly from evidence. Interestingly, rounds 2 and 3 had remarkably similar AED scores throughout the table: in round 2, 92% of my genes had an AED score of 0.5 or lower, whereas in round 3, 91.9% had an AED score of 0.5 or lower.

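For readers who want to reproduce a figure like "92% of genes at AED 0.5 or lower" without the AED_cdf_generator.pl script, here is a minimal Python sketch. It is not the MAKER script itself. It assumes that each mRNA line in the MAKER GFF3 output carries an `_AED=` attribute in column 9; the function names and the four example records are made up for illustration.

```python
import re
from typing import Iterable, List

# Match "_AED=<number>" at the start of the attribute column or after a ';'.
# Anchoring this way keeps the pattern from picking up the separate _eAED tag.
AED_RE = re.compile(r"(?:^|;)_AED=([0-9.]+)")


def aed_values(gff3_lines: Iterable[str]) -> List[float]:
    """Collect the _AED attribute from every mRNA feature line."""
    values = []
    for line in gff3_lines:
        if line.startswith("#"):
            continue  # skip GFF3 directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue  # AED is assumed to be reported per transcript
        match = AED_RE.search(cols[8])
        if match:
            values.append(float(match.group(1)))
    return values


def fraction_at_or_below(values: List[float], cutoff: float) -> float:
    """One point of the cumulative distribution: share of AED values <= cutoff."""
    if not values:
        return 0.0
    return sum(v <= cutoff for v in values) / len(values)


# Hypothetical records, only to show the expected input shape.
example = [
    "##gff-version 3",
    "ctg1\tmaker\tgene\t1\t900\t.\t+\t.\tID=g1",
    "ctg1\tmaker\tmRNA\t1\t900\t.\t+\t.\tID=g1-mRNA-1;Parent=g1;_AED=0.10;_eAED=0.10",
    "ctg1\tmaker\tmRNA\t2000\t2900\t.\t-\t.\tID=g2-mRNA-1;Parent=g2;_AED=0.50;_eAED=0.55",
    "ctg2\tmaker\tmRNA\t10\t700\t.\t+\t.\tID=g3-mRNA-1;Parent=g3;_AED=0.80;_eAED=0.80",
    "ctg2\tmaker\tmRNA\t800\t1500\t.\t+\t.\tID=g4-mRNA-1;Parent=g4;_AED=1.00;_eAED=1.00",
]
aeds = aed_values(example)
print(fraction_at_or_below(aeds, 0.5))  # 0.5 (two of four transcripts)
```

Running the same two functions on the round 2 and round 3 GFF3 files and comparing the fractions at a few cutoffs (0.25, 0.5, 0.75) would give the kind of side-by-side table described above.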
Here are my questions:

1) I suppose I would have expected more genes to be predicted; instead, each iteration seems to produce fewer genes (although there is only a very slight difference between rounds 2 & 3).

2) AED distributions are identical between rounds 2 & 3, though from the tutorials and my readings it sounds like there should be a vast improvement.

3) It strikes me that roughly 75% of the genome is not being used (my suspicion is that scaffolds are smaller than the MAKER threshold for consideration). Should I lower this threshold so that more of the genome is considered?

4) I realize that after round 1 I generated the HMM based on all of the genes predicted by the evidence, and it seems some have taken the approach of restricting this list to the "best" genes, or using something like CEGMA. Is my method of HMM construction to blame?

5) Am I worried about nothing here? Is this a pretty decent annotation?

Thanks for any input you folks are able to provide!

Happy annotating!
Jason Gallant
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From carsonhh at gmail.com Mon Feb 2 09:40:06 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 2 Feb 2015 09:40:06 -0700
Subject: [maker-devel] How to improve the result of Maker
In-Reply-To:
References: <492A6635-67E9-4700-B544-E137C4248E55@gmail.com>
Message-ID: <1F69B446-8899-41BE-BFB8-5DA61BB359A8@gmail.com>

When you add a new exon, Apollo will always recalculate the reading frame to take the longest ORF, so even though the first exon might not be the same, the other exons don't allow for a longer ORF either. So the ORF you got was the longest possible given any combination of all exons (even if the first exon would have been made UTR). That confirms my suspicion that that particular exon was ignored because it breaks any possible reading frame. It likely contains an assembly error.

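As an illustration of the longest-ORF reasoning above, here is a small Python sketch. It is not Apollo's or MAKER's code; it only shows, under simplifying assumptions (forward strand only, ATG start, standard stop codons, made-up exon sequences), how an exon that shifts the frame can shorten the longest ORF of a spliced transcript, so that a model without that exon has the longer coding sequence.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq: str) -> Tuple[int, int]:
    """Return (start, end) of the longest ATG..stop ORF on the forward strand.

    Coordinates are 0-based and end-exclusive, and include the stop codon.
    Returns (0, 0) when no complete ORF exists in any of the three frames.
    """
    seq = seq.upper()
    best = (0, 0)
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3) - start > best[1] - best[0]:
                    best = (start, i + 3)
                start = None  # look for another ORF further downstream
    return best


def spliced(exons: List[str]) -> str:
    """Join exon sequences in transcript order."""
    return "".join(exons)


# Hypothetical exons. exon_b is 7 nt long, so it shifts the frame, and it
# introduces an in-frame TGA stop when placed after exon_a.
exon_a = "ATGGCCGAC"
exon_b = "GGTTGAC"
exon_c = "CTGAAAGGCTAA"

print(longest_orf(spliced([exon_a, exon_b, exon_c])))  # (0, 15)
print(longest_orf(spliced([exon_a, exon_c])))          # (0, 21)
```

In this toy case, including `exon_b` ends the ORF after 15 nt and leaves everything downstream non-coding, which mirrors the "stop codon in it, the region behind it becomes UTR" observation in the quoted message below.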
--Carson

> On Jan 31, 2015, at 8:54 AM, ??? wrote:
>
>> There are two possibilities. Given how different the SNAP and Augustus models are from one another, this would suggest they have not been trained appropriately (for example, if you are picking another related organism's parameter file rather than training these programs, there are several assumptions being made that can actually make such an approach almost worse than just picking a parameter file at random). But more likely the evidence-supported exon breaks the reading frame of the model. This usually indicates that you have an assembly error (possibly issues with homopolymers). No amount of evidence support will allow you to call an exon that generates a missense-causing frameshift, so the predictors do the next most reasonable thing: they drop the exon if another model is tenable. More concerning would be the mRNA-seq alignments near the 3' end of the gene call. The structure suggests significant capture of background transcription with the mRNA-seq reads (long UTRs with weird mini-introns). I would suggest not using Cufflinks in this case. You should probably go with an assembly-based approach for the mRNA-seq reads instead. I would suggest using Trinity. It will reduce sensitivity but greatly increase evidence specificity, which is where you need the most improvement based on these images. I would also suggest using the jaccard_clip option with Trinity.
>>
>> I would further suggest looking at the model in question using Apollo, and manually adding the exon (click and drag it into the model). You can examine the reading frame after adding the exon and see if it is in fact a frameshift assembly error. If it's a homopolymer-derived frameshift, then you can expect a lot more of these throughout your assembly.
>
> I dragged the exon into the model. There is a stop codon in it, which causes the region behind it to become UTR, here:
>
> The exon in question is indicated by the red arrow. But the uppermost evidence is the complete EST from NCBI, and it contains start and stop codons. Then I noticed that the 5' boundary of the 2nd exon in the model is not the same as in the EST, so it causes a frameshift, which produces the stop codon in the exon indicated by the red arrow. The first exon should not be CDS, as there would be a start codon in the 2nd exon if its 5' boundary were predicted correctly. Would "always_complete=1" fix it?
>
> I will try to use Trinity.
>
>> Also, I do not see any protein alignments here? MAKER cannot work on transcript evidence alone. You need to provide the full proteome of at least two other species (they don't have to be that closely related, but closer is better). Protein alignments will also help you better interpret the coding status of exons supported by mRNA-seq. For example, in the second image you would expect protein evidence to support all the coding exons but not the UTR exons, which would remove any doubt as to whether an exon is really UTR or not.
>
> I did use 3 sources of protein evidence: one is the proteome from a related species, one is the proteome from fruit fly, and the last one is Swiss-Prot.
>
> Thank you very much!
>
> Best regards,
> Wenbo
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From xvazquezc at gmail.com Mon Feb 2 14:49:02 2015
From: xvazquezc at gmail.com (Xabier Vázquez Campos)
Date: Tue, 3 Feb 2015 08:49:02 +1100
Subject: [maker-devel] genome duplication?
In-Reply-To: <54B3C36E-5537-4C53-9E2C-F9D60914E786@gmail.com>
References: <54B3C36E-5537-4C53-9E2C-F9D60914E786@gmail.com>
Message-ID:

Thanks Carson. Any suggestion on the size limit to separate the short contigs, e.g. <500 bp?

On 03/02/2015 3:36 AM, "Carson Holt" wrote:

> MAKER requires every gene to have at least some evidence support. This is very important for most eukaryotes, as false positive predictions will dominate what is called by snap/augustus.
> However, it is not such a large problem in fungi because of their high gene density and less frequent introns. Setting keep_preds=1 will maximize sensitivity at the cost of specificity (bad idea in most eukaryotes, but not so much in fungi). I would not be surprised if a bias toward sensitivity is used by most fungi annotation projects with every gene that can be annotated being annotated (even if it does increase false positives). It is a tactic that can work at least in fungi.
>
> Also if the assembly is fragmented, you will be less likely to have evidence support for all genes as the evidence alignments will not meet the % coverage thresholds in the maker_bopts.ctl file. You may want to separate out your shorter contigs, and annotate them separately with more relaxed thresholds for pcov_blast=, pid_blast=, ep_score_limit=, and en_score_limit=.
>
> --Carson
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From carsonhh at gmail.com Mon Feb 2 14:50:02 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 2 Feb 2015 14:50:02 -0700
Subject: [maker-devel] genome duplication?
In-Reply-To:
References: <54B3C36E-5537-4C53-9E2C-F9D60914E786@gmail.com>
Message-ID: <441998AE-D660-485F-BAFD-44BD50765156@gmail.com>

Anything less than 10 kb.

--Carson

> On Feb 2, 2015, at 2:49 PM, Xabier Vázquez Campos wrote:
>
> Thanks Carson. Any suggestion on the size limit to separate the short contigs, e.g. <500 bp?
>
> On 03/02/2015 3:36 AM, "Carson Holt" wrote:
> MAKER requires every gene to have at least some evidence support. This is very important for most eukaryotes, as false positive predictions will dominate what is called by snap/augustus. However, it is not such a large problem in fungi because of their high gene density and less frequent introns. Setting keep_preds=1 will maximize sensitivity at the cost of specificity (bad idea in most eukaryotes, but not so much in fungi). I would not be surprised if a bias toward sensitivity is used by most fungi annotation projects with every gene that can be annotated being annotated (even if it does increase false positives). It is a tactic that can work at least in fungi.
>
> Also if the assembly is fragmented, you will be less likely to have evidence support for all genes as the evidence alignments will not meet the % coverage thresholds in the maker_bopts.ctl file. You may want to separate out your shorter contigs, and annotate them separately with more relaxed thresholds for pcov_blast=, pid_blast=, ep_score_limit=, and en_score_limit=.
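The suggestion in this thread, splitting off contigs shorter than 10 kb so they can be annotated in a separate run, could be scripted roughly as follows. This is a minimal sketch in plain Python rather than any MAKER accessory script; the function names and the output file names in the usage note are hypothetical, and the thread gives no specific relaxed values for pcov_blast, pid_blast, ep_score_limit, or en_score_limit, so none are suggested here.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_by_length(records: Iterable[Record],
                    min_len: int = 10_000) -> Tuple[List[Record], List[Record]]:
    """Separate records into (length >= min_len, length < min_len)."""
    long_recs, short_recs = [], []
    for name, seq in records:
        (long_recs if len(seq) >= min_len else short_recs).append((name, seq))
    return long_recs, short_recs


def write_fasta(records: Iterable[Record], path: str, width: int = 60) -> None:
    """Write records to a FASTA file, wrapping sequence lines at `width`."""
    with open(path, "w") as handle:
        for name, seq in records:
            handle.write(f">{name}\n")
            for i in range(0, len(seq), width):
                handle.write(seq[i:i + width] + "\n")


# Toy input with a 10 bp cutoff so the behaviour is easy to see.
demo = [">ctg1", "ACGT" * 5, ">ctg2", "ACG", "TT", ">ctg3", "A" * 12]
long_recs, short_recs = split_by_length(read_fasta(demo), min_len=10)
print([name for name, _ in long_recs])   # ['ctg1', 'ctg3']
print([name for name, _ in short_recs])  # ['ctg2']
```

On a real assembly one would open the genome FASTA, keep the default `min_len=10_000`, and write the two lists with `write_fasta` to, say, `long_contigs.fa` and `short_contigs.fa`, then point `genome=` at each file in its own MAKER run, with the relaxed maker_bopts.ctl used only for the short set.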
>
> --Carson
>
>
>> On Jan 31, 2015, at 4:21 PM, Jason Stajich wrote:
>>
>> Xabier -
>> FYI - though you probably already compared, those stats are on par with the Hortaea v1 assembly (we do have an improved Hortaea assembly now, and the genome size is still in the same range, supporting the duplication hypothesis).
>> Hw version 1 assembly -
>> N50 9623; Max 71563
>> CEGMA for Hw1
>>            #Prots  %Completeness  -  #Total  Average  %Ortho
>>
>> Complete   196     79.03          -  498     2.54     81.12
>> Partial    228     91.94          -  673     2.95     95.18
>>
>>
>> Mikael - yes - we should compare notes on the models JGI is calling which have little support in MAKER. I am not sure whether their pipeline runs augustus/snap with informant hints, though usually they bring RNAseq into the mix. I don't know if your approach for reannotation assembled the RNAseq and used it as evidence?
>>
>> We'll be trying to assess some of this when we compare the proportion of shared genes in the first 1KFG paper, so we may be able to say with more certainty whether these extra predictions are shared more widely, and get a handle on singleton/false-positive rates.
>>
>> Jason
>>
>> Jason Stajich
>> jason.stajich at gmail.com
>>
>> On Sat, Jan 31, 2015 at 12:51 AM, Xabier Vázquez Campos wrote:
>> Thanks Mikael,
>>
>> These are the assembly stats as taken from abyss-fac. Indeed it isn't a great N50, but it isn't that bad either.
>>
>> n      n:500  n:N50  min  N80   N50    N20    E-size  max     sum
>> 14277  7099   1185   500  4698  10771  20438  14530   154519  42.68e6
>>
>>
>>
>> 2015-01-31 19:42 GMT+11:00 Mikael Brandström Durling:
>> Hi Xabier,
>>
>>> On 31 Jan 2015, at 05:48, Xabier Vázquez Campos wrote:
>>>
>>> Hi all,
>>>
>>> One of the fungal genomes I'm annotating is relatively shattered (?), with many contigs/scaffolds, and the CEGMA analysis alone may indicate a potential widespread duplication of the genome.
>>>
>>> # Statistics of the completeness of the genome based on 248 CEGs #
>>>            #Prots  %Completeness  -  #Total  Average  %Ortho
>>>
>>> Complete   181     72.98          -  365     2.02     67.40
>>> Partial    230     92.74          -  528     2.30     77.83
>>
>>
>> Judging from these figures, you seem to have a very fragmented assembly. What N50 have you reached? In my experience, assemblies with an N50 below 5-10 times the average gene length tend to give problems in producing good gene sets. That is not to say that the gene sets are unusable, but for comparing e.g. gene complements to other species, it will be hard to draw any conclusions when a high proportion of the genes are incomplete.
>>
>>> The expected genome size is relatively low (~42 Mb by abyss-fac) in comparison with Hortaea werneckii (51.6 Mb, 23333 genes), a related fungus with nearly 90% of its genes present in at least two copies.
>>> Paper: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0071328
>>>
>>> Now to the Maker part... As part of the Maker annotation, I trained SNAP and Augustus, and I generated a specific RepeatModeler library. I recorded the predicted outputs from each Maker run (AED, number of predicted proteins and transcripts...). Both Augustus and SNAP gave quite high numbers (~19000 and ~23000 respectively) in comparison with the xxx.all.maker.proteins.fasta (about 13600). So, my first question is, how does Maker deal with gene duplications? Or is this just a consequence of there being no support from the protein files initially provided to Maker? I've used 4 different protein files for the annotation; could it be that they weren't the best choices? I picked them from the closest relatives and similar environments.
>>
>> Unless you by mistake filter out duplicated gene families as repeats with RepeatModeler, MAKER should not care about duplicated genes. However, MAKER, without keep_preds=1, reports only genes with some kind of support (be it EST or protein homology). This is rather conservative, but if you enable keep_preds, you will get more genes, as you have noted. Just for the sake of comparison, I have reannotated more than ten genomes downloaded from JGI, providing MAKER with similar evidence as JGI, and consistently MAKER reports fewer gene models. I have yet to do a more thorough comparison to tell which genes JGI is reporting that don't appear in the MAKER annotations.
>>
>>>
>>> So, in my last run I turned on keep_preds=1 and the proteins in the xxx.all.maker.proteins.fasta reached to
>>>
>>> Last question regarding the protein files. I downloaded the annotated genomes from the JGI and most of them have two annotation folders, "All_models,_Filtered_and_Not" and "Filtered_Models___best__". I've been using the protein files found in the latter, as I expected them to have real evidence and a lower chance of predicting false genes. Am I right?
>>
>> Yes, I would say so. The FilteredModels have passed through their model selection pipeline, while all_models contains models from all predictors, as well as combinations of predictors and EST evidence.
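The N50 rule of thumb raised earlier in this reply (problems when N50 falls below roughly 5-10 times the average gene length) can be checked by hand. The following is a minimal Python sketch, assuming the contig lengths are already available as a list; the function names and the 1,500 bp mean gene length are illustrative assumptions, not values taken from this thread.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length of the contig at which the running total of
    lengths, taken from longest to shortest, first reaches half of
    the total assembly size.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no contig lengths given")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def n50_to_gene_length_ratio(lengths, mean_gene_length):
    """How many average-sized genes fit into an N50-sized contig."""
    return n50(lengths) / mean_gene_length


if __name__ == "__main__":
    toy_lengths = [100, 200, 300, 400, 500]  # made-up contig lengths
    print(n50(toy_lengths))  # 400: 500 + 400 = 900 reaches half of 1500
    # Hypothetical mean gene length of 1,500 bp; substitute your own estimate.
    print(n50_to_gene_length_ratio(toy_lengths, mean_gene_length=1500))
```

With the abyss-fac N50 of 10771 reported in this thread and that hypothetical 1,500 bp mean gene length, the ratio would be about 7, which sits inside the 5-10x borderline range.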
>>
>> Just some 2 cents of observations of mine,
>> cheers,
>> Mikael
>>
>>>
>>> Thank you in advance,
>>>
>>> Xabier
>>>
>>>
>>> --
>>> Xabier Vázquez Campos
>>> PhD Candidate
>>> Water Research Centre
>>> School of Civil and Environmental Engineering
>>> The University of New South Wales
>>> Sydney NSW 2052 AUSTRALIA
>>> _______________________________________________
>>> maker-devel mailing list
>>> maker-devel at box290.bluehost.com
>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From jgallant at msu.edu Tue Feb 3 11:13:13 2015
From: jgallant at msu.edu (Jason Gallant)
Date: Tue, 03 Feb 2015 10:13:13 -0800 (PST)
Subject: [maker-devel] Est2Genome Problems
Message-ID: <1422987193321.4df3c9d5@Nodemailer>

Hi Folks,

I've nearly succeeded at getting MAKER to run on AWS. I've been checking the output files, and have noticed that none of my RNAseq data was incorporated in the run. I used Cufflinks to perform alignments of libraries from several tissues, ran the accessory script cufflinks2gff3 for each tissue, then concatenated the resulting gff3 files. I even ran the accessory script gff3merge to check that the resulting file was properly formatted.

For options, I set est2genome=1 and est_gff=cufflinks.gff.
I only get protein2genome and repeatmasker evidence in my resulting maker gff3 file, and the genes predicted by these. Is there another option that I need to enable in order to use my est_gff file? I'm trying to get a set of genes to train the predictors for my next step.

Any help would (as always) be greatly appreciated!

Best,
Jason Gallant

--
Dr. Jason R. Gallant
Assistant Professor
Room 38 Natural Sciences
Department of Zoology
Michigan State University
East Lansing, MI 48824
jgallant at msu.edu
office: 517-884-7756

From avhoeck at SCKCEN.BE Thu Feb 5 07:37:27 2015
From: avhoeck at SCKCEN.BE (Van Hoeck Arne)
Date: Thu, 5 Feb 2015 14:37:27 +0000
Subject: [maker-devel] Maker-P with RNAseq (denovo assembly or mapping to reference?)
Message-ID: <9BCA01D5BDC2AF46822CA182B4FBD0DF117763B0@MAILSRV3.sck.be>

Dear,

I have read the manuscript on the MAKER-P tool. I'm finishing the last part of my plant genome assembly, and MAKER-P will be used to determine gene annotations. I have read the wiki tutorials and your manuscript, but one question comes to mind about evidence sources when RNAseq data is available for gene discovery (as in our case):

Why do you perform an RNA de novo assembly with individual RNAseq runs? As I understand from your paper (a tool kit for the rapid creation, ...), the RNAseq reads were short single reads. This means that it is not easy for de novo RNA assemblers (like trinity) to find high quality genes, since a lot of reads will be lost because of the low coverage. If 2*200 PE reads were used for RNAseq, this would create more genes.

Therefore, as an alternative: why not map the RNAseq reads to your reference genome, extract this as a fasta file, and cut off all the reads shorter than 150 bp? If you concatenate all the RNAseq data into one file before the mapping, all your expressed RNA is mapped and included in the fasta file that you need for gene discovery.
In my opinion, the low-expressed reads will be lost during the assembly. Or am I overlooking something?
We have sequenced our genome with PE Illumina data. For DE analyses, we did 48 runs of single 50 bp reads, at lower coverage.
Therefore, is assembling the RNAseq data still better than mapping the data to our reference genome?

PS: can I add a question on the google group? I couldn't start a new topic.

Thanks in advance,
Arne Van Hoeck

From carsonhh at gmail.com Thu Feb 5 09:27:41 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 5 Feb 2015 09:27:41 -0700
Subject: [maker-devel] Maker-P with RNAseq (denovo assembly or mapping to reference?)
In-Reply-To: <9BCA01D5BDC2AF46822CA182B4FBD0DF117763B0@MAILSRV3.sck.be>
References: <9BCA01D5BDC2AF46822CA182B4FBD0DF117763B0@MAILSRV3.sck.be>
Message-ID: <3A5B068C-C6F2-479C-8874-0C0C0F008FBD@gmail.com>

There is no requirement or even specification in MAKER that transcript data be sequenced or processed in any specific way. How this is done depends entirely on the user and the project (different strategies work better in different organisms). The only requirement for MAKER is that you have some form of transcript evidence. The utility of transcript evidence is primarily for identification of introns and splice sites. If evidence is supplied as FASTA sequence, then it will be aligned around splice sites using exonerate (a splice-aware aligner). If the evidence is processed via another tool, you are expected to supply it in GFF3 format with the evidence already aligned around splice sites (examples of this include cufflinks and BLAT). How you generate these FASTA files or GFF3 files is up to you.
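When evidence is handed over as pre-aligned GFF3, one quick sanity check is to tally which sources and feature types a GFF3 file actually contains, for either the evidence file or the annotation output. The sketch below relies only on the standard nine-column, tab-separated GFF3 layout. The function name and the sample records are made up for illustration, and which source labels MAKER writes for a given evidence type is not asserted here.

```python
from collections import Counter


def tally_gff3(lines):
    """Count (source, type) pairs, i.e. columns 2 and 3, of GFF3 records."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # an embedded sequence section follows; no more features
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        columns = line.split("\t")
        if len(columns) != 9:
            continue  # not a well-formed feature line
        counts[(columns[1], columns[2])] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical records, only to show the shape of the input.
    sample = [
        "##gff-version 3",
        "ctg1\test2genome\texpressed_sequence_match\t100\t900\t.\t+\t.\tID=e1",
        "ctg1\test2genome\tmatch_part\t100\t400\t.\t+\t.\tID=e1.1;Parent=e1",
        "ctg1\tprotein2genome\tprotein_match\t120\t880\t.\t+\t.\tID=p1",
    ]
    for (source, feature_type), n in sorted(tally_gff3(sample).items()):
        print(source, feature_type, n)
```

To run it on a real file, pass an open file handle, for example `tally_gff3(open("my.gff3"))`, where the file name is a placeholder. A source that never appears in the tally is a hint that the corresponding evidence did not make it into that file.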
Remember that final models are not strictly based on the transcript data (it's just a piece of the puzzle). Rather, final models will be generated from the combination of transcript alignments, protein alignments, and the HMMs from algorithms such as SNAP and Augustus, which must have been trained on the species being annotated.

With respect to the most effective methods for processing mRNA-seq data, that depends on many factors. But in general, read mapping to the genome will result in greater sensitivity but lower specificity. There will be a lot of false alignments and alignments of background transcription (>98% of the genome is expressed at detectable levels, so it is not just the genes). Assembly-based methods, on the other hand, result in very high specificity but lower sensitivity. Assembly-based methods can also better resolve issues of overlapping UTRs and genes overlapping on opposite strands than the alignment-based methods (trinity's jaccard_clip option, for example). A loss in sensitivity can also be compensated for with matched protein data and well-trained HMMs.

Which analysis methods are more effective will be determined by factors such as gene density, average intron length, repeat content, etc. You can try both methods and see which appears to work better for your organism. Both have their trade-offs.

For sequencing, because of the physics involved in RNA folding, fragmentation of the transcriptome improves sensitivity but can also preclude the use of long paired-end reads (this is because fragments end up being too short). Short reads also reduce specificity and can align more randomly to the genome (but the gain in sensitivity from fragmentation usually far outweighs the loss of specificity from the short reads). Larger fragments of RNA, or no fragmentation of the RNA, allow for longer reads and paired-end sequencing but will result in lower sensitivity and recovery of the transcriptome.
This is because RNA folding, which occurs on larger fragments, physically blocks reverse transcriptase. You will also end up recovering mostly the 5' end of genes because the inhibition of reverse transcriptase results in a 5' bias to the sequencing reaction. The read coverage graphs end up looking like ski slopes, and longer genes will not get coverage on the 3' end.

There are different strategies for transcriptome sequencing and sample preparation to deal with strandedness and sequencing bias. Each has its trade-offs, and what strategy you use depends on what you plan on using the sequenced reads for as well as the structure of that organism's genome and transcriptome.

--Carson

From carsonhh at gmail.com Thu Feb 5 10:22:12 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 5 Feb 2015 10:22:12 -0700
Subject: [maker-devel] Maker-P with RNAseq (denovo assembly or mapping to reference?)
In-Reply-To: <9BCA01D5BDC2AF46822CA182B4FBD0DF11776439@MAILSRV3.sck.be>
References: <9BCA01D5BDC2AF46822CA182B4FBD0DF117763B0@MAILSRV3.sck.be> <3A5B068C-C6F2-479C-8874-0C0C0F008FBD@gmail.com> <9BCA01D5BDC2AF46822CA182B4FBD0DF11776439@MAILSRV3.sck.be>
Message-ID: <19D25327-4D46-44B4-854B-1BEEFBD23C06@gmail.com>

I find that erring on the side of specificity works better for most annotation projects. But this is not always true, and you can try a few large contigs using an alignment approach like cufflinks and compare it to an assembly approach like trinity to decide which appears to perform better. Also, you need to take into account the ultimate goal of the project. Some projects want to annotate absolutely everything and don't care about false positives, while others want to maximize specificity and care more about avoiding bad models.
Oftentimes this has to do with some planned downstream experiment that would be adversely affected by one or the other. I tend to prefer high specificity because MAKER's automated approach to re-annotation means that if evidence later shows that a real gene is missing, that evidence automatically supports inclusion of the gene in the next automated release of the genome. But false models tend to persist and are harder to get rid of, even though they lack any evidence support. These false models produced by sensitivity-focused approaches then tend to poison downstream experiments and lead to more time being wasted by researchers. This is seen a lot in plant genomes, where transposons and pseudogenes tend to pollute genome releases for historical reasons. Basically, once they are in a genome release, the burden of proof for removing them becomes higher than if they had never been included in the first place. Researchers unaware of this may find they have been studying a transposon for weeks or months because some expression or variant analysis early on listed it as a candidate gene for some desired phenotype.

MAKER can handle several hundred thousand contigs in the assembly, but in general contigs smaller than 10 kb will not be annotatable (although smaller contigs can be used for gene-dense organisms with short introns). It is better to exclude these short contigs from the analysis for processing efficiency.

--Carson

> On Feb 5, 2015, at 9:52 AM, Van Hoeck Arne wrote:
>
> Thanks for this comprehensive and clear answer, Carson.
>
> So in conclusion, it's better to make a concise file with very accurate transcripts (assembly method) instead of a large file of possible transcripts (mapping RNAseq data to the reference), which contains more false positives.
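The 10 kb guideline in the reply above can be applied to the assembly before running MAKER. A minimal Python sketch, assuming the assembly is a plain FASTA file that fits in memory; the function names and the default cutoff parameter are illustrative, and the right cutoff depends on the organism, as noted above for gene-dense genomes.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_by_length(records, min_length=10_000):
    """Separate records into (long, short) lists around min_length."""
    long_contigs, short_contigs = [], []
    for header, sequence in records:
        if len(sequence) >= min_length:
            long_contigs.append((header, sequence))
        else:
            short_contigs.append((header, sequence))
    return long_contigs, short_contigs


if __name__ == "__main__":
    # Toy input with a tiny cutoff so the split is visible.
    toy = [">ctg1", "ACGTACGT", ">ctg2", "AC"]
    kept, set_aside = split_by_length(read_fasta(toy), min_length=5)
    print([h for h, _ in kept], [h for h, _ in set_aside])
```

Keeping the short contigs in their own list rather than discarding them leaves the option of annotating them separately later.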
>
> Another small question: can MAKER handle a lot of contigs (around 10,000), or is it better to make artificial chromosomes by pasting contigs together with a certain number of N's (let's say 1000 > exon length)?
>
> Thanks a lot for your quick response
> Arne

From chenwenbo1020 at gmail.com Mon Feb 9 16:20:34 2015
From: chenwenbo1020 at gmail.com (陈文博)
Date: Mon, 9 Feb 2015 18:20:34 -0500
Subject: [maker-devel] How to set "calculate longest ORF" in Maker
Message-ID:

Greetings,

I noticed some cases in the output of Maker where the ORF is not the longest one, e.g. the one below

[image: inline image 1]

If I manually correct it in Apollo with "calculate longest ORF", then it becomes

[image: inline image 2]

I thought the updated one makes more sense. So how can I get Maker to output the longest ORF automatically?

Thanks very much!

Wenbo

From dence at genetics.utah.edu Mon Feb 9 17:14:45 2015
From: dence at genetics.utah.edu (Daniel Ence)
Date: Tue, 10 Feb 2015 00:14:45 +0000
Subject: [maker-devel] How to set "calculate longest ORF" in Maker
In-Reply-To:
References:
Message-ID:

Hi,

In the images that you sent, it looks like the ab-initio predictor had predicted two ORFs, while the evidence supported a single model. MAKER doesn't have an option to prefer longer models; its metric is to choose the prediction that is best supported by the alignment evidence.

How many ab-initio predictors did you use in generating the results that you sent us?
It looks like you only used one, which won't give good results.

~Daniel

From chenwenbo1020 at gmail.com Mon Feb 9 19:06:31 2015
From: chenwenbo1020 at gmail.com (陈文博)
Date: Mon, 9 Feb 2015 21:06:31 -0500
Subject: [maker-devel] How to set "calculate longest ORF" in Maker
In-Reply-To:
References:
Message-ID:

Hi Daniel,

Thank you very much for the suggestion.

I used three predictors: SNAP, Augustus and pred_gff.
Judging from the image, I thought the prediction of Augustus (shown in pink) matched the evidence better. Why did Maker choose SNAP's?

[image: inline image 1]

Thanks again
Best regards,
Wenbo

From carsonhh at gmail.com Mon Feb 9 19:22:46 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 9 Feb 2015 19:22:46 -0700
Subject: [maker-devel] How to set "calculate longest ORF" in Maker
In-Reply-To:
References:
Message-ID: <669148EB-B537-4004-B323-D119B1056269@gmail.com>

The gene model from maker is restricted to use the reading frame of the ab initio predictor. The better model would use a different reading frame. The augustus model has a missing exon, so it gets a lower score. Snap in general just looks bad. I'd say it needs to be retrained, or maybe just drop Snap from the analysis.

--Carson

From carsonhh at gmail.com Tue Feb 10 09:56:53 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 10 Feb 2015 09:56:53 -0700
Subject: [maker-devel] Est2Genome Problems
In-Reply-To: <1422987193321.4df3c9d5@Nodemailer>
References: <1422987193321.4df3c9d5@Nodemailer>
Message-ID: <119684F8-8071-4318-A129-3D90EC54242A@gmail.com>

I ran a few est2genome runs with a cufflinks file I just generated and did not see any issues with EST-based gene models. I'd like to at least have your test set to see if I can duplicate what you are seeing. Use this to upload the job files, then I can just run it from my server here: http://weatherby.genetics.utah.edu/cgi-bin/mwas/bug.cgi

--Carson

From jason.r.gallant at gmail.com Tue Feb 10 12:54:46 2015
From: jason.r.gallant at gmail.com (Jason Gallant)
Date: Tue, 10 Feb 2015 11:54:46 -0800 (PST)
Subject: [maker-devel] Using MAKER MPI on Amazon Cloud
Message-ID: <1423598085704.ad38b0a2@Nodemailer>

Hi Folks,

My experiments with AWS and MAKER have proven successful. I've been working on documenting the experiment on my blog, which is now published here:

http://efish.zoology.msu.edu/running-maker-genome-annotation-on-starcluster/

Please let me know if you find this useful, or have any questions or feedback!

Best Regards,
Jason Gallant

--
Dr. Jason R. Gallant
Assistant Professor
Room 38 Natural Sciences
Department of Zoology
Michigan State University
East Lansing, MI 48824
jgallant at msu.edu
office: 517-884-7756

From carsonhh at gmail.com Tue Feb 10 13:04:15 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 10 Feb 2015 13:04:15 -0700
Subject: [maker-devel] Using MAKER MPI on Amazon Cloud
In-Reply-To: <1423598085704.ad38b0a2@Nodemailer>
References: <1423598085704.ad38b0a2@Nodemailer>
Message-ID: <8D30117C-88DF-4170-9CD8-590AAB79594D@gmail.com>

This is awesome. Thanks for going through all the pain of figuring that out. I am definitely going to have to try annotating something through AWS now just to see how it compares to running on a local cluster.

--Carson

From bmoore at genetics.utah.edu Tue Feb 10 20:37:51 2015
From: bmoore at genetics.utah.edu (Barry Moore)
Date: Wed, 11 Feb 2015 03:37:51 +0000
Subject: [maker-devel] Using MAKER MPI on Amazon Cloud
In-Reply-To: <1423598085704.ad38b0a2@Nodemailer>
References: <1423598085704.ad38b0a2@Nodemailer>
Message-ID: <695E9440-08A7-4A68-8414-3AD140929635@genetics.utah.edu>

This is very cool Jason - thanks so much for this outstanding contribution to the annotation community! I've added a prominent link to your tutorial on the MAKER wiki front page.

B

From myandell at genetics.utah.edu Tue Feb 10 21:08:25 2015
From: myandell at genetics.utah.edu (Mark Yandell)
Date: Wed, 11 Feb 2015 04:08:25 +0000
Subject: [maker-devel] Using MAKER MPI on Amazon Cloud
In-Reply-To: <695E9440-08A7-4A68-8414-3AD140929635@genetics.utah.edu>
References: <1423598085704.ad38b0a2@Nodemailer>, <695E9440-08A7-4A68-8414-3AD140929635@genetics.utah.edu>
Message-ID: <7A60AB257EFF2B48B1F4C814817EA053E372B1C5@mxb2.hg.genetics.utah.edu>

Thanks so much Jason. Very informative and helpful for everyone. Cheers!

--mark

Mark Yandell
Professor of Human Genetics
H.A. & Edna Benning Presidential Endowed Chair
Co-director USTAR Center for Genetic Discovery
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330
ph:801-587-7707

From scott at scottcain.net Fri Feb 13 12:53:09 2015
From: scott at scottcain.net (Scott Cain)
Date: Fri, 13 Feb 2015 14:53:09 -0500
Subject: [maker-devel] Maker tutorial
In-Reply-To:
References:
Message-ID:

Hi Won,

I'm cc'ing your email to the MAKER mailing list; I'm sure they'll be able to help you find the file. If you want the AMI, I'm reasonably sure the one you're looking for is ami-907e97f8 in US-east.

Scott

On Fri, Feb 13, 2015 at 2:43 PM, Won C Yim wrote:
>
> To whom it may concern,
>
> Hello!
>
> My name is Won Cheol Yim, a postdoc at the University of Nevada, Reno.
>
> I tried to find the maker_tutorial files but I can't.
>
> Here is the web site:
> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014
>
> I just want to get the maker_tutorial folder.
>
> I tried to connect to Amazon EC2, but there's no AMI.
>
> Thank you for your help.
>
> Won
> --
> Yim, Won Cheol
>
> MS330/Department of Biochemistry & Molecular Biology
>
> 1664 N. Virginia Street
>
> University of Nevada, Reno
>
> email: wyim at unr.edu
>
>

--
------------------------------------------------------------------------
Scott Cain, Ph. D. scott at scottcain dot net
GMOD Coordinator (http://gmod.org/) 216-392-3087
Ontario Institute for Cancer Research

From carsonhh at gmail.com Fri Feb 13 16:38:15 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 13 Feb 2015 16:38:15 -0700
Subject: [maker-devel] Maker tutorial
In-Reply-To:
References:
Message-ID:

Yes.
You have to go to the EC2 management console (US East) and search for the AMI -> https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#Images: Change the search options from "Owned by me" to "Public images" before you do the search. Then search for ami-907e97f8. You can see this on the GMOD MAKER course video, where I do this at about the 58 minute timepoint -> http://youtu.be/uA96tSSaqLk Make sure to increase the resolution to 1080p on the video. --Carson > On Feb 13, 2015, at 12:53 PM, Scott Cain wrote: > > Hi Won, > > I'm cc'ing your email to the MAKER mailing list; I'm sure they'll be able to help you find the file. If you want the AMI, I'm reasonably sure the one you're looking for is ami-907e97f8 in US-east. > > Scott > > > > On Fri, Feb 13, 2015 at 2:43 PM, Won C Yim > wrote: > > > > Dear Anyone whom may it concern, > > > > Hello! > > > > My name is Won Cheol Yim, postdoc in University of Nevada,Reno. > > > > I try to find maker_tutorial files but I can't. > > > > Here the online web site. > > http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014 > > > > I just want to get maker_tutorial folder. > > > > I try to connect Amazon EC2 but there's no AMI. > > > > Thank you for your help. > > > > Won > > -- > > Yim, Won Cheol > > > > MS330/Department of Biochemistry & Molecular Biology > > > > 1664 N. Virginia Street > > > > University of Nevada, Reno > > > > email: wyim at unr.edu > > > > > > > > -- > ------------------------------------------------------------------------ > Scott Cain, Ph. D. scott at scottcain dot net > GMOD Coordinator (http://gmod.org/ ) 216-392-3087 > Ontario Institute for Cancer Research > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From kai.kamm at ecolevol.de Wed Feb 18 08:30:23 2015 From: kai.kamm at ecolevol.de (Kai Kamm) Date: Wed, 18 Feb 2015 16:30:23 +0100 Subject: [maker-devel] Improving gene prediction with Augustus and SNAP Message-ID: An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Thu Feb 19 12:28:22 2015 From: carsonhh at gmail.com (Carson Holt) Date: Thu, 19 Feb 2015 12:28:22 -0700 Subject: [maker-devel] Improving gene prediction with Augustus and SNAP In-Reply-To: References: Message-ID: I would recommend just using the Trinity assembly. The Cufflinks results tend to be messy. You shouldn't need the est2genome or protein2genome results if you already trained using CEGMA results. You can then do one MAKER run (can be on just part of the genome) where you use both SNAP and Augustus as the predictors (est2genome and protein2genome should be turned off), and then give these results back to SNAP to train with again. This second round of bootstrap training is usually beneficial to SNAP (beyond two rounds doesn't really help). Also, don't concatenate with previous training sets for the second round of bootstrap training. The idea is that the second round of training genes will be more correct than the first round, so you want to use them instead. When you are done, look at one of the larger contigs in a viewer like Apollo and compare the raw Augustus calls, raw SNAP calls, and the evidence-aware Augustus and SNAP calls produced by MAKER. If SNAP and Augustus are properly trained then they will produce similar calls, and they will also be similar to the evidence-aware calls from MAKER (this convergence is the result of the training). If one predictor seems to produce calls that are still very divergent, then just drop that predictor from the analysis. A bad predictor will make all results worse. 
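The two-round plan described above can be sketched as a small Python helper that lists the control-file settings for each MAKER run. This is only an illustration under stated assumptions: est2genome, protein2genome, snaphmm and augustus_species are option names from the MAKER control file, but the HMM file names (cegma.hmm, snap_round1.hmm) and the species label are hypothetical placeholders, and the helper does not run MAKER or SNAP itself.

```python
def bootstrap_rounds(initial_hmm, augustus_species, rounds=2):
    """Return per-run MAKER settings for SNAP bootstrap training.

    Each run uses SNAP and Augustus as predictors with est2genome and
    protein2genome switched off. After each run SNAP is retrained on that
    run's gene models only (no concatenation with earlier training sets).
    """
    plan = []
    hmm = initial_hmm
    for i in range(1, rounds + 1):
        plan.append({
            "run": i,
            "est2genome": 0,        # evidence-only gene calling turned off
            "protein2genome": 0,
            "snaphmm": hmm,         # HMM from the previous training round
            "augustus_species": augustus_species,
        })
        # Hypothetical file name for the HMM retrained from this run's output.
        hmm = f"snap_round{i}.hmm"
    return plan


if __name__ == "__main__":
    for settings in bootstrap_rounds("cegma.hmm", "my_species"):
        print(settings)
```

The default of two rounds follows the remark that training beyond two rounds rarely helps; the raw and evidence-aware calls still need to be compared by eye in a viewer afterwards.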
--Carson > On Feb 18, 2015, at 8:30 AM, Kai Kamm wrote: > > Hello > I have just started in this field of research and I want to annotate my assembled non-bilaterian invertebrate genome with Maker (100Mb in 7000 scaffolds) . > > I have red the maker tutorials but I am still a little uncertain about the iterative procedure. What I have already done is: > > - trained Augustus (using the web service) on the reference genome of a closely related species and its published dataset of "best transcripts" which are mainly based on gene prediction and some EST evidence. The published ESTs themselves were rejected from Augustus as being not sufficient for training (to few long transcripts). > - trained SNAP with the CEGMA-output of my genome > - assembled RNA-seq data with tophat/cufflinks and generated gff-file with cufflinks2gff > - de novo assembled RNA-seq data with Trinity > > I have already done some preliminary Maker runs with initially trained Augustus, SNAP and some protein evidence which had good results. > > Now my strategy is: > > running maker with > - the est2genome option using the cufflinks gff and the Trinity transcripts as EST evidence > > - the protein2genome option using a protein file including all proteins of the closely related species, a less related non-bilaterian species and a collection of reviewed Swiss-Prot entries from one representative mammal and all protostomes > > - Augustus and SNAP for gene prediction > > When this is done I want to: > > - create 2nd training set for SNAP from the merged gffs with maker2zff > - train Augustus again with the Maker transcripts using the Augustus web service > > And run Maker again > > Is this a reasonable procedure? Or am I missing some important aspects here? > Thanks in advance? 
> _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From marc.hoeppner at imbim.uu.se Fri Feb 20 07:53:38 2015 From: marc.hoeppner at imbim.uu.se (=?utf-8?B?TWFyYyBIw7ZwcG5lcg==?=) Date: Fri, 20 Feb 2015 14:53:38 +0000 Subject: [maker-devel] Single exon EST and UTR annotation? Message-ID: <6F6BEE43-BA1D-4849-9DC7-7DC74984939E@imbim.uu.se> Hi, we are currently annotating a fungus with essentially no introns. Training of augustus was performed on an ?evidence? build and resulting performance of the abinit profile was very good overall. But it seems that maker never makes UTRs for these single exon genes, despite setting: single_exon=1 single_length=100 I can see in the resulting gff files that the cufflinks transcripts were processed through and visual inspection of the entire data set in WebApollo suggests that we should see UTRs for most genes. Yet there are none. (we have used Maker before, so basic usage is familiar, and we have never seen this issue until this project). Maker version is 2.31-8 Is it verified that the single_exon option works? Are there any other not-so-obvious reasons for the behaviour we see here? Cheers, Marc Marc P. Hoeppner, PhD Team Leader Department for Medical Biochemistry and Microbiology Uppsala University, Sweden marc.hoeppner at imbim.uu.se From dence at genetics.utah.edu Fri Feb 20 11:01:06 2015 From: dence at genetics.utah.edu (Daniel Ence) Date: Fri, 20 Feb 2015 18:01:06 +0000 Subject: [maker-devel] Single exon EST and UTR annotation? In-Reply-To: <6F6BEE43-BA1D-4849-9DC7-7DC74984939E@imbim.uu.se> References: <6F6BEE43-BA1D-4849-9DC7-7DC74984939E@imbim.uu.se> Message-ID: Hi Marc, Does Maker annotate the single-exon genes and miss the UTRs or is it missing the single-exon genes entirely? 
There's another option in the control file called "correct_est_fusion". This option was added because of difficulties we encountered in a fungal genome annotation project where the genome had lots of overlapping UTRs in genes. The overlapping UTRs resulted in lots of fused genes, so the solution was to add this option to limit the use of ESTs in annotating genes. Basically, if "correct_est_fusion" is turned on, Maker won't annotate UTRs that would result in fused gene models. Are these single-exon genes really close together? That and the setting that I discussed above could explain the lack of UTRs. ~Daniel > On Feb 20, 2015, at 7:53 AM, Marc Höppner wrote: > > Hi, > > we are currently annotating a fungus with essentially no introns. Training of augustus was performed on an "evidence" build and resulting performance of the abinit profile was very good overall. But it seems that maker never makes UTRs for these single exon genes, despite setting: > > single_exon=1 > single_length=100 > > I can see in the resulting gff files that the cufflinks transcripts were processed through and visual inspection of the entire data set in WebApollo suggests that we should see UTRs for most genes. Yet there are none. > > (we have used Maker before, so basic usage is familiar, and we have never seen this issue until this project). > > Maker version is 2.31-8 > > Is it verified that the single_exon option works? Are there any other not-so-obvious reasons for the behaviour we see here? > > Cheers, > > Marc > > Marc P. 
Hoeppner, PhD > Team Leader > Department for Medical Biochemistry and Microbiology > Uppsala University, Sweden > marc.hoeppner at imbim.uu.se > > > > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org From carsonhh at gmail.com Fri Feb 20 11:07:28 2015 From: carsonhh at gmail.com (Carson Holt) Date: Fri, 20 Feb 2015 11:07:28 -0700 Subject: [maker-devel] Single exon EST and UTR annotation? In-Reply-To: References: <6F6BEE43-BA1D-4849-9DC7-7DC74984939E@imbim.uu.se> Message-ID: <1DB3F36E-CAAE-4394-B6B6-53009E9566B4@gmail.com> Actually, the issue is that single exon genes have a higher threshold to meet to get UTR. Also, MAKER will never add spliced UTR to a single exon gene, so an EST would also have to be single exon and encompass the entire single exon gene to get UTR. This is done because EST and mRNA-seq data in general are noisy enough that you will get mostly false UTR annotations otherwise. So it is an overly conservative approach because it's the best of all the bad options. For spliced genes, the splice site can be used to confirm concordance of the UTR with the gene, but that can't be done with single exon calls. --Carson > On Feb 20, 2015, at 11:01 AM, Daniel Ence wrote: > > Hi Marc, > > Does Maker annotate the single-exon genes and miss the UTRs or is it missing the single-exon genes entirely? > > There's another option in the control file called "correct_est_fusion". This option was added because of difficulties we encountered in a fungal genome annotation project where the genome had lots of overlapping UTRs in genes. The overlapping UTRs resulted in lots of fused genes, so the solution was to add this option to limit the use of ESTs in annotating genes. Basically, if "correct_est_fusion" is turned on, Maker won't annotate UTRs that would result in fused gene models. > > Are these single-exon genes really close together? 
That and the setting that I discussed above could explain the lack of UTRs. > > ~Daniel > > > > > >> On Feb 20, 2015, at 7:53 AM, Marc H?ppner wrote: >> >> Hi, >> >> we are currently annotating a fungus with essentially no introns. Training of augustus was performed on an ?evidence? build and resulting performance of the abinit profile was very good overall. But it seems that maker never makes UTRs for these single exon genes, despite setting: >> >> single_exon=1 >> single_length=100 >> >> I can see in the resulting gff files that the cufflinks transcripts were processed through and visual inspection of the entire data set in WebApollo suggests that we should see UTRs for most genes. Yet there are none. >> >> (we have used Maker before, so basic usage is familiar, and we have never seen this issue until this project). >> >> Maker version is 2.31-8 >> >> Is it verified that the single_exon option works? Are there any other not-so-obvious reasons for the behaviour we see here? >> >> Cheers, >> >> Marc >> >> Marc P. Hoeppner, PhD >> Team Leader >> Department for Medical Biochemistry and Microbiology >> Uppsala University, Sweden >> marc.hoeppner at imbim.uu.se >> >> >> >> _______________________________________________ >> maker-devel mailing list >> maker-devel at box290.bluehost.com >> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org From jgallant at msu.edu Wed Feb 25 09:43:21 2015 From: jgallant at msu.edu (Jason Gallant) Date: Wed, 25 Feb 2015 08:43:21 -0800 (PST) Subject: [maker-devel] Evaluating Genome Annotation Message-ID: <1424882600861.a6109243@Nodemailer> Hi Folks, I'm in the process of evaluating the genome annotation that I produced using AWS (see earlier message).? This is a denovo genome assembly, for which there is no closely related species.? 
As such, I followed the standard procedure using only SNAP (for starters). Genome: 4,668 scaffolds with N50 > 1.7 Mb; custom RepeatMasker database. Round 1: est2genome=1; protein2genome=1; concatenated RefSeq proteins from 3 vertebrates; Cufflinks assembly of 12 tissues (~253,000,000 reads). Using this entire set of genes, I created a SNAP HMM, following the online tutorial, and ran a second round of Maker. Round 2: est2genome=0; protein2genome=0; snaphmm=round1.hmm; concatenated RefSeq proteins from 3 vertebrates; Cufflinks assembly of 12 tissues (~253,000,000 reads). I used the resulting genes to train a second SNAP HMM, as suggested by the tutorial, and ran a third round of Maker. Round 3: est2genome=0; protein2genome=0; snaphmm=round2.hmm; concatenated RefSeq proteins from 3 vertebrates; Cufflinks assembly of 12 tissues (~253,000,000 reads). I'm concerned that the multiple iterations did not really improve my annotation. Here are some of the metrics that I've been able to calculate thus far. Using Fathom: Round 1 contains 44,883 genes (43,364 multi-exon) over 1,410 sequences; Round 2 contains 15,946 genes (15,812 multi-exon) over 1,166 sequences; Round 3 contains 15,514 genes (15,389 multi-exon) over 1,147 sequences. Using the AED_cdf_generator.pl script, I was able to calculate the cumulative AED scores for each round. I suspect that for the first round this distribution is meaningless, since the gene models are calculated directly from evidence. Interestingly, rounds 2 and 3 had remarkably similar AED scores throughout the table: in Round 2, 92% of my genes had an AED score of 0.5 or lower, whereas in Round 3, 91.9% had an AED score of 0.5 or lower. 
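For readers who want to reproduce the "fraction of genes with AED of 0.5 or lower" figure without the Perl script, here is a minimal Python sketch. It is not the AED_cdf_generator.pl implementation; it assumes that MAKER's GFF3 output stores the score on mRNA features as an "_AED=" attribute, and the example lines below are made up for illustration.

```python
import re

# Assumption: the AED score appears as "_AED=<number>" in column 9 of mRNA
# features. The leading (?:^|;) keeps "_eAED=" from being matched instead.
AED_RE = re.compile(r"(?:^|;)_AED=([0-9]*\.?[0-9]+)")


def aed_scores(gff_lines):
    """Collect one AED value per mRNA feature from GFF3 lines."""
    scores = []
    for line in gff_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        match = AED_RE.search(cols[8])
        if match:
            scores.append(float(match.group(1)))
    return scores


def fraction_at_or_below(scores, threshold=0.5):
    """Fraction of transcripts whose AED is <= threshold."""
    if not scores:
        raise ValueError("no AED scores found")
    return sum(1 for s in scores if s <= threshold) / len(scores)


# Made-up example records (hypothetical IDs and coordinates).
example = [
    "##gff-version 3",
    "scaf1\tmaker\tgene\t1\t900\t.\t+\t.\tID=g1",
    "scaf1\tmaker\tmRNA\t1\t900\t.\t+\t.\tID=g1-mRNA-1;Parent=g1;_AED=0.10;_eAED=0.12",
    "scaf1\tmaker\tmRNA\t2000\t2900\t.\t-\t.\tID=g2-mRNA-1;Parent=g2;_AED=0.50;_eAED=0.55",
    "scaf2\tmaker\tmRNA\t50\t700\t.\t+\t.\tID=g3-mRNA-1;Parent=g3;_AED=0.75;_eAED=0.80",
    "scaf2\tmaker\tmRNA\t900\t1500\t.\t+\t.\tID=g4-mRNA-1;Parent=g4;_AED=0.00;_eAED=0.00",
]

if __name__ == "__main__":
    print(fraction_at_or_below(aed_scores(example)))
```

Running the same function on the GFF3 from each round gives directly comparable numbers, which is all the round 2 versus round 3 comparison above relies on.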
Here are my questions: 1) I suppose I would have expected more genes predicted; instead, each iteration seems to produce fewer genes (although there is only a very slight difference between rounds 2 & 3). 2) AED distributions are identical between rounds 2 & 3, though from the tutorials and my readings it sounds like there should be a vast improvement. 3) It strikes me that roughly 75% of the genome is not being used (my suspicion is that scaffolds are smaller than the maker threshold for consideration). Should I lower this threshold so that more of the genome is considered? 4) I realize that after round 1 I generated the HMM based on all of the genes predicted by the evidence, and it seems some have taken the approach of restricting this list to the "best" genes, or using something like CEGMA. Is my method of HMM construction to blame? 5) Am I worried about nothing here? Is this a pretty decent annotation? Thanks for any input you folks are able to provide! Happy annotating! Jason Gallant -- Dr. Jason R. Gallant Assistant Professor Room 38 Natural Sciences Department of Zoology Michigan State University East Lansing, MI 48824 jgallant at msu.edu office: 517-884-7756 -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Wed Feb 25 10:25:30 2015 From: carsonhh at gmail.com (Carson Holt) Date: Wed, 25 Feb 2015 10:25:30 -0700 Subject: [maker-devel] Evaluating Genome Annotation In-Reply-To: <1424882600861.a6109243@Nodemailer> References: <1424882600861.a6109243@Nodemailer> Message-ID: > Here are my questions: > 1) I suppose I would have expected more genes predicted, instead each iteration seems to produce fewer genes (although very slight difference between rounds 2 & 3 Your first round should over-predict, especially if it is based off of Cufflinks results (very noisy). 
Your second and third rounds look about right for many organisms (both should be similar in gene count), but if you believe it is low for yours then run CEGMA to estimate your genome's completeness (i.e. if your genome is 85% complete then you expect your final number from MAKER to represent about 85% of the true number of genes). Also, you may want to increase your protein database. If the RefSeq genes you are using represent just a subset of the 3 vertebrate genomes rather than the whole genomes of those organisms, then you will want to get a couple of full genomes to work with. Also, not having a high completion level genome in vertebrates is not out of the ordinary. In lamprey (an extreme case) the low completion level actually led to the discovery that its cells undergo programmed somatic deletion of about 25% of the genome, and since its genome was sequenced from somatic tissue, that portion was missing from the assembly. > 2) AED distributions are identical between rounds 2 & 3, though it sounds like there should be a vast improvement from the tutorials and my readings That's what you expect. The third round should show just minor improvement (AED is not a highly precise number, so a difference of 1% basically means the second and third round results are identical for evidence support). The real improvement from second round to third round is the quality of the unaided SNAP models (you really only get a sense of this by using Apollo to view a few contigs). Because the MAKER models are derived from evidence-based hints, they will always be similar between runs, but the raw SNAP models in round 3 will be much more like the MAKER models than the unaided SNAP models from round 2. This convergence helps you know that your gene predictor is trained. You may also want to train Augustus and add that to your set of predictors (look for convergence between MAKER, SNAP, and Augustus models to indicate training has worked). 
Augustus generally performs better on vertebrates than SNAP. On some vertebrates you actually have to just drop SNAP completely (SNAP runs very poorly on the human genome, for example). On genomes where you drop SNAP you would just use Augustus (look at evidence alignments and convergence between MAKER/SNAP/Augustus models to make that decision). > 3) It strikes me Roughly 75% of the genome is not being used (my suspicion is that scaffolds are smaller than the maker threshold for consideration). Should I lower this threshold so that more of the genome is considered? The default threshold for consideration is 1 bp. But when you actually run the predictors you will realize that they cannot physically put a multi-exon gene in contigs below about 10 kb in length. So MAKER will run them, but you just won't get any results. > 4) I realize that after round 1 I generated the HMM based on all of the genes predicted by the evidence, and it seems some have taken the approach of restricting this list to the "best" genes, or using something like CEGMA. Is my method of HMM construction to blame? Your HMMs are probably fine (look for a convergence between SNAP raw and MAKER evidence-based models to see if SNAP is behaving well). I think you probably need a better protein database, and perhaps need to improve repeat masking as well (try running RepeatModeler - I can't overstate the importance of this, since repeats can essentially break a gene predictor). Try adding Augustus to the analysis. Also, in general, I've found that Cufflinks-processed evidence is far too noisy and it adversely affects results of annotation. Try processing the transcript data with Trinity instead (you will get better gene models). I doubt additional training of SNAP is necessary. > 5) Am I worried about nothing here? Is this a pretty decent annotation? A reasonable expectation of accuracy for a first draft genome is probably in the upper 70s to high 80s. 
Extremely high quality assemblies with lots of good transcript data might break into the 90s. For example, more than 40% of the genes from the original draft of the mouse genome have since been thrown out over time (http://www.biomedcentral.com/1471-2105/10/67 ). The total gene count has remained similar, but those counts are actually based off of new genes in new locations in the genome. Also, the honeybee genome recently got major improvements in its annotations (50% increase in gene count) after fixing problems with the original assembly and annotation process (http://www.biomedcentral.com/1471-2164/15/86 ). --Carson -------------- next part -------------- An HTML attachment was scrubbed... URL: From jason.r.gallant at gmail.com Wed Feb 25 09:40:40 2015 From: jason.r.gallant at gmail.com (Jason Gallant) Date: Wed, 25 Feb 2015 11:40:40 -0500 Subject: [maker-devel] Evaluating Genome Annotation Message-ID: Hi Folks, I'm in the process of evaluating the genome annotation that I produced using AWS (see earlier message). This is a denovo genome assembly, for which there is no closely related species. As such, I followed the standard procedure using only SNAP (for starters). 
genome=4,668 scaffolds with N50> 1.7mb custom repeatmasker database Round1: est2genome=1 protein2genome=1 Concatenated refseq proteins from 3 vertebrates Cufflinks assembly of 12 tissues (~253,000,000 reads) Using this entire set of genes, I created a SNAP HMM, following the online tutorial, and ran a second round of Maker: Round 2: est2genome=0 protein2genome=0 snaphmm=round1.hmm Concatenated refseq proteins from 3 vertebrates Cufflinks assembly of 12 tissues (~253,000,000 reads) I used the resulting genes to train a second SNAP HMM, as suggested by the tutorial and ran a third round of Maker: Round 2: est2genome=0 protein2genome=0 snaphmm=round2.hmm Concatenated refseq proteins from 3 vertebrates Cufflinks assembly of 12 tissues (~253,000,000 reads) I'm concerned that the multiple iterations did not really improve my annotation. Here are some of the metrics that I've been able to calculate thus far: Using Fathom: Round 1 contains 44,883 genes (43364 multi-exon) over 1410 sequences Round 2 contains 15,946 15,812 multi-exon) over 1166 sequences Round 3 contains 15,514 genes during round 3 (15,389 were multi-exon) over 1147 sequences Using the AED_cdf_generator.pl script, I was able to calculate the cumulative AED scores for each round. I suspect on the first round, this distribution is meaningless since the gene models are calculated directly from evidence. Interestingly, rounds 2 and 3 had remarkably similar AED scores throughout the table, in Round 2, 92% of my genes had an AED score of 0.5 or lower, whereas in round 3, 91.9% had an AED score of 0.5 or lower. 
Here are my questions: 1) I suppose I would have expected more genes predicted, instead each iteration seems to produce fewer genes (although very slight difference between rounds 2 & 3 2) AED distributions are identical between rounds 2 & 3, though it sounds like there should be a vast improvement from the tutorials and my readings 3) It strikes me Roughly 75% of the genome is not being used (my suspicion is that scaffolds are smaller than the maker threshold for consideration). Should I lower this threshold so that more of the genome is considered? 4) I realize that after round 1 I generated the HMM based on all of the genes predicted by the evidence, and it seems some have taken the approach of restricting this list to the "best" genes, or using something like CEGMA. Is my method of HMM construction to blame? 5) Am I worried about nothing here? Is this a pretty decent annotation? Thanks for any input you folks are able to provide! Happy annotating! Jason Gallant -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Mon Feb 2 09:35:59 2015 From: carsonhh at gmail.com (Carson Holt) Date: Mon, 2 Feb 2015 09:35:59 -0700 Subject: [maker-devel] genome duplication? In-Reply-To: References: Message-ID: <54B3C36E-5537-4C53-9E2C-F9D60914E786@gmail.com> MAKER requires every gene to have at least some evidence support. This is very important for most most eukaryotes as false positive predictions will dominate what is called by snap/augustus. However, it is not such a large problem in fungi because of their high gene density and less frequent introns. Setting keep_preds=1 will maximize sensitivity at the cost of specificity (bad idea in most eukaryotes, but not so much in fungi). I would not be surprised if a bias toward sensitivity is used by most fungi annotation projects with every gene that can be annotated being annotated (even if it does increase false positives). It is a tactic that can work at least in fungi. 
Also, if the assembly is fragmented, you will be less likely to have evidence support for all genes, because the evidence alignments will not meet the % coverage thresholds in the maker_bopts.ctl file. You may want to separate out your shorter contigs and annotate them separately with more relaxed thresholds for pcov_blast=, pid_blast=, ep_score_limit=, and en_score_limit=.

--Carson

> On Jan 31, 2015, at 4:21 PM, Jason Stajich wrote:
>
> Xabier -
> FYI - though you probably already compared, those stats are on par with the Hortaea v1 assembly (we do have an improved Hortaea assembly now, and the genome size is still in the same range, supporting the duplication hypothesis).
> Hw version 1 assembly -
> N50 9623; Max 71563
> CEGMA for Hw1
>            #Prots  %Completeness  -  #Total  Average  %Ortho
>
> Complete      196          79.03  -     498     2.54   81.12
> Partial       228          91.94  -     673     2.95   95.18
>
> Mikael - yes - we should compare notes on the models JGI is calling which have little support in MAKER. I am not sure if their pipeline runs with augustus/snap using informant hints, though usually they are bringing RNAseq into the mix. I don't know if your approach for reannotation assembled the RNAseq and used it as evidence?
>
> We'll be trying to assess some of this when comparing the proportion of shared genes in the first 1KFG paper, so we may be able to say with more certainty whether these extra predictions are shared more widely, and get a handle on singleton/false positive rates.
>
> Jason
>
> Jason Stajich
> jason.stajich at gmail.com
>
> On Sat, Jan 31, 2015 at 12:51 AM, Xabier Vázquez Campos wrote:
> Thanks Mikael,
>
> These are the assembly stats as taken from abyss-fac. Indeed it isn't a great N50, but it isn't that bad either.
>
> n      n:500  n:N50  min  N80   N50    N20    E-size  max     sum
> 14277  7099   1185   500  4698  10771  20438  14530   154519  42.68e6
>
> 2015-01-31 19:42 GMT+11:00 Mikael Brandström Durling:
> Hi Xabier,
>
>> On 31 Jan 2015, at 05:48, Xabier Vázquez Campos wrote:
>>
>> Hi all,
>>
>> One of the fungal genomes I'm annotating is relatively shattered (?), with many contigs/scaffolds, and based on CEGMA analysis alone may indicate a potential widespread duplication of the genome.
>>
>> # Statistics of the completeness of the genome based on 248 CEGs #
>>            #Prots  %Completeness  -  #Total  Average  %Ortho
>>
>> Complete      181          72.98  -     365     2.02   67.40
>> Partial       230          92.74  -     528     2.30   77.83
>
> Judging from these figures, you seem to have a very fragmented assembly? What N50 have you reached? In my experience, assemblies with an N50 below 5-10 times the average gene length tend to give problems in producing good gene sets. That is not to say that the gene sets are unusable, but for comparing e.g. gene complements to other species, it will be hard to draw any conclusions when a high proportion of the genes are incomplete.
>
>> The expected genome size is relatively low (~42 Mb by abyss-fac) in comparison with Hortaea werneckii (51.6 Mb, 23333 genes), a related fungus with nearly 90% of its genes present in at least two copies.
>> Paper: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0071328
>>
>> Now to the Maker part... As part of the Maker annotation, I trained SNAP and Augustus, and I generated a specific RepeatModeler library. I recorded the predicted outputs from each Maker run (AED, number of predicted proteins and transcripts...). Both Augustus and SNAP gave quite high numbers (~19000 and ~23000 respectively) in comparison with the xxx.all.maker.proteins.fasta (about 13600). So, my first question is, how does maker deal with gene duplications? Or is this just a phenomenon given that there is no support from the protein files provided initially to Maker? I've used 4 different protein files for the annotation; could it be that they weren't the best choices? I picked them from the closest relatives and similar environments.
>
> Unless you by mistake filter out duplicated gene families as repeats with RepeatModeler, maker should not care about duplicated genes. However, maker, without keep_preds=1, reports only genes with some kind of support (be it EST or protein homology). This is rather conservative, but if you enable keep_preds, you will get more genes, as you have noted. Just for the sake of comparison, I have reannotated more than ten genomes downloaded from JGI, providing MAKER with similar evidence as JGI, and consistently MAKER reports fewer gene models. I have yet to do a more thorough comparison to tell what genes JGI are reporting that don't appear in the MAKER annotations.
>
>> So, in my last run I turned on keep_preds=1 and the proteins in the xxx.all.maker.proteins.fasta reached to
>>
>> Last question, regarding the protein files. I downloaded the annotated genomes from the JGI, and most of them have two annotation folders, "All_models,_Filtered_and_Not" and "Filtered_Models___best__". I've been using the protein files found in the latter, as I expected them to have real evidence and a lower chance of predicting false genes. Am I right?
>
> Yes, I would say so. The FilteredModels have passed through their model selection pipeline, while all_models contains models from all predictors, as well as combinations of predictors and EST evidence.
>
> Just some 2 cents of observations of mine,
> cheers,
> Mikael
>
>> Thank you in advance,
>>
>> Xabier
>>
>> --
>> Xabier Vázquez Campos
>> PhD Candidate
>> Water Research Centre
>> School of Civil and Environmental Engineering
>> The University of New South Wales
>> Sydney NSW 2052 AUSTRALIA
>> _______________________________________________
>> maker-devel mailing list
>> maker-devel at box290.bluehost.com
>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From carsonhh at gmail.com Mon Feb 2 09:40:06 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 2 Feb 2015 09:40:06 -0700
Subject: [maker-devel] How to improve the result of Maker
In-Reply-To:
References: <492A6635-67E9-4700-B544-E137C4248E55@gmail.com>
Message-ID: <1F69B446-8899-41BE-BFB8-5DA61BB359A8@gmail.com>

When you add a new exon, Apollo will always recalculate the reading frame to take the longest ORF, so even though the first exon might not be the same, the other exons don't allow for a longer ORF either. So the ORF you got was the longest possible given any combination of all exons (even if the first exon would have been made UTR). That confirms my suspicion that that particular exon was ignored because it breaks any possible reading frame. It likely contains an assembly error.
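To illustrate the reading-frame argument above, here is a minimal Python sketch. It is not Apollo's or MAKER's actual code; the function name and the toy exon sequences are assumptions made up for this example. It scans the three forward frames of a spliced transcript for the longest ATG-to-stop ORF, and shows how a single-base deletion in one exon (the kind of error a homopolymer miscall might introduce) can leave no complete ORF at all.

```python
STOPS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return (start, end) of the longest forward-strand ATG..stop ORF, or None.

    Coordinates are 0-based, end-exclusive, and include the stop codon.
    """
    seq = seq.upper()
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon seen in this frame
            elif start is not None and codon in STOPS:
                if best is None or (i + 3 - start) > (best[1] - best[0]):
                    best = (start, i + 3)
                start = None  # look for the next ORF in this frame
    return best


# Hypothetical three-exon transcript (toy sequences, not real data).
intact = "ATGGCC" + "GAAACC" + "TTTTAA"
# Same transcript with one base missing from the middle exon.
shifted = "ATGGCC" + "GAACC" + "TTTTAA"

print(longest_orf(intact))   # ORF spanning all three exons
print(longest_orf(shifted))  # no complete ORF survives the frameshift
```

In a real gene the effect is usually less extreme (a shorter ORF rather than none), which is why a predictor may prefer to drop the offending exon when another model is tenable.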

--Carson

> On Jan 31, 2015, at 8:54 AM, ??? wrote:
>
> There are two possibilities. Given how different the snap and augustus models are from one another, this would suggest they have not been trained appropriately (for example, if you are picking another related organism's parameter file rather than training these programs, there are several assumptions being made that can actually make such an approach almost worse than just picking a parameter file at random). But more likely the evidence-supported exon breaks the reading frame of the model. This usually indicates that you have an assembly error (possibly issues with homopolymers). No amount of evidence support will allow you to call an exon that generates a mis-sense causing frameshift, so the predictors do the next most reasonable thing - they drop the exon if another model is tenable. More concerning would be the mRNA-seq alignments near the 3' end of the gene call. The structure suggests significant capture of background transcription with the mRNA-seq reads (long UTRs with weird mini-introns). I would suggest not using cufflinks in this case. You should probably go with an assembly-based approach for the mRNA-seq reads instead. I would suggest using trinity. It will reduce sensitivity but greatly increase evidence specificity, which is where you need the most improvement based on these images. I would also suggest using the jaccard_clip option with trinity.
>
> I would further suggest looking at the model in question using apollo, and manually adding the exon (click and drag it into the model). You can examine the reading frame after adding the exon and see if it is in fact a frameshift assembly error. If it's a homopolymer-derived frameshift, then you can expect a lot more of these throughout your assembly.
>
> I dragged the exon into the model; there is a stop codon in it, which causes the region behind it to become UTR, here:
>
> The exon in question is indicated by the red arrow.
> But the uppermost evidence is the complete EST from NCBI, and it contains start and stop codons. Then I noticed that the 5' boundary of the 2nd exon in the model is not the same as in the EST, so it causes a frameshift, which produces the stop codon in the exon indicated by the red arrow. The first exon should not be CDS, as there would be a start codon in the 2nd exon if its 5' boundary were predicted correctly. Would "always_complete=1" fix it?
>
> I will try to use trinity.
>
> Also I do not see any protein alignments here? MAKER cannot work on transcript evidence alone. You need to provide the full proteome of at least two other species (they don't have to be that closely related, but closer is better). Protein alignments will also help you better interpret the coding status of exons supported by mRNA-seq. For example, in the second image, you would expect protein evidence to support all the coding exons but not the UTR exons, which would remove any doubt as to whether an exon is really UTR or not.
>
> I did use 3 sources of protein evidence: one is the proteome from a related species, one is the proteome from fruit fly, and the last one is Swiss-Prot.
>
> Thank you very much!
>
> Best regards,
> Wenbo
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From xvazquezc at gmail.com Mon Feb 2 14:49:02 2015
From: xvazquezc at gmail.com (Xabier Vázquez Campos)
Date: Tue, 3 Feb 2015 08:49:02 +1100
Subject: [maker-devel] genome duplication?
In-Reply-To: <54B3C36E-5537-4C53-9E2C-F9D60914E786@gmail.com>
References: <54B3C36E-5537-4C53-9E2C-F9D60914E786@gmail.com>
Message-ID:

Thanks Carson. Any suggestion on the size limit to separate the short contigs, e.g. <500 bp?

On 03/02/2015 3:36 AM, "Carson Holt" wrote:

> MAKER requires every gene to have at least some evidence support. This is
> very important for most eukaryotes as false positive predictions will
> dominate what is called by snap/augustus.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From carsonhh at gmail.com Mon Feb 2 14:50:02 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 2 Feb 2015 14:50:02 -0700
Subject: [maker-devel] genome duplication?
In-Reply-To:
References: <54B3C36E-5537-4C53-9E2C-F9D60914E786@gmail.com>
Message-ID: <441998AE-D660-485F-BAFD-44BD50765156@gmail.com>

Anything less than 10 kb.

--Carson

> On Feb 2, 2015, at 2:49 PM, Xabier Vázquez Campos wrote:
>
> Thanks Carson. Any suggestion on the size limit to separate the short contigs, e.g. <500 bp?
>
> On 03/02/2015 3:36 AM, "Carson Holt" wrote:
> MAKER requires every gene to have at least some evidence support. This is very important for most eukaryotes, as false positive predictions will dominate what is called by snap/augustus. However, it is not such a large problem in fungi because of their high gene density and less frequent introns. Setting keep_preds=1 will maximize sensitivity at the cost of specificity (bad idea in most eukaryotes, but not so much in fungi). I would not be surprised if a bias toward sensitivity is used by most fungal annotation projects, with every gene that can be annotated being annotated (even if it does increase false positives). It is a tactic that can work, at least in fungi.
>
> Also if the assembly is fragmented, you will be less likely to have evidence support for all genes, as the evidence alignments will not meet the % coverage thresholds in the maker_bopts.ctl file. You may want to separate out your shorter contigs and annotate them separately with more relaxed thresholds for pcov_blast=, pid_blast=, ep_score_limit=, and en_score_limit=.
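
The split suggested in this thread (contigs under 10 kb set aside for a separate run with relaxed thresholds) could be sketched in Python roughly as follows. This is only an illustration, not part of MAKER: the helper names and any file names are hypothetical, and the 10 kb default simply mirrors the figure given above.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_by_length(records, min_len=10_000):
    """Partition records into (long, short) lists around min_len bases."""
    long_recs, short_recs = [], []
    for header, seq in records:
        target = long_recs if len(seq) >= min_len else short_recs
        target.append((header, seq))
    return long_recs, short_recs


def write_fasta(records, path, width=60):
    """Write (header, sequence) pairs to path, wrapping sequence lines."""
    with open(path, "w") as out:
        for header, seq in records:
            out.write(f">{header}\n")
            for i in range(0, len(seq), width):
                out.write(seq[i:i + width] + "\n")
```

A typical use would be to open the assembly (say, a hypothetical genome.fasta), pass the file handle to read_fasta, call split_by_length on the result, and write the two lists to separate files, each of which then gets its own MAKER run and its own copy of maker_bopts.ctl.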
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From jgallant at msu.edu Tue Feb 3 11:13:13 2015
From: jgallant at msu.edu (Jason Gallant)
Date: Tue, 03 Feb 2015 10:13:13 -0800 (PST)
Subject: [maker-devel] Est2Genome Problems
Message-ID: <1422987193321.4df3c9d5@Nodemailer>

Hi Folks,

I've nearly succeeded at getting MAKER to run on AWS. I've been checking the output files, and have noticed that none of my RNAseq data was incorporated in the run. I used Cufflinks to perform alignments of libraries from several tissues, ran the accessory script cufflinks2gff3 for each tissue, then concatenated the resulting gff3 files. I even ran the accessory script gff3merge to check that the resulting file was properly formatted. For options, I set est2genome=1 and est_gff=cufflinks.gff.
I only get protein2genome and repeatmasker evidence in my resulting maker gff3 file, and the genes predicted by these. Is there another option that I need to enable in order to use my est_gff file? I'm trying to get a set of genes to train the predictors for my next step.

Any help would (as always) be greatly appreciated!

Best,
Jason Gallant

Dr. Jason R. Gallant
Assistant Professor
Room 38 Natural Sciences
Department of Zoology
Michigan State University
East Lansing, MI 48824
jgallant at msu.edu
office: 517-884-7756
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From avhoeck at SCKCEN.BE Thu Feb 5 07:37:27 2015
From: avhoeck at SCKCEN.BE (Van Hoeck Arne)
Date: Thu, 5 Feb 2015 14:37:27 +0000
Subject: [maker-devel] Maker-P with RNAseq (denovo assembly or mapping to reference?)
Message-ID: <9BCA01D5BDC2AF46822CA182B4FBD0DF117763B0@MAILSRV3.sck.be>

Dear,

I have read the manuscript on the MAKER-P tool. I'm finishing the last part of my plant genome assembly, and MAKER-P will be used to determine gene annotations. I have read the wiki tutorials and your manuscript, but one question comes to mind about evidence sources when RNAseq data is available for gene discovery (as in our case):

Why do you perform a de novo RNA assembly with individual RNAseq runs? As I understand from your paper ("a tool kit for the rapid creation, ..."), the RNAseq reads were short single reads. This implies that it is not easy for de novo RNA assemblers (like trinity) to find high-quality genes, since a lot of reads will be lost because of the low coverage. If 2*200 PE reads were used for RNAseq, this would create more genes.

Therefore, as an alternative: why not map the RNAseq reads to your reference genome? Extract this as a fasta file and cut off all the reads shorter than 150 bp. If you concatenate all the RNAseq data into one file before the mapping, all your expressed RNA is mapped and included in the fasta file that you need for gene discovery.

In my opinion, the low-expressed reads will be lost during the assembly. Or am I overlooking something?
We have sequenced our genome with PE Illumina data. For DE analyses, we did 48 single-read 50 bp runs, at lower coverage.
Therefore, is assembling the RNAseq data still better than mapping the data to our reference genome?

PS: can I add a question on the google group? I couldn't start a new topic.

Thanks in advance,
Arne Van Hoeck
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From carsonhh at gmail.com Thu Feb 5 09:27:41 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 5 Feb 2015 09:27:41 -0700
Subject: [maker-devel] Maker-P with RNAseq (denovo assembly or mapping to reference?)
In-Reply-To: <9BCA01D5BDC2AF46822CA182B4FBD0DF117763B0@MAILSRV3.sck.be>
References: <9BCA01D5BDC2AF46822CA182B4FBD0DF117763B0@MAILSRV3.sck.be>
Message-ID: <3A5B068C-C6F2-479C-8874-0C0C0F008FBD@gmail.com>

There is no requirement or even specification in MAKER that transcript data be sequenced or processed in any specific way. How this is done is completely dependent on the user and the project (different strategies work better in different organisms). The only requirement for MAKER is that you have some form of transcript evidence. The utility of transcript evidence is primarily for identification of introns and splice sites. If evidence is supplied as FASTA sequence, then it will be aligned around splice sites using exonerate (a splice-aware aligner). If the evidence is processed via another tool, you are expected to supply it in GFF3 format with the evidence already aligned around splice sites (examples of this include cufflinks and BLAT). How you generate these FASTA files or GFF3 files is up to you.
Remember that final models are not strictly based on the transcript data (it's just a piece of the puzzle). Rather, final models will be generated from the combination of transcript alignments, protein alignments, and the HMMs from algorithms such as SNAP and Augustus, which must have been trained on the species being annotated.

With respect to the most effective methods for processing mRNA-seq data, that depends on many factors. But in general, read mapping to the genome will result in greater sensitivity but lower specificity. There will be a lot of false alignments and alignments of background transcription (>98% of the genome is expressed at detectable levels, so it is not just the genes). Assembly-based methods, on the other hand, result in very high specificity but lower sensitivity. Assembly-based methods can also resolve overlapping UTRs and genes overlapping on opposite strands better than the alignment-based methods (trinity's jaccard_clip option, for example). A loss in sensitivity can also be compensated for with matched protein data and well-trained HMMs. Which analysis methods are more effective will be determined by factors such as gene density, average intron lengths, repeat content, etc. You can try both methods and see which appears to work better for your organism. Both have their trade-offs.

For sequencing, because of the physics involved in RNA folding, fragmentation of the transcriptome improves sensitivity but can also preclude the use of long paired-end reads (because the fragments end up being too short). Short reads also reduce specificity and can align more randomly to the genome (but the gain in sensitivity from fragmentation usually far outweighs the loss of specificity from the short reads). Larger fragments of RNA, or no fragmentation of the RNA, allow for longer reads and paired-end sequencing but will result in lower sensitivity and recovery of the transcriptome.
This is because RNA folding, which occurs on larger fragments, physically blocks reverse transcriptase. You will also end up recovering mostly the 5' end of genes, because the inhibition of reverse transcriptase results in a 5' bias in the sequencing reaction. The read coverage graphs end up looking like ski slopes, and longer genes will not get coverage on the 3' end. There are different strategies for transcriptome sequencing and sample preparation to deal with strandedness and sequencing bias. Each has its trade-offs, and which strategy you use depends on what you plan to use the sequenced reads for, as well as the structure of that organism's genome and transcriptome.

--Carson

> On Feb 5, 2015, at 7:37 AM, Van Hoeck Arne wrote:
>
> Dear,
>
> I have read the manuscript on the MAKER-P tool. I'm finishing the last part of my plant genome assembly, and MAKER-P will be used to determine gene annotations. I have read the wiki tutorials and your manuscript, but one question comes to mind about evidence sources when RNAseq data is available for gene discovery (as in our case):
>
> Why do you perform a de novo RNA assembly with individual RNAseq runs? As I understand from your paper ("a tool kit for the rapid creation, ..."), the RNAseq reads were short single reads. This implies that it is not easy for de novo RNA assemblers (like trinity) to find high-quality genes, since a lot of reads will be lost because of the low coverage. If 2*200 PE reads were used for RNAseq, this would create more genes.
>
> Therefore, as an alternative: why not map the RNAseq reads to your reference genome? Extract this as a fasta file and cut off all the reads shorter than 150 bp. If you concatenate all the RNAseq data into one file before the mapping, all your expressed RNA is mapped and included in the fasta file that you need for gene discovery.
>
> In my opinion, the low-expressed reads will be lost during the assembly. Or am I overlooking something?
> We have sequenced our genome with PE Illumina data. For DE analyses, we did 48 single-read 50 bp runs, at lower coverage.
> Therefore, is assembling the RNAseq data still better than mapping the data to our reference genome?
>
> PS: can I add a question on the google group? I couldn't start a new topic.
>
> Thanks in advance,
> Arne Van Hoeck
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From carsonhh at gmail.com Thu Feb 5 10:22:12 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 5 Feb 2015 10:22:12 -0700
Subject: [maker-devel] Maker-P with RNAseq (denovo assembly or mapping to reference?)
In-Reply-To: <9BCA01D5BDC2AF46822CA182B4FBD0DF11776439@MAILSRV3.sck.be>
References: <9BCA01D5BDC2AF46822CA182B4FBD0DF117763B0@MAILSRV3.sck.be> <3A5B068C-C6F2-479C-8874-0C0C0F008FBD@gmail.com> <9BCA01D5BDC2AF46822CA182B4FBD0DF11776439@MAILSRV3.sck.be>
Message-ID: <19D25327-4D46-44B4-854B-1BEEFBD23C06@gmail.com>

I find that erring on the side of specificity works better for most annotation projects. But this is not always true, and you can try a few large contigs using an alignment approach like cufflinks and compare it to an assembly approach like trinity to decide which appears to perform better. You also need to take into account the ultimate goal of the project. Some projects want to annotate absolutely everything and don't care about false positives, while others want to maximize specificity and care more about avoiding bad models.
Often this has to do with some planned downstream experiment that would be adversely affected by one or the other. I tend to prefer high specificity because MAKER's automated approach to re-annotation means that if evidence ever presents itself later on that a real gene is missing, then that evidence automatically supports inclusion of the gene in the next automated release of the genome. But false models tend to persist and are harder to get rid of even though they lack any evidence support. These false models produced by sensitivity-focused approaches then tend to poison downstream experiments and lead to more time being wasted by researchers. This is seen a lot in plant genomes where transposons and pseudogenes tend to pollute genome releases for historical reasons. Basically, once they are in a genome release, the burden of proof for removing them becomes higher than if they had never been included in the first place. Researchers unaware of this may find they have been studying a transposon for weeks or months because some expression or variant analysis early on listed it as a candidate gene for some desired phenotype.

MAKER can handle several hundred thousand contigs in the assembly, but in general contigs smaller than 10kb will not be annotatable (although smaller contigs can be used for gene-dense organisms with short introns). It is better to exclude these short contigs from the analysis for processing efficiency.

--Carson

> On Feb 5, 2015, at 9:52 AM, Van Hoeck Arne wrote:
>
> Thanks for this comprehensive and clear answer, Carson.
>
> So in conclusion, it's better to make a concise file with very accurate transcripts (assembly method) instead of a large file of possible transcripts (mapping RNAseq data to the reference) which contains more false positives. 
> > Another small question, can MAKER handle a lot of contigs (around 10.000) or is it better to make artificial chromosomes by pasting contigs to each other with an certain number N?s (let s say 1000 > exon length). > > Thanks a lot for your quick response > Arne > From: Carson Holt [mailto:carsonhh at gmail.com ] > Sent: donderdag 5 februari 2015 17:28 > To: Van Hoeck Arne > Cc: maker-devel at yandell-lab.org > Subject: Re: [maker-devel] Maker-P with RNAseq (denovo assembly or mapping to reference?) > > There is no requirement or even specification in MAKER that transcript data be sequenced or processed in any specific way. How this is done is completely dependent on the user and the project (different strategies work better in different organisms). The only requirement for MAKER is that you have some form of transcript evidence. The utility of transcript evidence is primarily for identification of introns and splice sites. If evidence is supplied as fasta sequence, then it will be aligned around splice sites using exonerate (a splice aware aligner). If the evidence is processed via another tool, you are expected to supply it in GFF3 format with the evidence already aligned around splice sites (examples of this include cufflinks and BLAT). How you generate these FASTA files or GFF3 files is up to you. Remember that final models are not strictly based on the transcript data (It?s just of piece of the puzzle). Rather final models will be generated from the combination of transcript alignments, protein alignments, and the HMMs from algorithms such as SNAP and Augustus that must have been trained on the species being annotated. > > With respect to most effective methods from processing mRNA-seq data, that depends on many factors. But in general read mapping to the genome will result in greater sensitivity but lower specificity. 
There will be a lot of false alignments and alignments of background transcription (>98% of the genome is expressed at detectable levels, so it is not just the genes). Assembly based methods on the other hand result in very high specificity, but lower sensitivity. Assembly based methods can also better resolve issues of overlapping UTR and genes overlapping on opposite strand than the alignment based methods (trinity?s jaccard_clip option for example). A loss in sensitivity can also be compensated for with matched protein data and well trained HMMs. > > What analysis methods are are more effective will be determined by factors such as gene density, average intron lengths, repeat content, etc. You can try both methods and see which appears to work better for your organism. Both have their trade offs. > > For sequencing, because of the physics involved in RNA folding, fragmentation of the transcriptome improves sensitivity but can also precludes the use of long paired end reads (this is because fragments end up being too short). Short reads also reduce specificity and can align more randomly to the genome (but the gain in sensitivity from fragmentation usually far outweighs the loss of specificity from the short reads). Larger fragments of RNA or no fragmentation of the RNA allows for longer reads and paired end sequencing but will result in lower sensitivity and recovery of the transcriptome. This is because RNA folding which occurs on larger fragments physically blocks reverse transcriptase. You will also end up recovering mostly the 5? end of genes because the inhibition of reverse transcriptase results in a 5? bias to the sequencing reaction. The read coverage graphs end up looking like ski slopes, and longer genes will not get coverage on the 3? end. > > There are different strategies for transcriptome sequencing and sample preparation to deal with strandedness and sequencing bias. 
Each has it?s trade off, and what strategy you use depends on what you plan on using the sequenced reads for as well as the structure of that organisms genome and trascriptome. > > ?Carson > > > On Feb 5, 2015, at 7:37 AM, Van Hoeck Arne > wrote: > > Dear, > > I have read the manuscript on the MAKER-P tool. I?m finishing the last part of my plant genome assembly and MAKER-P will be used for determine gene annotations. I have read the wiki tutorials and your manuscript but one thing comes to my mind on evidence sources when RNAseq data is available for gene discovery (like we have). : > > Why do you perform a RNA denovo assembly with individual RNAseq runs. As I could understand from your paper (a tool kit for the raped creation,?), the RNAseq reads were short single reads. This includes that it is not easy for denovo RNA assemblers (like trinity) to find high quality genes since a lot of reads will be lost because of the low coverage. If 2*200 PE reads were used for RNAseq, this would create more genes. > > Therefore, as an alternative: why not mapping the RNAseq reads to your reference genome. Extract this as a fasta file and cutoff all the reads shorter than 150 bp long. If you concentate all the RNAseq data to one file before the mapping, all your expressed RNA is mapped and included in the fasta file that you need for gene discovery. > > From my opinion, the low-expressed reads will be lost during the assembly. Or am I overlooking something? > We have sequenced our genome with PE illumina data. For DE analyses, we did run 48 single reads 50 bp , lower coverage. > Therefore, is assembling the RNAseq data still better than mapping the data to our reference genome? > > PS: can I add a question on the google group? I couldn?t start a new topic > > Thanks in advance, > Arne Van Hoeck > > > > > Consider the environment before you print > Denk aan het milieu voor u deze e-mail print > Pensez ? 
l'environnement avant d'imprimer
>
> SCK•CEN Disclaimer: http://www.sckcen.be/en/e-mail_disclaimer

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From chenwenbo1020 at gmail.com Mon Feb 9 16:20:34 2015
From: chenwenbo1020 at gmail.com (陈文博)
Date: Mon, 9 Feb 2015 18:20:34 -0500
Subject: [maker-devel] How to set "calculate longest ORF" in Maker
Message-ID: 

Greetings,

I noticed some cases in the output of Maker where the ORF is not the longest one, e.g. the one below:

[image: inline image 1]

If I manually correct it in Apollo with "calculate longest ORF", then it becomes:

[image: inline image 2]

I think the updated one makes more sense. So how can I make Maker output the longest ORF automatically?

Thanks very much!

Wenbo

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 4523 bytes
Desc: not available
URL: 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 4351 bytes
Desc: not available
URL: 

From dence at genetics.utah.edu Mon Feb 9 17:14:45 2015
From: dence at genetics.utah.edu (Daniel Ence)
Date: Tue, 10 Feb 2015 00:14:45 +0000
Subject: [maker-devel] How to set "calculate longest ORF" in Maker
In-Reply-To: 
References: 
Message-ID: 

Hi,

In the images that you sent, it looks like the ab-initio predictor had predicted two ORFs, while the evidence supported a single model. MAKER doesn't have an option to prefer longer models; its metric is to choose the prediction that is best supported by the alignment evidence.

How many ab-initio predictors did you use in generating the results that you sent us? 
It looks like you only used one, which won?t give good results. ~Daniel > On Feb 9, 2015, at 4:20 PM, ??? wrote: > > Greetings, > > I notices some cases in the output of Maker, that the ORF is not the longest one, > e.g. the one below > > > If I manually correct it in Apollo as "calculate longes ORF", then it become > > I thought the updated one should make more sense. So how to let Maker output the longest ORF automatically? > > Thanks very much! > > Wenbo > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org From chenwenbo1020 at gmail.com Mon Feb 9 19:06:31 2015 From: chenwenbo1020 at gmail.com (=?UTF-8?B?6ZmI5paH5Y2a?=) Date: Mon, 9 Feb 2015 21:06:31 -0500 Subject: [maker-devel] How to set "calculate longest ORF" in Maker In-Reply-To: References: Message-ID: Hi Daniel, Thank you very much for suggestion. I used three predictors, SNAP, Augustus and pred_gff. >From the image, I thought the prediction of Augustus(color is pink) matched the evidence better, why maker choose SNAP's ? [image: ???? 1] Thanks again Best regards, Wenbo 2015-02-09 19:14 GMT-05:00 Daniel Ence : > Hi, In the images that you sent, it looks like the ab-initio predictor had > predicted two ORF?s, while the evidence supported a single model. MAKER > doesn?t have an option to prefer longer models; it?s metric is to choose > the prediction that is best supported by the alignment evidence. > > How many ab-initio predictors did you use in generating the results that > you sent us? It looks like you only used one, which won?t give good results. > > ~Daniel > > > > On Feb 9, 2015, at 4:20 PM, ??? wrote: > > > > Greetings, > > > > I notices some cases in the output of Maker, that the ORF is not the > longest one, > > e.g. 
the one below > > > > > > If I manually correct it in Apollo as "calculate longes ORF", then it > become > > > > I thought the updated one should make more sense. So how to let Maker > output the longest ORF automatically? > > > > Thanks very much! > > > > Wenbo > > _______________________________________________ > > maker-devel mailing list > > maker-devel at box290.bluehost.com > > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image.png Type: image/png Size: 8160 bytes Desc: not available URL: From carsonhh at gmail.com Mon Feb 9 19:22:46 2015 From: carsonhh at gmail.com (Carson Holt) Date: Mon, 9 Feb 2015 19:22:46 -0700 Subject: [maker-devel] How to set "calculate longest ORF" in Maker In-Reply-To: References: Message-ID: <669148EB-B537-4004-B323-D119B1056269@gmail.com> The gene model from maker is restricted to use the reading frame of the ab initio predictor. The better model would use a different reading frame. The augustus model has a missing exon so gets a lower score. Snap in general just looks bad. I'd say it needs to be retrained or maybe just drop Snap from the analysis. --Carson Sent from my iPhone > On Feb 9, 2015, at 7:06 PM, ??? wrote: > > Hi Daniel, > > Thank you very much for suggestion. > > I used three predictors, SNAP, Augustus and pred_gff. > From the image, I thought the prediction of Augustus(color is pink) matched the evidence better, why maker choose SNAP's ? > > > > Thanks again > Best regards, > Wenbo > > 2015-02-09 19:14 GMT-05:00 Daniel Ence : >> Hi, In the images that you sent, it looks like the ab-initio predictor had predicted two ORF?s, while the evidence supported a single model. MAKER doesn?t have an option to prefer longer models; it?s metric is to choose the prediction that is best supported by the alignment evidence. 
>> >> How many ab-initio predictors did you use in generating the results that you sent us? It looks like you only used one, which won?t give good results. >> >> ~Daniel >> >> >> > On Feb 9, 2015, at 4:20 PM, ??? wrote: >> > >> > Greetings, >> > >> > I notices some cases in the output of Maker, that the ORF is not the longest one, >> > e.g. the one below >> > >> > >> > If I manually correct it in Apollo as "calculate longes ORF", then it become >> > >> > I thought the updated one should make more sense. So how to let Maker output the longest ORF automatically? >> > >> > Thanks very much! >> > >> > Wenbo >> > _______________________________________________ >> > maker-devel mailing list >> > maker-devel at box290.bluehost.com >> > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Tue Feb 10 09:56:53 2015 From: carsonhh at gmail.com (Carson Holt) Date: Tue, 10 Feb 2015 09:56:53 -0700 Subject: [maker-devel] Est2Genome Problems In-Reply-To: <1422987193321.4df3c9d5@Nodemailer> References: <1422987193321.4df3c9d5@Nodemailer> Message-ID: <119684F8-8071-4318-A129-3D90EC54242A@gmail.com> I ran a few est2genome runs with a cufflinks file i just generated and did not get any issues for EST based gene models. I?d like to at least have your test set to see if I can duplicate what you are seeing. Use this to upload the job files then I can just run it from my server here ?> http://weatherby.genetics.utah.edu/cgi-bin/mwas/bug.cgi ?Carson > On Feb 3, 2015, at 11:13 AM, Jason Gallant wrote: > > Hi Folks, > > I?ve nearly succeeded at getting MAKER to run on AWS? I?ve been checking the output files, and have noticed that none of my RNAseq data was incorporated on the run. 
I used Cufflinks to perform alignments of libraries from several tissues, ran the accessory script cufflinks2gff3 for each tissue, then concatenated the resulting gff3 files. I even ran the accessory script gff3merge to check that the resulting file was properly formatted. > > For options, I set est2genome=1 and est_gff=cufflinks.gff. I only get protein2genome and repeatmasker evidence in my resulting maker gff3 file, and the genes predicted by these. Is there another option that I need to enable in order to use my est_gff file? I?m trying to get a set of genes to train the predictors for my next step. > > Any help would (as always) be greatly appreciated! > > Best, > Jason Gallant > > ? > Dr. Jason R. Gallant > Assistant Professor > Room 38 Natural Sciences > Department of Zoology > Michigan State University > East Lansing, MI 48824 > jgallant at msu.edu > office: 517-884-7756 > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From jason.r.gallant at gmail.com Tue Feb 10 12:54:46 2015 From: jason.r.gallant at gmail.com (Jason Gallant) Date: Tue, 10 Feb 2015 11:54:46 -0800 (PST) Subject: [maker-devel] Using MAKER MPI on Amazon Cloud Message-ID: <1423598085704.ad38b0a2@Nodemailer> Hi Folks, My experiments with AWS and MAKER have proven successful. ?I?ve been working on documenting the experiment on my blog, which is now published here: http://efish.zoology.msu.edu/running-maker-genome-annotation-on-starcluster/ Please let me know if you find this useful, have any questions or feedback! Best Regards, Jason Gallant ? Dr. Jason R. 
GallantAssistant Professor Room 38 Natural Sciences Department of Zoology Michigan State University East Lansing, MI 48824 jgallant at msu.edu office: 517-884-7756 -------------- next part -------------- An HTML attachment was scrubbed... URL: From jgallant at msu.edu Tue Feb 10 13:03:40 2015 From: jgallant at msu.edu (Jason Gallant) Date: Tue, 10 Feb 2015 12:03:40 -0800 (PST) Subject: [maker-devel] Using MAKER MPI on Amazon Cloud Message-ID: <1423598620212.6519c2e@Nodemailer> Hi Folks, My experiments with AWS and MAKER have proven successful.? I?ve been working on documenting the experiment on my blog, which is now published here: http://efish.zoology.msu.edu/running-maker-genome-annotation-on-starcluster/ Please let me know if you find this useful, have any questions or feedback! Best Regards, Jason Gallant ? Dr. Jason R. GallantAssistant Professor Room 38 Natural Sciences Department of Zoology Michigan State University East Lansing, MI 48824 jgallant at msu.edu office: 517-884-7756 -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Tue Feb 10 13:04:15 2015 From: carsonhh at gmail.com (Carson Holt) Date: Tue, 10 Feb 2015 13:04:15 -0700 Subject: [maker-devel] Using MAKER MPI on Amazon Cloud In-Reply-To: <1423598085704.ad38b0a2@Nodemailer> References: <1423598085704.ad38b0a2@Nodemailer> Message-ID: <8D30117C-88DF-4170-9CD8-590AAB79594D@gmail.com> This is awesome. Thanks for going through all the pain of figuring that out. I am definitely going to have to try annotating something through AWS now just to see how it compares to running on a local cluster. ?Carson > On Feb 10, 2015, at 12:54 PM, Jason Gallant wrote: > > Hi Folks, > > My experiments with AWS and MAKER have proven successful. 
I?ve been working on documenting the experiment on my blog, which is now published here: > > http://efish.zoology.msu.edu/running-maker-genome-annotation-on-starcluster/ > > Please let me know if you find this useful, have any questions or feedback! > > Best Regards, > Jason Gallant > > ? > Dr. Jason R. Gallant > Assistant Professor > Room 38 Natural Sciences > Department of Zoology > Michigan State University > East Lansing, MI 48824 > jgallant at msu.edu > office: 517-884-7756 > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org From bmoore at genetics.utah.edu Tue Feb 10 20:37:51 2015 From: bmoore at genetics.utah.edu (Barry Moore) Date: Wed, 11 Feb 2015 03:37:51 +0000 Subject: [maker-devel] Using MAKER MPI on Amazon Cloud In-Reply-To: <1423598085704.ad38b0a2@Nodemailer> References: <1423598085704.ad38b0a2@Nodemailer> Message-ID: <695E9440-08A7-4A68-8414-3AD140929635@genetics.utah.edu> This is very cool Jason - thanks so much for this outstanding contribution to the annotation community! I?ve added prominent link to your tutorial off of the MAKER wiki front page. B > On Feb 10, 2015, at 12:54 PM, Jason Gallant wrote: > > Hi Folks, > > My experiments with AWS and MAKER have proven successful. I?ve been working on documenting the experiment on my blog, which is now published here: > > http://efish.zoology.msu.edu/running-maker-genome-annotation-on-starcluster/ > > Please let me know if you find this useful, have any questions or feedback! > > Best Regards, > Jason Gallant > > ? > Dr. Jason R. 
Gallant > Assistant Professor > Room 38 Natural Sciences > Department of Zoology > Michigan State University > East Lansing, MI 48824 > jgallant at msu.edu > office: 517-884-7756 > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org From myandell at genetics.utah.edu Tue Feb 10 21:08:25 2015 From: myandell at genetics.utah.edu (Mark Yandell) Date: Wed, 11 Feb 2015 04:08:25 +0000 Subject: [maker-devel] Using MAKER MPI on Amazon Cloud In-Reply-To: <695E9440-08A7-4A68-8414-3AD140929635@genetics.utah.edu> References: <1423598085704.ad38b0a2@Nodemailer>, <695E9440-08A7-4A68-8414-3AD140929635@genetics.utah.edu> Message-ID: <7A60AB257EFF2B48B1F4C814817EA053E372B1C5@mxb2.hg.genetics.utah.edu> Thanks so much Jason. Very informative and helpful for everyone. Cheers! --mark Mark Yandell Professor of Human Genetics H.A. & Edna Benning Presidential Endowed Chair Co-director USTAR Center for Genetic Discovery Eccles Institute of Human Genetics University of Utah 15 North 2030 East, Room 2100 Salt Lake City, UT 84112-5330 ph:801-587-7707 ________________________________________ From: Barry Moore Sent: Tuesday, February 10, 2015 8:37 PM To: Jason Gallant; Mark Yandell Cc: maker-devel at yandell-lab.org Subject: Re: [maker-devel] Using MAKER MPI on Amazon Cloud This is very cool Jason - thanks so much for this outstanding contribution to the annotation community! I?ve added prominent link to your tutorial off of the MAKER wiki front page. B > On Feb 10, 2015, at 12:54 PM, Jason Gallant wrote: > > Hi Folks, > > My experiments with AWS and MAKER have proven successful. I?ve been working on documenting the experiment on my blog, which is now published here: > > http://efish.zoology.msu.edu/running-maker-genome-annotation-on-starcluster/ > > Please let me know if you find this useful, have any questions or feedback! 
> > Best Regards, > Jason Gallant > > ? > Dr. Jason R. Gallant > Assistant Professor > Room 38 Natural Sciences > Department of Zoology > Michigan State University > East Lansing, MI 48824 > jgallant at msu.edu > office: 517-884-7756 > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org From scott at scottcain.net Fri Feb 13 12:53:09 2015 From: scott at scottcain.net (Scott Cain) Date: Fri, 13 Feb 2015 14:53:09 -0500 Subject: [maker-devel] Maker tutorial In-Reply-To: References: Message-ID: Hi Won, I'm cc'ing your email to the MAKER mailing list; I'm sure they'll be able to help you find the file. If you want the AMI, I'm reasonably sure the one you're looking for is ami-907e97f8 in US-east. Scott On Fri, Feb 13, 2015 at 2:43 PM, Won C Yim wrote: > > Dear Anyone whom may it concern, > > Hello! > > My name is Won Cheol Yim, postdoc in University of Nevada,Reno. > > I try to find maker_tutorial files but I can?t. > > Here the online web site. > http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014 > > I just want to get maker_tutorial folder. > > I try to connect Amazon EC2 but there?s no AMI. > > Thank you for your help. > > Won > -- > Yim, Won Cheol > > MS330/Department of Biochemistry & Molecular Biology > > 1664 N. Virginia Street > > University of Nevada, Reno > > email: wyim at unr.edu > > -- ------------------------------------------------------------------------ Scott Cain, Ph. D. scott at scottcain dot net GMOD Coordinator (http://gmod.org/) 216-392-3087 Ontario Institute for Cancer Research -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Fri Feb 13 16:38:15 2015 From: carsonhh at gmail.com (Carson Holt) Date: Fri, 13 Feb 2015 16:38:15 -0700 Subject: [maker-devel] Maker tutorial In-Reply-To: References: Message-ID: Yes. 
You have to go to the EC2 management console (US East) and search for the AMI --> https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#Images:

Change the search options from "Owned by me" to "Public images" before you do the search. Then search for ami-907e97f8

You can see this in the GMOD MAKER course video, where I do this at about the 58 minute timepoint --> http://youtu.be/uA96tSSaqLk

Make sure to increase the resolution to 1080p on the video.

--Carson

-------------- next part --------------
An HTML attachment was scrubbed... 
URL: 

From kai.kamm at ecolevol.de Wed Feb 18 08:30:23 2015
From: kai.kamm at ecolevol.de (Kai Kamm)
Date: Wed, 18 Feb 2015 16:30:23 +0100
Subject: [maker-devel] Improving gene prediction with Augustus and SNAP
Message-ID: 

An HTML attachment was scrubbed...
URL: 

From carsonhh at gmail.com Thu Feb 19 12:28:22 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 19 Feb 2015 12:28:22 -0700
Subject: [maker-devel] Improving gene prediction with Augustus and SNAP
In-Reply-To: 
References: 
Message-ID: 

I would recommend just using the Trinity assembly. The cufflinks results tend to be messy. You shouldn't need the est2genome or protein2genome results if you already trained using CEGMA results. You can then do one MAKER run (it can be on just part of the genome) where you use both SNAP and Augustus as the predictors (est2genome and protein2genome should be turned off), and then give these results back to SNAP to train with again. This second round of bootstrap training is usually beneficial to SNAP (going beyond two rounds doesn't really help). Also, don't concatenate with previous training sets for the second round of bootstrap training. The idea is that the second round of training genes will be more correct than the first round, so you want to use them instead.

When you are done, look at one of the larger contigs in a viewer like Apollo and compare the raw Augustus calls, raw SNAP calls, and the evidence-aware Augustus and SNAP calls produced by MAKER. If SNAP and Augustus are properly trained, then they will produce similar calls, and they will also be similar to the evidence-aware calls from MAKER (this convergence is the result of the training). If one predictor seems to produce calls that are still very divergent, then just drop that predictor from the analysis. A bad predictor will make all results worse. 
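
[Editor's sketch, not part of the original message] The control-file changes for the bootstrap run described above (both predictors on, est2genome and protein2genome off) could be applied with a small helper like the one below. This is only a minimal sketch under stated assumptions: it assumes the control file uses the usual `key=value #comment` layout, the function name `set_ctl_options` is hypothetical, and the values `round1_snap.hmm` and `my_species` are placeholders rather than anything from this thread. The option names `snaphmm` and `augustus_species` are taken to be the standard maker_opts.ctl keys for the two predictors.

```python
import re


def set_ctl_options(ctl_text, updates):
    """Return control-file text with the given key=value options replaced.

    Any trailing '#comment' on an edited line is preserved. Raises KeyError
    if a requested option is not present, so typos do not pass silently.
    """
    remaining = dict(updates)
    out = []
    for line in ctl_text.splitlines():
        match = re.match(r"^(\w+)=([^#]*)(#.*)?$", line)
        if match and match.group(1) in remaining:
            key = match.group(1)
            comment = match.group(3) or ""
            new_line = f"{key}={remaining.pop(key)}"
            if comment:
                new_line += " " + comment
            out.append(new_line)
        else:
            # Section headers, blank lines and untouched options pass through.
            out.append(line)
    if remaining:
        raise KeyError(f"options not found in control file: {sorted(remaining)}")
    return "\n".join(out) + "\n"


# Hypothetical excerpt of a control file, used here instead of a real file.
example_ctl = (
    "est2genome=1 #infer gene predictions directly from ESTs\n"
    "protein2genome=1 #infer predictions from protein homology\n"
    "snaphmm= #SNAP HMM file\n"
    "augustus_species= #Augustus gene prediction species model\n"
)

# Settings for the second (bootstrap) run: predictors on, direct inference off.
second_round = set_ctl_options(
    example_ctl,
    {
        "est2genome": 0,
        "protein2genome": 0,
        "snaphmm": "round1_snap.hmm",      # placeholder file name
        "augustus_species": "my_species",  # placeholder species name
    },
)
print(second_round)
```

Keeping the edit as a function over text, rather than editing the file in place, makes it easy to write one control file per training round and compare them afterwards.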
--Carson

> On Feb 18, 2015, at 8:30 AM, Kai Kamm wrote:
>
> Hello
> I have just started in this field of research and I want to annotate my assembled non-bilaterian invertebrate genome with Maker (100 Mb in 7000 scaffolds).
>
> I have read the Maker tutorials but I am still a little uncertain about the iterative procedure. What I have already done is:
>
> - trained Augustus (using the web service) on the reference genome of a closely related species and its published dataset of "best transcripts", which are mainly based on gene prediction and some EST evidence. The published ESTs themselves were rejected by Augustus as not sufficient for training (too few long transcripts).
> - trained SNAP with the CEGMA output of my genome
> - assembled RNA-seq data with tophat/cufflinks and generated a GFF file with cufflinks2gff3
> - de novo assembled RNA-seq data with Trinity
>
> I have already done some preliminary Maker runs with the initially trained Augustus, SNAP and some protein evidence, which gave good results.
>
> Now my strategy is:
>
> running Maker with
> - the est2genome option, using the cufflinks GFF and the Trinity transcripts as EST evidence
> - the protein2genome option, using a protein file including all proteins of the closely related species, a less related non-bilaterian species and a collection of reviewed Swiss-Prot entries from one representative mammal and all protostomes
> - Augustus and SNAP for gene prediction
>
> When this is done I want to:
>
> - create a 2nd training set for SNAP from the merged GFFs with maker2zff
> - train Augustus again with the Maker transcripts using the Augustus web service
>
> And run Maker again.
>
> Is this a reasonable procedure? Or am I missing some important aspects here?
> Thanks in advance.
> _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From marc.hoeppner at imbim.uu.se Fri Feb 20 07:53:38 2015 From: marc.hoeppner at imbim.uu.se (=?utf-8?B?TWFyYyBIw7ZwcG5lcg==?=) Date: Fri, 20 Feb 2015 14:53:38 +0000 Subject: [maker-devel] Single exon EST and UTR annotation? Message-ID: <6F6BEE43-BA1D-4849-9DC7-7DC74984939E@imbim.uu.se> Hi, we are currently annotating a fungus with essentially no introns. Training of augustus was performed on an ?evidence? build and resulting performance of the abinit profile was very good overall. But it seems that maker never makes UTRs for these single exon genes, despite setting: single_exon=1 single_length=100 I can see in the resulting gff files that the cufflinks transcripts were processed through and visual inspection of the entire data set in WebApollo suggests that we should see UTRs for most genes. Yet there are none. (we have used Maker before, so basic usage is familiar, and we have never seen this issue until this project). Maker version is 2.31-8 Is it verified that the single_exon option works? Are there any other not-so-obvious reasons for the behaviour we see here? Cheers, Marc Marc P. Hoeppner, PhD Team Leader Department for Medical Biochemistry and Microbiology Uppsala University, Sweden marc.hoeppner at imbim.uu.se From dence at genetics.utah.edu Fri Feb 20 11:01:06 2015 From: dence at genetics.utah.edu (Daniel Ence) Date: Fri, 20 Feb 2015 18:01:06 +0000 Subject: [maker-devel] Single exon EST and UTR annotation? In-Reply-To: <6F6BEE43-BA1D-4849-9DC7-7DC74984939E@imbim.uu.se> References: <6F6BEE43-BA1D-4849-9DC7-7DC74984939E@imbim.uu.se> Message-ID: Hi Marc, Does Maker annotate the single-exon genes and miss the UTRs or is it missing the single-exon genes entirely? 
There's another option in the control file called "correct_est_fusion". This option was added because of difficulties we encountered in a fungal genome annotation project where the genome had lots of overlapping UTRs in genes. The overlapping UTRs resulted in lots of fused genes, so the solution was to add this option to limit the use of ESTs in annotating genes. Basically, if "correct_est_fusion" is turned on, Maker won't annotate UTRs that would result in fused gene models.

Are these single-exon genes really close together? That and the setting that I discussed above could explain the lack of UTRs.

~Daniel

> On Feb 20, 2015, at 7:53 AM, Marc Höppner wrote:
>
> Hi,
>
> we are currently annotating a fungus with essentially no introns. Training of augustus was performed on an "evidence" build and the resulting performance of the abinit profile was very good overall. But it seems that maker never makes UTRs for these single exon genes, despite setting:
>
> single_exon=1
> single_length=100
>
> I can see in the resulting gff files that the cufflinks transcripts were processed through, and visual inspection of the entire data set in WebApollo suggests that we should see UTRs for most genes. Yet there are none.
>
> (we have used Maker before, so basic usage is familiar, and we have never seen this issue until this project).
>
> Maker version is 2.31-8
>
> Is it verified that the single_exon option works? Are there any other not-so-obvious reasons for the behaviour we see here?
>
> Cheers,
>
> Marc
>
> Marc P. 
Hoeppner, PhD > Team Leader > Department for Medical Biochemistry and Microbiology > Uppsala University, Sweden > marc.hoeppner at imbim.uu.se > > > > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org From carsonhh at gmail.com Fri Feb 20 11:07:28 2015 From: carsonhh at gmail.com (Carson Holt) Date: Fri, 20 Feb 2015 11:07:28 -0700 Subject: [maker-devel] Single exon EST and UTR annotation? In-Reply-To: References: <6F6BEE43-BA1D-4849-9DC7-7DC74984939E@imbim.uu.se> Message-ID: <1DB3F36E-CAAE-4394-B6B6-53009E9566B4@gmail.com> Actually , the issue is that single exon genes have a higher threshold to meet to get UTR. Also MAKER will never add spliced UTR to a single exon gene, so an EST would also have to be single exon and encompass the entire single exon gene to get UTR. This is done because EST and mRNA-seq data in general is noisy enough that you will get mostly false UTR annotations otherwise. So it is an overly conservative approach because it?s the best of all the bad options. For spliced genes, the splice site can be used to confirm concordance of the UTR with the gene, but that can?t be done with single exon calls. ?Carson > On Feb 20, 2015, at 11:01 AM, Daniel Ence wrote: > > Hi Marc, > > Does Maker annotate the single-exon genes and miss the UTRs or is it missing the single-exon genes entirely? > > There?s another option in the control file called ?correct_est_fusion?. This option was added because of difficulties we encountered in a fungal genome annotation project where the genome had lots of overlapping UTRs in genes. The overlapping UTRs resulted in lots of fused genes, so the solution was to add this option to limit the use of ESTs in annotation genes. Basically, if ?correct_est_fusion? is turned on, Maker won?t annotate UTRs that would result in fused gene models. > > Are these single-exon genes really close together? 
That and the setting that I discussed above could explain the lack of UTRs. > > ~Daniel > > > > > >> On Feb 20, 2015, at 7:53 AM, Marc H?ppner wrote: >> >> Hi, >> >> we are currently annotating a fungus with essentially no introns. Training of augustus was performed on an ?evidence? build and resulting performance of the abinit profile was very good overall. But it seems that maker never makes UTRs for these single exon genes, despite setting: >> >> single_exon=1 >> single_length=100 >> >> I can see in the resulting gff files that the cufflinks transcripts were processed through and visual inspection of the entire data set in WebApollo suggests that we should see UTRs for most genes. Yet there are none. >> >> (we have used Maker before, so basic usage is familiar, and we have never seen this issue until this project). >> >> Maker version is 2.31-8 >> >> Is it verified that the single_exon option works? Are there any other not-so-obvious reasons for the behaviour we see here? >> >> Cheers, >> >> Marc >> >> Marc P. Hoeppner, PhD >> Team Leader >> Department for Medical Biochemistry and Microbiology >> Uppsala University, Sweden >> marc.hoeppner at imbim.uu.se >> >> >> >> _______________________________________________ >> maker-devel mailing list >> maker-devel at box290.bluehost.com >> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org From jgallant at msu.edu Wed Feb 25 09:43:21 2015 From: jgallant at msu.edu (Jason Gallant) Date: Wed, 25 Feb 2015 08:43:21 -0800 (PST) Subject: [maker-devel] Evaluating Genome Annotation Message-ID: <1424882600861.a6109243@Nodemailer> Hi Folks, I'm in the process of evaluating the genome annotation that I produced using AWS (see earlier message).? This is a denovo genome assembly, for which there is no closely related species.? 
As such, I followed the standard procedure using only SNAP (for starters).

genome=4,668 scaffolds with N50 > 1.7 Mb
custom repeatmasker database

Round 1:
est2genome=1
protein2genome=1
Concatenated refseq proteins from 3 vertebrates
Cufflinks assembly of 12 tissues (~253,000,000 reads)

Using this entire set of genes, I created a SNAP HMM, following the online tutorial, and ran a second round of Maker:

Round 2:
est2genome=0
protein2genome=0
snaphmm=round1.hmm
Concatenated refseq proteins from 3 vertebrates
Cufflinks assembly of 12 tissues (~253,000,000 reads)

I used the resulting genes to train a second SNAP HMM, as suggested by the tutorial, and ran a third round of Maker:

Round 3:
est2genome=0
protein2genome=0
snaphmm=round2.hmm
Concatenated refseq proteins from 3 vertebrates
Cufflinks assembly of 12 tissues (~253,000,000 reads)

I'm concerned that the multiple iterations did not really improve my annotation. Here are some of the metrics that I've been able to calculate thus far:

Using Fathom:
Round 1 contains 44,883 genes (43,364 multi-exon) over 1410 sequences
Round 2 contains 15,946 genes (15,812 multi-exon) over 1166 sequences
Round 3 contains 15,514 genes (15,389 multi-exon) over 1147 sequences

Using the AED_cdf_generator.pl script, I was able to calculate the cumulative AED scores for each round. I suspect that for the first round this distribution is meaningless, since the gene models are calculated directly from evidence. Interestingly, rounds 2 and 3 had remarkably similar AED scores throughout the table: in round 2, 92% of my genes had an AED score of 0.5 or lower, whereas in round 3, 91.9% had an AED score of 0.5 or lower.
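For readers who want to check a figure like "92% of genes at AED 0.5 or lower" without the Perl script, here is a minimal Python sketch. It is not a reimplementation of AED_cdf_generator.pl. It assumes the MAKER GFF3 stores each transcript's score as an `_AED=` attribute on `mRNA` lines, and the function names and the toy lines in the usage note are made up for illustration.

```python
import re
from typing import Iterable, List

# Matches the _AED attribute but not _eAED, which sits in the same column.
AED_PATTERN = re.compile(r"(?:^|;)_AED=([0-9]*\.?[0-9]+)")


def mrna_aed_scores(gff_lines: Iterable[str]) -> List[float]:
    """Collect the _AED value from every mRNA feature line of a GFF3 file."""
    scores = []
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence follows; no more feature lines
        if line.startswith("#"):
            continue  # comment or directive
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        match = AED_PATTERN.search(cols[8])
        if match:
            scores.append(float(match.group(1)))
    return scores


def fraction_at_or_below(scores: List[float], threshold: float) -> float:
    """Fraction of transcripts whose AED is <= threshold (0.0 if no scores)."""
    if not scores:
        return 0.0
    return sum(1 for s in scores if s <= threshold) / len(scores)
```

Usage would look something like `fraction_at_or_below(mrna_aed_scores(open(path)), 0.5)`, where `path` points at the merged GFF3 for one round. Running it on rounds 2 and 3 side by side gives the same kind of comparison described above.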
Here are my questions:

1) I suppose I would have expected more genes predicted; instead each iteration seems to produce fewer genes (although there is only a very slight difference between rounds 2 & 3).
2) AED distributions are identical between rounds 2 & 3, though from the tutorials and my readings it sounds like there should be a vast improvement.
3) It strikes me that roughly 75% of the genome is not being used (my suspicion is that scaffolds are smaller than the maker threshold for consideration). Should I lower this threshold so that more of the genome is considered?
4) I realize that after round 1 I generated the HMM based on all of the genes predicted by the evidence, and it seems some have taken the approach of restricting this list to the "best" genes, or using something like CEGMA. Is my method of HMM construction to blame?
5) Am I worried about nothing here? Is this a pretty decent annotation?

Thanks for any input you folks are able to provide!

Happy annotating!

Jason Gallant

Dr. Jason R. Gallant
Assistant Professor
Room 38 Natural Sciences
Department of Zoology
Michigan State University
East Lansing, MI 48824
jgallant at msu.edu
office: 517-884-7756

From carsonhh at gmail.com Wed Feb 25 10:25:30 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 25 Feb 2015 10:25:30 -0700
Subject: [maker-devel] Evaluating Genome Annotation
In-Reply-To: <1424882600861.a6109243@Nodemailer>
References: <1424882600861.a6109243@Nodemailer>
Message-ID:

> Here are my questions:
> 1) I suppose I would have expected more genes predicted; instead each iteration seems to produce fewer genes (although there is only a very slight difference between rounds 2 & 3).

Your first round should over-predict, especially if it is based on cufflinks results (very noisy).
Your second and third rounds look about right for many organisms (both should be similar in gene count), but if you believe it is low for yours then run CEGMA to estimate your genome's completeness (i.e. if your genome is 85% complete then you expect your final number from MAKER to represent about 85% of the true number of genes). Also you may want to increase your protein database. If the refseq genes you are using represent just a subset of the 3 vertebrate genomes rather than the whole genomes of those organisms, then you will want to get a couple of full genomes to work with. Also, not having a high completion level genome is not out of the ordinary for vertebrates. In lamprey (an extreme case) the low completion level actually led to the discovery that its cells undergo programmed somatic deletion of about 25% of the genome, and since its genome was sequenced from somatic tissue, that portion was obviously missing from the assembly.

> 2) AED distributions are identical between rounds 2 & 3, though from the tutorials and my readings it sounds like there should be a vast improvement.

That's what you expect. The third round should show just minor improvement (AED is not a highly precise number, so a difference of 1% basically means the second and third round results are identical for evidence support). The real improvement from second round to third round is the quality of the unaided SNAP models (you really only get a sense of this by using apollo to view a few contigs). Because the MAKER models are derived from evidence based hints, they will always be similar between runs, but the raw SNAP models in round 3 will be much more like the MAKER models than the unaided SNAP models from round 2. This convergence helps you know that your gene predictor is trained. You may also want to train Augustus and add that to your set of predictors (look for convergence between MAKER, SNAP, and Augustus models to indicate training has worked).
Augustus generally performs better on vertebrates than SNAP. On some vertebrates you actually have to just drop SNAP completely (SNAP runs very poorly on the human genome for example). On genomes where you drop SNAP you would just use Augustus (look at evidence alignments and convergence between MAKER/SNAP/Augustus models to make that decision).

> 3) It strikes me that roughly 75% of the genome is not being used (my suspicion is that scaffolds are smaller than the maker threshold for consideration). Should I lower this threshold so that more of the genome is considered?

The default threshold for consideration is 1 bp. But when you actually run the predictors you will realize that they cannot physically put a multi exon gene in contigs below about 10 kb in length. So MAKER will run them, but you just won't get any results.

> 4) I realize that after round 1 I generated the HMM based on all of the genes predicted by the evidence, and it seems some have taken the approach of restricting this list to the "best" genes, or using something like CEGMA. Is my method of HMM construction to blame?

Your HMMs are probably fine (look for a convergence between SNAP raw and MAKER evidence based models to see if SNAP is behaving well). I think you probably need a better protein database, and perhaps need to improve repeat masking as well (try running RepeatModeler; I can't overstate the importance of this, since repeats can essentially break a gene predictor). Try adding Augustus to the analysis. Also in general, I've found that cufflinks processed evidence is far too noisy and it adversely affects results of annotation. Try processing the transcript data with Trinity instead (you will get better gene models). I doubt additional training of SNAP is necessary.

> 5) Am I worried about nothing here? Is this a pretty decent annotation?

A reasonable expectation of accuracy for a first draft genome is probably in the upper 70s to high 80s.
Extremely high quality assemblies with lots of good transcript data might break into the 90s. For example, more than 40% of the genes from the original draft of the mouse genome have since been thrown out over time (http://www.biomedcentral.com/1471-2105/10/67 ). The total gene count has remained similar, but those counts are actually based on new genes in new locations in the genome. Also the honeybee genome recently got major improvements in its annotations (50% increase in gene count) after fixing problems with the original assembly and annotation process (http://www.biomedcentral.com/1471-2164/15/86 ).

--Carson
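On question 3 in this thread (how much of the genome is "not being used"), one quick check is to count how many scaffolds, and how many bases, sit at or above a length where a multi-exon gene could plausibly fit. The sketch below is only an illustration. The 10,000 bp default is taken from the rough "about 10 kb" figure mentioned in the thread, not from any MAKER setting, and the function names are hypothetical.

```python
from typing import Dict, Iterable


def scaffold_lengths(fasta_lines: Iterable[str]) -> Dict[str, int]:
    """Return {scaffold name: sequence length} from FASTA-formatted lines."""
    lengths: Dict[str, int] = {}
    name = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            # Fall back to a placeholder if the header has no name.
            name = parts[0] if parts else f"unnamed_{len(lengths)}"
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def summarize(lengths: Dict[str, int], min_len: int = 10_000) -> Dict[str, float]:
    """Count scaffolds >= min_len and the share of all bases they hold."""
    total = sum(lengths.values())
    long_enough = [n for n in lengths.values() if n >= min_len]
    return {
        "scaffolds": len(lengths),
        "scaffolds_at_or_above": len(long_enough),
        "fraction_of_bases": sum(long_enough) / total if total else 0.0,
    }
```

If most scaffolds fall under the cutoff but hold only a small share of the bases, then a low count of annotated scaffolds does not by itself mean that most of the genome was skipped.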