From kai.kamm at ecolevol.de  Thu Mar  5 10:47:02 2015
From: kai.kamm at ecolevol.de (Kai Kamm)
Date: Thu, 05 Mar 2015 17:47:02 +0100
Subject: [maker-devel] Better resolve conflicting gene models
Message-ID: <54F88886.9010004@ecolevol.de>

Hello,

thanks for your previous advice. (Btw, how can one reply to an existing thread such that the reply will be added to the same thread?)

I am trying to find the best parameters for annotating my genome with Maker. I have run Maker with several combinations of parameters and predictors on my three biggest scaffolds and looked at the results in JBrowse. Overall most predictions seem fine, but there are some genes with conflicts and I have no idea why.

I have:
- 100 Mb assembled genome
- Trinity RNA-seq assembly
- Cufflinks data (in my case they don't seem to be messy as suggested, rather a good complement to the Trinity data)
- protein evidence (related and unrelated species)
- repeat library from RepeatModeler

Gene predictors used:
- Augustus, trained with transcripts from a related species: seems to perform fine.
- SNAP: no convergence with Augustus even after the second training. I dropped it because it predicted lots of additional low-quality transcripts and sometimes disrupted final Maker transcripts.
- GeneMark: converged with Augustus after training (introns taken from TopHat2 output). It tends to predict some additional transcripts compared to Augustus. Few (but some) of these are covered by evidence and thus become final Maker transcripts.

So the combination of Augustus and GeneMark seems optimal. In general both perform well in Maker and tend to predict the same transcripts. However, I still observe some behavior of Maker which I don't understand:

Example 1: One of the predictors predicts a small additional exon at the start which is also covered by protein or EST data. But sometimes Maker chooses the other predictor's model for the final transcript. Mostly these are minor differences, but I don't understand this behavior.

Example 2: There are some extreme cases, like an Augustus prediction with 17 exons which are all covered by Trinity and Cufflinks isoforms. GeneMark instead predicts two separate small genes with 2 and 4 exons respectively. The resulting final transcript has 7 exons, and the additional evidence from the Trinity and Cufflinks data is treated as UTR.

So I thought Augustus seemed a little more accurate and ran Maker with Augustus only to resolve such conflicts, even though I would lose the few additional transcripts from GeneMark. This is what happened:
- The gene in Example 2 now has all 17 exons. This is good!
- Sadly, another gene with several exons, which was formerly predicted by both Augustus and GeneMark and is also covered by Cufflinks and Trinity transcripts, now consists of only two small exons in the final transcript, even though Augustus still predicts the same exons and the same evidence is present. Only the GeneMark prediction, which was almost identical to the Augustus one, is absent. This I don't understand at all.

I don't worry about the minor differences. The extreme cases are something like two genes in a hundred, and I don't understand the behavior. I was thinking that in case of conflicting models Maker would choose the one that best fits the evidence. Obviously this is what happens with most conflicts, because the majority of the final models look OK. But not in the cases mentioned above, and I don't understand why. Is there any parameter I missed to better resolve such conflicts?
Best

From bmoore at genetics.utah.edu  Thu Mar  5 18:20:52 2015
From: bmoore at genetics.utah.edu (Barry Moore)
Date: Fri, 6 Mar 2015 00:20:52 +0000
Subject: [maker-devel] Maker Software Question
In-Reply-To: <6ECE70F32BE356478EC060940C2AA4D101175BBF02@CVMMB02.cvm.tamu.edu>
References: <6ECE70F32BE356478EC060940C2AA4D101175BBF02@CVMMB02.cvm.tamu.edu>
Message-ID:

Hi Chris,

I don't know if others have responded to you already, but I just found this e-mail unanswered in the depths of my inbox, so apologies for the lack of a reply. I'm going to forward your e-mail to the full MAKER mailing list so that you'll get the benefit of a response from the full group of MAKER developers.

MAKER does include a couple of scripts that add functional annotation to MAKER GFF3 output. This process is described in the recent paper:

Campbell MS, Holt C, Moore B, Yandell M. Genome Annotation and Curation Using MAKER and MAKER-P. Curr Protoc Bioinformatics.
http://www.ncbi.nlm.nih.gov/pubmed/25501943

Mike, do you have a PDF of the final print version of that which you could send directly to Christopher?

B

On Jan 16, 2015, at 8:38 AM, Seabury, Christopher wrote:

Dear Colleagues,

I would like to quickly ask about a specific routine/possible function in MAKER. Previously, we have essentially made home-made versions of MAKER by way of multi-step programming. At present we are exploring MAKER, but are wondering if MAKER has the ability to populate the GFF with gene/protein ID information. As an initial experiment, we gave MAKER cDNA RefSeqs, plus the corresponding protein RefSeqs, and a reference, but we do not see the gene/protein ID in the GFF. Is there a subroutine for this, or an option we have missed?

Thanks and Kind Regards,

Christopher M. Seabury PhD
Associate Professor
Department of Veterinary Pathobiology
College of Veterinary Medicine
Texas A&M University
College Station, TX 77843-4467
cseabury at cvm.tamu.edu
Mobile: 979-492-6400

From bmoore at genetics.utah.edu  Mon Mar  9 13:12:10 2015
From: bmoore at genetics.utah.edu (Barry Moore)
Date: Mon, 9 Mar 2015 18:12:10 +0000
Subject: [maker-devel] Does the maker google forum works? -[Doubt] maker2zff line 109
In-Reply-To: <20150309190735.Horde.OZGtWxZkHnkFxArh6ONp6A2@webmail.um.es>
References: <20150309190735.Horde.OZGtWxZkHnkFxArh6ONp6A2@webmail.um.es>
Message-ID: <13826169-DFBB-4E32-AFB1-A01CB3EEE0A4@genetics.utah.edu>

Hi Javier,

The Google forum is only a mirror of the actual mailing list. It exists so that the list gets archived by Google (better Google search results for MAKER users), so posting is not allowed there. Please join the official MAKER mailing list at:

http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

Thanks,

B

On Mar 9, 2015, at 12:07 PM, FRANCISCO JAVIER SANCHEZ GARCIA wrote:

Good morning.

I am having a problem writing messages in the MAKER help forum on Google Groups. I don't know whether the problem is on my side or whether it is a problem with permissions, but I can't see the red button for new threads.

Anyway, I will try to describe my problem with maker2zff, which does not work. My version is v2.31.8.

../maker/marker_v2.31.8/maker/bin/maker2zff ../sequences.all.gff everything.ann everything.dna
[sudo] password for soba:
No such file or directory at ../maker/marker_v2.31.8/maker/bin/maker2zff line 109, line 1922870.

I read something about problematic characters in the ID, but I don't know whether that applies to my case.

http://gmod.827538.n3.nabble.com/Re-Error-when-running-maker2zff-script-td4041743.html

Regards. Thanks in advance.
From javiersg at um.es  Mon Mar  9 17:27:00 2015
From: javiersg at um.es (FRANCISCO JAVIER SANCHEZ GARCIA)
Date: Mon, 09 Mar 2015 23:27:00 +0100
Subject: [maker-devel] [Doubt] maker2zff line 109
Message-ID: <20150309232700.Horde.39Aj_iELfhGei1Bv5JG1fw6@webmail.um.es>

Good night everyone.

I will try to describe my problem with maker2zff, which does not work. My version is v2.31.8. The line reported in the error message is the last line of the GFF file; the error says that it can't find the file or directory.

../maker/marker_v2.31.8/maker/bin/maker2zff ../sequences.all.gff everything.ann everything.dna
[sudo] password for soba:
No such file or directory at ../maker/marker_v2.31.8/maker/bin/maker2zff line 109, line 1922870.

I read something about problematic characters in the ID, but I don't know whether that applies to my case.

http://gmod.827538.n3.nabble.com/Re-Error-when-running-maker2zff-script-td4041743.html

Regards. Thanks in advance.

Fco Javier Sánchez-García
PhD student, Forest Entomology
Área de Biología Animal
Departamento de Zoología y Antropología Física
Facultad de Veterinaria
Universidad de Murcia
Campus de Espinardo
30100 Murcia (España-Spain)
Telf. +34 660 500 416 (mobile phone)
+34 868 888 031 (laboratory-work phone)
http://scholar.google.es/citations?user=AKoUT8cAAAAJ&hl
http://www.researchgate.net/profile/Francisco_Sanchez-Garcia
http://orcid.org/0000-0002-5442-0292
http://www.researcherid.com/rid/M-2407-2014

From carsonhh at gmail.com  Thu Mar 12 14:50:44 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 12 Mar 2015 13:50:44 -0600
Subject: [maker-devel] Better resolve conflicting gene models
In-Reply-To: <54F88886.9010004@ecolevol.de>
References: <54F88886.9010004@ecolevol.de>
Message-ID: <7B8D93E6-DEE7-43D7-BFBF-38E474311A6C@gmail.com>

Sorry for the slow reply.

> how can one reply to an existing thread such that the reply will be added to the same thread?

Threads show up under the Google archive, but that is really just an artifact of how things are archived. The mailing list itself doesn't have threads, just a subject line like any other e-mail. But because we are archiving to a message board, any messages with the same subject get turned into parts of the same thread.

> Example 1: One of the predictors predicts a small additional exon at the start which is also covered by protein or EST data. But sometimes Maker chooses the other predictors model for the final transcript.
> Mostly these are minor differences but I don't understand this behavior?

The gene chosen by MAKER is the one that best matches the evidence. The match is scored by a numeric value called AED (lower means a better match). If a model predicts an exon or a base pair that is not supported by an evidence alignment, then it will be penalized. If a model fails to predict a base pair that is supported by evidence, then it will also be penalized. The differences in your results arise because you changed something between runs: either you added or removed a predictor (which gives MAKER a different set of models to choose from) or the evidence changed (which gives a different AED score). Whichever model has the lowest AED score will always be kept, so some change you made resulted either in a different set of available models to choose from or in a different AED score. Also be aware that models cannot overlap on the same strand, so if you made a change that results in a model being removed from consideration, you may have removed an overlap that was precluding another model with a slightly higher AED from being chosen.

> Example 2: there are some extreme cases like an Augustus prediction with 17 exons which are all covered by Trinity and cufflinks isoforms. Genemark instead predicts two separate small genes with 2 and 4 exons respectively. The resulting final transcript has 7 exons and the additional evidence from the trinity and cufflinks data is treated as UTR.
>
> - Sadly another gene with several exons, which was formerly predicted by both Augustus and Genemark and is also covered by cufflinks and trinity transcripts, now consists only of two small exons in the final transcript.
> Even though Augustus still predicts the same exons and the same evidence is present - only the Genemark prediction is absent which was almost identical to Augustus. This I completely don't understand.

The model chosen will always be the one with the lowest AED.
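
The two penalties described above (unsupported predicted bases, and evidence-supported bases that were not predicted) can be illustrated with a minimal sketch. It assumes the commonly published per-base definition AED = 1 - (sensitivity + specificity) / 2, computed between one model and the pooled evidence. This is not MAKER's own code, which handles details not reproduced here; the function names and coordinates are made up for illustration.

```python
def covered_bases(intervals):
    """Return the set of positions covered by inclusive (start, end) intervals."""
    bases = set()
    for start, end in intervals:
        bases.update(range(start, end + 1))
    return bases


def aed(model_exons, evidence_blocks):
    """Simplified annotation edit distance between one model and pooled evidence.

    Assumption: AED = 1 - (SN + SP) / 2, where SN is the fraction of
    evidence bases covered by the model and SP is the fraction of model
    bases covered by evidence. 0 is a perfect match, 1 is no overlap.
    """
    model = covered_bases(model_exons)
    evidence = covered_bases(evidence_blocks)
    if not model or not evidence:
        return 1.0
    shared = len(model & evidence)
    sensitivity = shared / len(evidence)
    specificity = shared / len(model)
    return 1.0 - (sensitivity + specificity) / 2.0


# Hypothetical evidence: three aligned blocks of 100 bp each.
evidence = [(100, 199), (300, 399), (500, 599)]

perfect_model = [(100, 199), (300, 399), (500, 599)]
missing_first_exon = [(300, 399), (500, 599)]           # fails to predict supported bases
overextended_model = [(100, 199), (300, 399), (500, 699)]  # predicts unsupported bases

for name, model in [("perfect", perfect_model),
                    ("missing first exon", missing_first_exon),
                    ("overextended", overextended_model)]:
    print(f"{name}: AED = {aed(model, evidence):.3f}")
```

In this toy case both imperfect models are penalized, and whichever has the lower value would be the one kept; removing a competing model from the pool changes which value is lowest without changing any individual score.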
The changes you describe are all based on a model with a lower AED supplanting another, which may have opened up a region that was previously overlapped by a longer gene with a poor AED score.

I would also recommend not including Cufflinks output if you have Trinity data. Cufflinks results often falsely bridge regions and will cause gene mergers by falsely extending evidence clusters. This makes it appear as if genes should extend into regions where they shouldn't. Given those hints, predictors will then try to predict at least something in those regions, and if they do, the models will get a good AED score since they are supported by evidence, even if it is false evidence.

--Carson

From carsonhh at gmail.com  Thu Mar 12 15:03:11 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 12 Mar 2015 14:03:11 -0600
Subject: [maker-devel] maker-devel post from avhoeck@sckcen.be requires approval
In-Reply-To:
References:
Message-ID: <73D91049-1CB8-4C59-A9E5-58374FCB0789@gmail.com>

Hi Arne,

The genes found by CEGMA are very short genes, so while those genes might be identifiable, at least partially, on shorter contigs, other genes will be far longer. So the 87% value you get from CEGMA is probably what you are going to be able to find (even partially) for the entire genome. Remember that gene size, when you include the introns, can actually be quite large, and gene predictors need flanking signals from the sequence several hundred bp upstream and downstream to make a prediction, so having a gene only partially contained in a contig makes that gene un-annotatable for ab initio predictors. Unless the organism has a high gene density or very few introns, you usually won't be able to find a gene on contigs smaller than ~10 kb.
--Carson

From avhoeck at SCKCEN.BE  Thu Mar 12 11:38:42 2015
From: avhoeck at SCKCEN.BE (Van Hoeck Arne)
Date: Thu, 12 Mar 2015 16:38:42 +0000
Subject: [maker-devel] TACC lonestar and N50 value
Message-ID: <9BCA01D5BDC2AF46822CA182B4FBD0DF117AB267@MAILSRV3.sck.be>

Dear MAKER developer,

We have a plant genome of about 450 Mbp with an N50 value of 20 kbp, of which only about 3/4 (333 Mbp) is in contigs longer than 10 kbp. CEGMA reported that 87% of the genes were found, whereas 94% were partially identified. You said last time that contigs smaller than 10 kbp are not ideal for annotation and that it is preferable to throw them away. Does this mean that I lose all genes present in the small contigs? Or is there another way to annotate them? (Is concatenating all the small contigs together, with 500 N's between each contig, an option?)

Besides, I could successfully run MAKER via iPlant's Atmosphere. However, for my large genomes I registered at the TACC Lonestar cluster, but Dave C. replied that I won't be able to run on the TACC supercomputers without an allocation. He said that I need to contact my PI. With my login ID I don't have any access to the cluster via ssh, since permission was denied. Therefore, is it possible to use the TACC supercomputers to run MAKER?

Best regards
Arne

From mtollis at asu.edu  Fri Mar 13 15:50:33 2015
From: mtollis at asu.edu (Marc Tollis)
Date: Fri, 13 Mar 2015 13:50:33 -0700
Subject: [maker-devel] Question about pre-masked genome.
Message-ID:

Hello,

I have a genome that has been well annotated for species-specific repeats (using RepeatModeler, TRF, and RepeatMasker), and was wondering if I could simply use the soft-masked assembly for MAKER annotation, in order to skip the sometimes cumbersome repeat-masking step. Is this generally considered doable, or should I just stick with using the unmasked assembly and the repeat masking bundled with my MAKER installation?

P.S. As a spoiler, I already tried this (for SNAP training), but could not confirm that the masking (i.e., lowercase letters) was maintained in the scaffolds as viewed in the MAKER output directory.

Thanks,
Marc

From mtollis at asu.edu  Fri Mar 13 16:48:46 2015
From: mtollis at asu.edu (Marc Tollis)
Date: Fri, 13 Mar 2015 14:48:46 -0700
Subject: [maker-devel] Question about pre-masked genome.
Message-ID:

Hello,

I have a genome that has been well annotated for species-specific repeats (using RepeatModeler, TRF, and RepeatMasker), and was wondering if I could simply use the soft-masked assembly for MAKER annotation, in order to skip the sometimes cumbersome repeat-masking step. Is this generally considered doable, or should I just stick with using the unmasked assembly and the repeat masking bundled with my MAKER installation?

P.S. As a spoiler, I already tried this (for SNAP training), but could not confirm that the masking (i.e., lowercase letters) was maintained in the scaffolds as viewed in the MAKER output directory.

Thanks,
Marc

--
Marc Tollis, Ph.D.
Post-Doctoral Research Associate
Arizona State University
LSE 313
(480) 965-7456
marc.tollis at asu.edu
website: https://sites.google.com/site/tollisresearch/
blog: anolistollis.wordpress.com

From dence at genetics.utah.edu  Fri Mar 13 19:14:52 2015
From: dence at genetics.utah.edu (Daniel Ence)
Date: Sat, 14 Mar 2015 00:14:52 +0000
Subject: [maker-devel] Question about pre-masked genome.
In-Reply-To:
References:
Message-ID: <51DE3D97-10DB-4FF9-A793-EEA051C705BD@genetics.utah.edu>

Hi Marc,

Several groups have been doing what you describe with fungal genomes for a while, and it seems to be working for them. With a soft-masked genome, MAKER will still allow EST alignments to extend into the masked regions; with a hard-masked genome, this wouldn't be possible. Let us know how it works out though!

Thanks,
Daniel
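
Regarding the P.S. in the question above (not being able to confirm that lowercase masking survived), one way to check any FASTA file is to count lowercase bases per sequence. The following is a minimal sketch and not part of MAKER; the function name and the toy sequences are hypothetical, and it makes no assumption about which files MAKER writes or whether it preserves case in them.

```python
import io


def softmask_fraction(fasta_handle):
    """Return {sequence_id: fraction of lowercase (soft-masked) bases}."""
    fractions = {}
    name, lower, total = None, 0, 0
    for line in fasta_handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Store the counts of the previous record before starting a new one.
            if name is not None:
                fractions[name] = lower / total if total else 0.0
            name = (line[1:].split() or [""])[0]
            lower = total = 0
        else:
            lower += sum(1 for base in line if base.islower())
            total += len(line)
    if name is not None:
        fractions[name] = lower / total if total else 0.0
    return fractions


# Toy input; with a real assembly, pass open("your_assembly.fasta") instead.
example = io.StringIO(">scaffold_1 demo\nACGTacgtAC\nGTac\n>scaffold_2\nACGT\n")
result = softmask_fraction(example)
for seq_id, fraction in result.items():
    print(f"{seq_id}\t{fraction:.3f}")
```

Running the same check on the input assembly and on a sequence file taken from the output directory shows whether the lowercase fraction is the same in both.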

From mtollis at asu.edu  Sun Mar 15 09:19:37 2015
From: mtollis at asu.edu (Marc Tollis)
Date: Sun, 15 Mar 2015 07:19:37 -0700
Subject: [maker-devel] control file for SNAP training
Message-ID:

This is a question about process, to make sure I am doing things right (when time is of the essence, some mistakes can set you back weeks).

I have run MAKER on my de novo vertebrate genome, using only the predicted proteome from a congener (well studied and available on Ensembl), and generated the HMM for the first round of SNAP training. As per the 2014 tutorial, I edited the control file for this step as follows: I added the path to the .hmm file, and set protein2genome to 0.

When I run MAKER, I notice that in addition to SNAP it is still running blastx and exonerate. I noticed that this is because I did not remove (or "comment out") the path to protein.fa in the control file (the output looks markedly different when I do comment out the protein file, and I can't even tell whether it's running SNAP in that case).

Is it simply using exonerate to place the ab initio predictions on the scaffolds (meaning that protein2genome=1 tells MAKER to make evidence-based annotations)? Did I do this correctly, or should I also remove protein.fa from the control file for SNAP training?

--
Marc Tollis, Ph.D.
Post-Doctoral Research Associate
Arizona State University
LSE 313
(480) 965-7456
marc.tollis at asu.edu
website: https://sites.google.com/site/tollisresearch/
blog: anolistollis.wordpress.com

From steinj at cshl.edu  Mon Mar 16 08:29:36 2015
From: steinj at cshl.edu (Stein, Joshua)
Date: Mon, 16 Mar 2015 13:29:36 +0000
Subject: [maker-devel] TACC lonestar and N50 value
In-Reply-To: <9BCA01D5BDC2AF46822CA182B4FBD0DF117AB267@MAILSRV3.sck.be>
References: <9BCA01D5BDC2AF46822CA182B4FBD0DF117AB267@MAILSRV3.sck.be>
Message-ID: <0A0DBB88-EC95-43B7-AA75-B7858B67A8F4@cshl.edu>

Hi Arne,

I have experience with iPlant resources and with MAKER-P. I would encourage you to try the Atmosphere image (7888b8e1-c006-4794-82d9-4c940ddbf4c6). You can request a large instance (up to 16 CPUs and 128 GB memory) and run in MPI mode to distribute the work. Please see this tutorial, which includes information on running in MPI mode:
https://pods.iplantcollaborative.org/wiki/display/sciplant/MAKER-P+Atmosphere+Tutorial

You can also access the TACC Lonestar installation using the iPlant Discovery Environment. There is an app called "MAKER-P-Lonestar-Small-Genomes 2.3". Although it is advertised as appropriate for "small" genomes, I think there is a good chance that it will work for 450 Mb. This is a new resource, and the iPlant team would value any feedback and benchmarks on how the system is working. Depending on how this goes, there are plans to roll out additional apps intended for larger genomes. Here is a tutorial:
https://pods.iplantcollaborative.org/wiki/display/sciplant/Tutorial+for+running+MAKER-P+on+TACC-Lonestar+from+iPlant+Discovery+Environment

Regarding contig sizes: though not ideal, you can include contigs smaller than 10 kbp in your run. Plant genes tend to be more compact than vertebrate genes, so you ought to be able to recover annotations on the smaller contigs, though keep an eye out for truncated genes.

Best,
Josh

Joshua Stein, PhD
Manager, Sci. Informatics III
Cold Spring Harbor Laboratory
steinj at cshl.edu
http://ware.cshl.org/

From mtollis at asu.edu  Tue Mar 17 16:26:44 2015
From: mtollis at asu.edu (Marc Tollis)
Date: Tue, 17 Mar 2015 14:26:44 -0700
Subject: [maker-devel] control file for SNAP training
In-Reply-To:
References:
Message-ID:

I answered my own question: there is no need to align the proteins again, which takes too long. So, I used the GFF file from the gff_merge on the log file from the first run (the one with just protein2genome). Then, after generating the .hmm file, I put it in my control file, set protein2genome=0, removed protein.fasta, and set maker_gff and protein_pass=1.
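
The control-file changes just listed can be sketched as a small script. This is only an illustration: the helper `set_ctl_options` is hypothetical (it is not a MAKER tool), the control-file excerpt is abbreviated, and the file names `round1.all.gff` and `round1.snap.hmm` are made-up placeholders. It assumes the usual `key=value #comment` layout of maker_opts.ctl.

```python
import re


def set_ctl_options(ctl_text, updates):
    """Rewrite 'key=value #comment' lines for the keys in updates, keeping comments."""
    remaining = dict(updates)
    out = []
    for line in ctl_text.splitlines():
        match = re.match(r"^(\w+)=([^#]*)(#.*)?$", line)
        if match and match.group(1) in remaining:
            key = match.group(1)
            comment = match.group(3) or ""
            new_line = f"{key}={remaining.pop(key)}"
            if comment:
                new_line += " " + comment
            out.append(new_line)
        else:
            out.append(line)
    if remaining:
        raise KeyError(f"options not found in control file: {sorted(remaining)}")
    return "\n".join(out) + "\n"


# Abbreviated, illustrative excerpt of a first-round control file.
first_round = """\
#-----Re-annotation Using MAKER Derived GFF3
maker_gff= #MAKER derived GFF3 file
protein_pass=0 #use protein alignments in maker_gff: 1 = yes, 0 = no
#-----Protein Homology Evidence
protein=protein.fa #protein sequence file in fasta format
#-----Gene Prediction
snaphmm= #SNAP HMM file
protein2genome=1 #infer predictions from protein homology, 1 = yes, 0 = no
"""

# Settings described in the message above; file names are placeholders.
patched = set_ctl_options(first_round, {
    "maker_gff": "round1.all.gff",   # merged GFF3 from the first run
    "protein_pass": "1",             # reuse the protein alignments stored in that GFF3
    "protein": "",                   # no protein FASTA, so no new blastx/exonerate
    "snaphmm": "round1.snap.hmm",    # HMM produced by the SNAP training round
    "protein2genome": "0",           # stop inferring models directly from proteins
})
print(patched)
```

Reusing the alignments through `maker_gff` and `protein_pass=1` is what lets the second round skip the alignment step while still having the protein evidence available.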
The output now shows that only SNAP is running, with no blastx or exonerate, which is a relief because it is much faster!

--
Marc Tollis, Ph.D.
Post-Doctoral Research Associate
Arizona State University
LSE 313
(480) 965-7456
marc.tollis at asu.edu
website: https://sites.google.com/site/tollisresearch/
blog: anolistollis.wordpress.com

From carsonhh at gmail.com  Tue Mar 17 21:47:50 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 17 Mar 2015 20:47:50 -0600
Subject: [maker-devel] control file for SNAP training
In-Reply-To:
References:
Message-ID:

You can also add an additional protein file if you want more evidence than what is already in the GFF3. This makes re-annotation and adding evidence to an already annotated genome really easy.

--Carson

From Brian.Mack at ARS.USDA.GOV  Fri Mar 20 08:17:09 2015
From: Brian.Mack at ARS.USDA.GOV (Mack, Brian)
Date: Fri, 20 Mar 2015 13:17:09 +0000
Subject: [maker-devel] est2genome wrong strand
Message-ID:

Hi,

I've noticed what seems to be a flipping of the strand of some of my transcripts by est2genome. I assembled my directional RNA-seq reads using Trinity. I've copied an example below. The blastn within MAKER shows the transcript aligning on the positive strand, as does blastn against my genome in SequenceServer. But est2genome shows the strand as negative. I've noticed this for quite a few transcripts while examining them in WebApollo. Any ideas what might be causing this?
Thanks,
Brian

Query= comp17103_c1_seq3 len=612 path=[8488307:0-177 8492039:178-496

>contig_69
Length=108040

 Score = 1043 bits (1156), Expect = 0.0
 Identities = 589/592 (99%), Gaps = 3/592 (1%)
 Strand=Plus/Plus

Query  24      TCTTTATTCTTTTATTTCCACTTGAGCAATTATTTCCGGGTCAACCTATTCGGTCGTTCT  83
               ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  105546  TCTTTATTCTTTTATTTCCACTTGAGCAATTATTTCCGGGTCAACCTATTCGGTCGTTCT  105605

Query  84      CTCCGTTGAGCC-TTCCCTCCCCAAGTAATATTGGAAAGTCGTTCTCTCGCTCATAATTT  142
               |||||||||||| |||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  105606  CTCCGTTGAGCCCTTCCCTCCCCAAGTAATATTGGAAAGTCGTTCTCTCGCTCATAATTT  105665

69 blastn expressed_sequence_match 105546 106137 559 + . ID=69:hit:182380:3.2.0.0;Name=comp17103_c1_seq3
69 blastn match_part 105546 106137 559 + . ID=69:hsp:377369:3.2.0.0;Parent=69:hit:182380:3.2.0.0;Target=comp17103_c1_seq3 24 612 +;Gap=M70 D1 M85 D1 M75 D1 M359
69 est2genome expressed_sequence_match 105546 106137 2909 - . ID=69:hit:182644:3.2.0.0;Name=comp17103_c1_seq3
69 est2genome match_part 105546 106137 2909 - . ID=69:hsp:377775:3.2.0.0;Parent=69:hit:182644:3.2.0.0;Target=comp17103_c1_seq3 24 612 -;Gap=M70 D1 M85 D1 M75 D1 M359

From carsonhh at gmail.com Fri Mar 20 09:54:28 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 20 Mar 2015 08:54:28 -0600
Subject: [maker-devel] est2genome wrong strand

Hi Brian,

Multi-exon ESTs are stranded by their splice sites which can only work on one strand. Single exon ESTs on the other hand are stranded by their open reading frame. The standard chemistry used in most sequencing does not allow for strand specific sequence. There are technologies that do, but unless you used one of those, re-stranding single exon ESTs based on ORF is the best way to figure out where single exon alignments should go (not a completely reliable method but it works most of the time).
I believe Trinity tries to determine strand the same way (but it is unreliable there too). For example, since your alignment is not an exact match to the genomic sequence (single bp deletion in the alignment), the best open reading frame in the transcript is not the same as the best open reading frame in the genomic sequence (so one of them likely contains an error). Since MAKER is annotating the genome, it logically re-strands the alignment to the genome, whereas Trinity (being unaware of the genome) strands it to the transcript.

Because single exon alignments are very unreliable, they are ignored in MAKER by default. They will not be used as hints for gene predictors and can only be used to support a gene if there is also protein evidence or a single exon ab initio prediction at the same location to support it (even then this will only happen if you set single_exon=1 in the control files).

--Carson

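To see how widespread the strand disagreement in this thread is, one could scan the MAKER GFF3 for transcripts whose blastn and est2genome hits at the same location carry different strands. The following is only a minimal sketch: the function name `strand_conflicts` is made up for illustration, and it assumes tab-delimited GFF3 in which both hit types are written as `expressed_sequence_match` features with a `Name` attribute, as in the records quoted in this thread.

```python
from collections import defaultdict


def strand_conflicts(gff3_lines, sources=("blastn", "est2genome")):
    """Report transcripts whose hits from the two sources disagree on strand.

    Returns {(seqid, transcript_name): {source: [strands]}}.
    Hypothetical helper: assumes tab-delimited GFF3 with a Name attribute.
    """
    strands = defaultdict(lambda: defaultdict(set))
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a feature line
        seqid, source, ftype, strand, attrs = cols[0], cols[1], cols[2], cols[6], cols[8]
        if source not in sources or ftype != "expressed_sequence_match":
            continue
        # Column 9 is a ';'-separated list of key=value pairs.
        fields = dict(kv.split("=", 1) for kv in attrs.split(";") if "=" in kv)
        name = fields.get("Name")
        if name:
            strands[(seqid, name)][source].add(strand)

    conflicts = {}
    for key, by_source in strands.items():
        first = by_source.get(sources[0], set())
        second = by_source.get(sources[1], set())
        # A conflict means both sources have hits and share no strand.
        if first and second and first.isdisjoint(second):
            conflicts[key] = {s: sorted(by_source[s]) for s in sources}
    return conflicts


if __name__ == "__main__":
    demo = [
        "\t".join(["69", "blastn", "expressed_sequence_match", "105546", "106137",
                   "559", "+", ".", "ID=69:hit:182380:3.2.0.0;Name=comp17103_c1_seq3"]),
        "\t".join(["69", "est2genome", "expressed_sequence_match", "105546", "106137",
                   "2909", "-", ".", "ID=69:hit:182644:3.2.0.0;Name=comp17103_c1_seq3"]),
    ]
    for key, detail in strand_conflicts(demo).items():
        print(key, detail)
```

Grouping by sequence ID and transcript name is a simplification: a transcript that aligns to several loci on the same contig would be pooled, so coordinates may need to be added to the key for real data.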
From xvazquezc at gmail.com Sat Mar 21 22:27:27 2015
From: xvazquezc at gmail.com (Xabier Vázquez Campos)
Date: Sun, 22 Mar 2015 14:27:27 +1100
Subject: [maker-devel] annotation stats: repeats

Hi all,

I was wondering how I can get data about the repeat content of the genome from maker, if possible, as well as each type of repeat: RE, transposons, simple repeats, low complexity repeats.

Thank you in advance,
Xabier

--
Xabier Vázquez Campos
*PhD Candidate*
Water Research Centre
School of Civil and Environmental Engineering
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

From dence at genetics.utah.edu Sun Mar 22 00:56:06 2015
From: dence at genetics.utah.edu (Daniel Ence)
Date: Sun, 22 Mar 2015 05:56:06 +0000
Subject: [maker-devel] annotation stats: repeats
Message-ID: <4FFB960C-CF1A-440F-B767-36EC369D2B58@genetics.utah.edu>

Hi Xabier,

MAKER offers several options for repeat masking. There's a file of repetitive element proteins that is run with RepeatRunner, and you can run RepeatMasker with a custom library and/or one of the default RepeatMasker libraries. The results from these programs are in the GFF3 files that MAKER generates. Depending on the library that you give RepeatMasker, the repeats might be classified by transposable element family and as simple vs. complex repeats. These are probably a good place to start, but you'll have to extract that data from the GFF3 file and compile it. Let us know whether that helps.

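As a rough illustration of the "extract that data from the GFF3 file and compile it" step, here is a minimal sketch that totals masked bases per repeat class. It rests on several assumptions that should be checked against your own output: that repeat features carry the source `repeatmasker` or `repeatrunner`, that top-level hits have type `match`, and that the class is stored in the `Name` attribute as `species:...|genus:...`. The function names are hypothetical.

```python
from collections import defaultdict
from urllib.parse import unquote

# Assumption: source names used for repeat features in the MAKER GFF3.
REPEAT_SOURCES = {"repeatmasker", "repeatrunner"}


def merged_length(intervals):
    """Total bases covered by 1-based inclusive intervals, merging overlaps."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end + 1:
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


def repeat_class(attributes):
    """Pull the class out of Name=species:X|genus:Y (assumed layout)."""
    fields = dict(kv.split("=", 1) for kv in attributes.split(";") if "=" in kv)
    for part in fields.get("Name", "").split("|"):
        part = unquote(part)  # GFF3 escapes characters such as '/' as %2F
        if part.startswith("genus:"):
            return part[len("genus:"):]
    return "unclassified"


def summarize_repeats(gff3_lines):
    """Return {repeat_class: masked_bp} from tab-delimited GFF3 lines."""
    by_class = defaultdict(lambda: defaultdict(list))
    for line in gff3_lines:
        if line.startswith("##FASTA"):
            break  # sequence section follows, no more features
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[1] not in REPEAT_SOURCES or cols[2] != "match":
            continue
        interval = (int(cols[3]), int(cols[4]))
        by_class[repeat_class(cols[8])][cols[0]].append(interval)
    return {
        cls: sum(merged_length(iv) for iv in per_seq.values())
        for cls, per_seq in by_class.items()
    }
```

Dividing each total by the assembly length gives a percentage per class. Overlaps are merged within a class only, so the per-class totals can sum to more than the overall masked fraction.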
Thanks,
Daniel

From panos.ioannidis at gmail.com Tue Mar 24 03:29:14 2015
From: panos.ioannidis at gmail.com (Panos Ioannidis)
Date: Tue, 24 Mar 2015 09:29:14 +0100
Subject: [maker-devel] Augustus retraining

Hello All,

I'm trying to retrain Augustus using EST data from the same species and realized that quite a few of the gene models I get based on EST data are incomplete (i.e. no start and/or stop codon).

Now, when I get to the "etraining" step in Augustus retraining (right after the time-consuming "optimize_augustus.pl" step), I get a warning for each gene that doesn't contain a start or stop codon.

.....
gene maker-scaffold4|size2210279-exonerate_est2genome-gene-20.1-mRNA-1 transcr. 1 in sequence scaffold4|size2210279_2021791-2044735: Initial exon does not begin with start codon but with acg
gene maker-scaffold4|size2210279-exonerate_est2genome-gene-20.2-mRNA-1 transcr. 1 in sequence scaffold4|size2210279_2045713-2064983: Terminal exon doesn't end in stop codon. Variable stopCodonExcludedFromCDS set right?
....

Does anyone know whether training is compromised by such incomplete gene models? Do you usually exclude them from the training set?

Oh, and by the way, the best guide to retraining Augustus is here. The official web page isn't bad, but doesn't explain certain things in detail.

Thanks,
Panos

From xvazquezc at gmail.com Tue Mar 24 07:06:25 2015
From: xvazquezc at gmail.com (Xabier Vázquez Campos)
Date: Tue, 24 Mar 2015 23:06:25 +1100
Subject: [maker-devel] Augustus retraining

Hi Panos,

Have you tried using WebAugustus for the (re)training? I found it very convenient for generating the models for Augustus.

Cheers,

--
Xabier Vázquez Campos

From panos.ioannidis at gmail.com Tue Mar 24 07:24:45 2015
From: panos.ioannidis at gmail.com (Panos Ioannidis)
Date: Tue, 24 Mar 2015 13:24:45 +0100
Subject: [maker-devel] Augustus retraining

Hi Xabier,

Thanks for your quick reply!

No, I haven't used WebAugustus, but I just checked it out and it looks like my training set is too big (~300 Mbp), so I can't even upload it!

Anyway, I prefer to train it locally because I have better control over each step. Also, I have done the entire training procedure with fewer genes, but didn't get a good gene-level sensitivity (~5%). So now I'm trying to replicate it using more of my scaffolds, but it appears that I get a lot more incomplete models from exonerate (run through Maker).

P

From carsonhh at gmail.com Tue Mar 24 09:14:51 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 24 Mar 2015 08:14:51 -0600
Subject: [maker-devel] Augustus retraining
Message-ID: <09C2F8D0-7014-45DD-B067-86ABB37F4A68@gmail.com>

Hi Panos,

ESTs and mRNA-seq assemblies will by their nature be partial. After a first round of training you can run MAKER together with protein and EST evidence and the newly trained Augustus species file. Because MAKER gives hints to Augustus as it runs, the models it produces will be improved over what it would get from just running Augustus on its own. Then take these gene models and use them to retrain Augustus.
This is the standard bootstrap retraining procedure, and can be repeated as needed.

More info on bootstrap training here (info is for SNAP but procedure is similar to Augustus) -> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Training_ab_initio_Gene_Predictors
Here is an excellent explanation of Augustus training -> http://brie4.cshl.edu/pipermail/gmod-help/2012-June/001724.html
and here are tools to convert SNAP training files to Augustus training files (MAKER comes with a tool that converts GFF3 for SNAP training so just take that and convert it for Augustus) -> https://github.com/hyphaltip/genome-scripts/blob/master/gene_prediction/zff2augustus_gbk.pl

Finally you can also manually edit the GFF3 file in Apollo (easier to use the legacy stand alone version), and then convert that file for bootstrap training.

--Carson

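For anyone who wants to see how many models in a training set are partial before deciding whether to keep them, a small check along these lines may help. This is only a sketch under stated assumptions: the function names are hypothetical, each CDS is assumed to be already spliced, oriented 5' to 3', and to include its stop codon, and the standard genetic code is assumed. It does not reproduce Augustus's own checks.

```python
# Assumption: standard genetic code stop codons.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def cds_completeness(cds):
    """Describe whether a spliced CDS has a start codon, a stop codon,
    and a length that is a multiple of three."""
    seq = cds.upper()
    return {
        "has_start": seq.startswith("ATG"),
        "has_stop": len(seq) >= 3 and seq[-3:] in STOP_CODONS,
        "in_frame": len(seq) % 3 == 0,
    }


def split_models(cds_by_id):
    """Split model IDs into (complete, partial) lists, each sorted by ID."""
    complete, partial = [], []
    for model_id in sorted(cds_by_id):
        status = cds_completeness(cds_by_id[model_id])
        (complete if all(status.values()) else partial).append(model_id)
    return complete, partial


if __name__ == "__main__":
    # Toy sequences, not taken from the thread.
    models = {
        "model_a": "ATGAAATAA",   # start and stop present
        "model_b": "ACGAAATAA",   # begins with acg, like the warning above
        "model_c": "ATGAAA",      # no stop codon
    }
    complete, partial = split_models(models)
    print("complete:", complete)
    print("partial:", partial)
```

The split is meant for counting and inspection. Per the advice in this thread, partial models are acceptable for a first training round and are replaced by improved models in later bootstrap rounds.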
From panos.ioannidis at gmail.com Tue Mar 24 09:31:04 2015
From: panos.ioannidis at gmail.com (Panos Ioannidis)
Date: Tue, 24 Mar 2015 15:31:04 +0100
Subject: [maker-devel] Augustus retraining
In-Reply-To: <09C2F8D0-7014-45DD-B067-86ABB37F4A68@gmail.com>
References: <09C2F8D0-7014-45DD-B067-86ABB37F4A68@gmail.com>

Hi Carson,

So you think it's okay to include incomplete gene models when training Augustus?

I'll certainly try the bootstrap method you're suggesting. Even though I did it for SNAP, for some weird reason I forgot it for Augustus :p Do you think, however, that I can get a big improvement in gene-level sensitivity? Currently, I have only 6%...

Thanks,
Panos

From carsonhh at gmail.com Tue Mar 24 09:39:20 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 24 Mar 2015 08:39:20 -0600
Subject: [maker-devel] Augustus retraining
References: <09C2F8D0-7014-45DD-B067-86ABB37F4A68@gmail.com>

On your first round it is fine.
It gives the predictor enough to work with, then on the second round you use improved models. When you say 6% sensitivity, is that Augustus running on its own? If it's inside of MAKER, that means you are not providing sufficient protein evidence (you need the full proteome of at least two related species). Also, is that the gene level, exon level, or nucleotide level sensitivity? If you are looking at the gene level sensitivity measure, you only get a match when you perfectly match all transcripts in a gene (models that may not be correct in the first place). This value will rarely go above 10% for any predictor. You need to use the nucleotide level sensitivity/specificity metrics. The gene and exon level metrics are basically meaningless (unless it's Drosophila, which is the only species annotated correctly enough to use them).

--Carson

From panos.ioannidis at gmail.com Tue Mar 24 10:05:54 2015
From: panos.ioannidis at gmail.com (Panos Ioannidis)
Date: Tue, 24 Mar 2015 16:05:54 +0100
Subject: [maker-devel] Augustus retraining
References: <09C2F8D0-7014-45DD-B067-86ABB37F4A68@gmail.com>

Yes, 6% is gene-level sensitivity. Exon-level is 62% and nucleotide-level is 88%.
I only mentioned gene-level, because that's the only metric mentioned in the Augustus web site. I got these numbers outside of Maker. Actually, I only used Maker to generate the gff files needed to start the training (ran it using only EST evidence and only on a subset of my assembly, using this as a guide). Now, I've started running the second round of training, as you suggested. Since, however, I don't have data from closely related species, I'm only using Uniref50 as protein evidence. P On Tue, Mar 24, 2015 at 3:39 PM, Carson Holt wrote: > On your first round it is fine. It gives the predictor enough to work > with, then on the second round you use improved models. When you say 6% > sensitivity is that Augustus running on it?s own? If it?s inside of MAKER > that means you are not providing sufficient protein evidence (you need the > full proteome of at least two related species). Also is that the gene > level, exon level, or nucleotide level sensitivity. If you are looking at > the gene level sensitivity measure, you only get a match when you perfectly > match all transcripts in a gene (models that may not be correct in the > first place). This value will rarely go above 10% for any predictor. You > need to use the nucleotide level sensitivity/specificity metrics. The gene > and exon level metrics are basically meaningless (unless it?s Drosophila > which is the only species annotated correctly enough to use them). > > ?Carson > > > On Mar 24, 2015, at 8:31 AM, Panos Ioannidis > wrote: > > Hi Carson, > > So you think it's okay to include incomplete gene models when training > Augustus? > > I'll certainly try the bootstrap method you're suggesting. Even though I > did it for SNAP, for some weird reason I forgot it for Augustus :p Do you > think, however, that I can get a big improvement in gene-level sensitivity? > Currently, I have only 6%... 
> > Thanks, > Panos > > > On Tue, Mar 24, 2015 at 3:14 PM, Carson Holt wrote: > >> Hi Panos, >> >> EST?s and mRNA-seq assemblies will bey their nature be partial. After a >> first round of training you can run MAKER together with protein and EST >> evidence and the newly trained Augustus species file. Because MAKER gives >> hints to Augustus as it runs, the models it produces will be improved over >> what it would get from just running Augustus on it?s own. Then take these >> gene models and use them to retrain Augustus. This is the standard >> bootstrap retraining procedure, and can be repeated as needed. >> >> More info on bootstrap training here (info is for SNAP but procedure is >> similar to Augustus) ?> http://weatherby.genetics. >> utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_ >> Online_Training_2014#Training_ab_initio_Gene_Predictors >> Here is an excellent explanation of Augustus training ?> >> http://brie4.cshl.edu/pipermail/gmod-help/2012-June/001724.html >> and here are tools to convert SNAP training files to Augustus training >> files (MAKER comes with a tool that converts GFF3 for SNAP training so just >> take that and convert it for Augustus)?> https://github.com/ >> hyphaltip/genome-scripts/blob/master/gene_prediction/zff2augustus_gbk.pl >> >> Finally you can also manually edit the GFF3 file in Apollo (easier to use >> the legacy stand alone version), and then convert that file for bootstrap >> training. >> >> ?Carson >> >> >> On Mar 24, 2015, at 6:24 AM, Panos Ioannidis >> wrote: >> >> Hi Xabier, >> >> Thanks for your quick reply! >> >> No, I haven't used WebAugustus, but I just checked it out and it looks >> like my training set is too big (~300 Mbp), so I can't even upload it! >> >> Anyway, I prefer to train it locally because I have better control over >> each step. Also, I have done the entire training procedure with less genes, >> but didn't get a good gene-level sensitivity (~5%). 
So now I'm trying to replicate it using more of my scaffolds, but as it appears I get a lot more incomplete models from exonerate (run through Maker).
>>
>> P
>>
>> On Tue, Mar 24, 2015 at 1:06 PM, Xabier Vázquez Campos <xvazquezc at gmail.com> wrote:
>>
>>> Hi Panos,
>>>
>>> Have you tried using webAugustus for the (re)training? I found it very convenient for generating the models for Augustus.
>>>
>>> Cheers,
>>>
>>> 2015-03-24 19:29 GMT+11:00 Panos Ioannidis:
>>>
>>>> Hello All,
>>>>
>>>> I'm trying to retrain Augustus using EST data from the same species and realized that quite a few of the gene models I get based on EST data are incomplete (i.e. no start and/or stop codon).
>>>>
>>>> Now, when I get to the "etraining" step in Augustus retraining (right after the time-consuming "optimize_augustus.pl" step), I get a warning for each gene that doesn't contain a start or stop codon.
>>>>
>>>> .....
>>>> gene maker-scaffold4|size2210279-exonerate_est2genome-gene-20.1-mRNA-1 transcr. 1 in sequence scaffold4|size2210279_2021791-2044735: Initial exon does not begin with start codon but with acg
>>>> gene maker-scaffold4|size2210279-exonerate_est2genome-gene-20.2-mRNA-1 transcr. 1 in sequence scaffold4|size2210279_2045713-2064983: Terminal exon doesn't end in stop codon. Variable stopCodonExcludedFromCDS set right?
>>>> ....
>>>>
>>>> Does anyone know whether training is compromised by such incomplete gene models? Do you usually exclude them from the training set?
>>>>
>>>> Oh, and by the way, the best guide to retraining Augustus is here. The official web page isn't bad, but doesn't explain certain things in detail.
>>>>
>>>> Thanks,
>>>> Panos
>>>>
>>>> _______________________________________________
>>>> maker-devel mailing list
>>>> maker-devel at box290.bluehost.com
>>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>>
>>> --
>>> Xabier Vázquez Campos
>>> PhD Candidate
>>> Water Research Centre
>>> School of Civil and Environmental Engineering
>>> The University of New South Wales
>>> Sydney NSW 2052 AUSTRALIA

From carsonhh at gmail.com Tue Mar 24 10:38:08 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 24 Mar 2015 09:38:08 -0600
Subject: [maker-devel] Augustus retraining
In-Reply-To:
References: <09C2F8D0-7014-45DD-B067-86ABB37F4A68@gmail.com>
Message-ID: <4EA5A1F6-2950-4D65-A59C-0F3848C86C02@gmail.com>

I'd pick a couple of species that are as closely related as you can find. Proteins will still align over large evolutionary distances, but you need a breadth of protein data that UniRef and even databases like Swiss-Prot won't have (those databases are usually a little too conservative).

The gene-level sensitivity/specificity values make certain assumptions about the models you are comparing with. Unfortunately, the only species these assumptions hold true for is Drosophila, and maybe C. elegans to a point. This is the reason you will rarely see species other than those two used for the gene- and exon-level sensitivity/specificity metrics.

Thanks,
Carson

> On Mar 24, 2015, at 9:05 AM, Panos Ioannidis wrote:
>
> Yes, 6% is gene-level sensitivity. Exon-level is 62% and nucleotide-level is 88%. I only mentioned gene-level because that's the only metric mentioned on the Augustus web site.
>
> I got these numbers outside of Maker.
Actually, I only used Maker to generate the gff files needed to start the training (ran it using only EST evidence and only on a subset of my assembly, using this as a guide). > > Now, I've started running the second round of training, as you suggested. Since, however, I don't have data from closely related species, I'm only using Uniref50 as protein evidence. > > P > > On Tue, Mar 24, 2015 at 3:39 PM, Carson Holt > wrote: > On your first round it is fine. It gives the predictor enough to work with, then on the second round you use improved models. When you say 6% sensitivity is that Augustus running on it?s own? If it?s inside of MAKER that means you are not providing sufficient protein evidence (you need the full proteome of at least two related species). Also is that the gene level, exon level, or nucleotide level sensitivity. If you are looking at the gene level sensitivity measure, you only get a match when you perfectly match all transcripts in a gene (models that may not be correct in the first place). This value will rarely go above 10% for any predictor. You need to use the nucleotide level sensitivity/specificity metrics. The gene and exon level metrics are basically meaningless (unless it?s Drosophila which is the only species annotated correctly enough to use them). > > ?Carson > > >> On Mar 24, 2015, at 8:31 AM, Panos Ioannidis > wrote: >> >> Hi Carson, >> >> So you think it's okay to include incomplete gene models when training Augustus? >> >> I'll certainly try the bootstrap method you're suggesting. Even though I did it for SNAP, for some weird reason I forgot it for Augustus :p Do you think, however, that I can get a big improvement in gene-level sensitivity? Currently, I have only 6%... >> >> Thanks, >> Panos >> >> >> On Tue, Mar 24, 2015 at 3:14 PM, Carson Holt > wrote: >> Hi Panos, >> >> EST?s and mRNA-seq assemblies will bey their nature be partial. 
After a first round of training you can run MAKER together with protein and EST evidence and the newly trained Augustus species file. Because MAKER gives hints to Augustus as it runs, the models it produces will be improved over what it would get from just running Augustus on its own. Then take these gene models and use them to retrain Augustus. This is the standard bootstrap retraining procedure, and can be repeated as needed.
>>
>> More info on bootstrap training here (info is for SNAP but the procedure is similar for Augustus) -> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Training_ab_initio_Gene_Predictors
>> Here is an excellent explanation of Augustus training -> http://brie4.cshl.edu/pipermail/gmod-help/2012-June/001724.html
>> and here are tools to convert SNAP training files to Augustus training files (MAKER comes with a tool that converts GFF3 for SNAP training, so just take that and convert it for Augustus) -> https://github.com/hyphaltip/genome-scripts/blob/master/gene_prediction/zff2augustus_gbk.pl
>>
>> Finally, you can also manually edit the GFF3 file in Apollo (easier to use the legacy stand-alone version), and then convert that file for bootstrap training.
>>
>> --Carson
>>
>>> On Mar 24, 2015, at 6:24 AM, Panos Ioannidis wrote:
>>>
>>> Hi Xabier,
>>>
>>> Thanks for your quick reply!
>>>
>>> No, I haven't used WebAugustus, but I just checked it out and it looks like my training set is too big (~300 Mbp), so I can't even upload it!
>>>
>>> Anyway, I prefer to train it locally because I have better control over each step. Also, I have done the entire training procedure with fewer genes, but didn't get a good gene-level sensitivity (~5%). So now I'm trying to replicate it using more of my scaffolds, but as it appears I get a lot more incomplete models from exonerate (run through Maker).
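For readers who want to see how many models are incomplete before deciding whether to keep them, here is a minimal Python sketch. It assumes the CDS of each transcript has already been extracted as a plain nucleotide string keyed by transcript ID; the function names and toy sequences are hypothetical. Whether incomplete models should be excluded is a judgment call, and the advice elsewhere in this thread is that partial models are acceptable for a first training round.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_complete_cds(cds):
    """True if the CDS starts with ATG, ends with a stop codon, and is a multiple of 3 long."""
    seq = cds.upper()
    return (
        len(seq) >= 6
        and len(seq) % 3 == 0
        and seq.startswith("ATG")
        and seq[-3:] in STOP_CODONS
    )


def split_by_completeness(cds_by_id):
    """Partition {transcript_id: cds_sequence} into (complete_ids, incomplete_ids)."""
    complete, incomplete = [], []
    for transcript_id, cds in cds_by_id.items():
        (complete if is_complete_cds(cds) else incomplete).append(transcript_id)
    return complete, incomplete


if __name__ == "__main__":
    # Toy sequences, not real data.
    toy = {
        "mRNA-1": "ATGGCTTGA",  # start and stop codon present
        "mRNA-2": "ACGGCTTGA",  # initial exon begins with acg, not atg
        "mRNA-3": "ATGGCTGCT",  # terminal exon lacks a stop codon
    }
    complete, incomplete = split_by_completeness(toy)
    print("complete:", complete)
    print("incomplete:", incomplete)
```

Note that this check assumes the stop codon is included in the CDS. If the CDS features exclude it, every model would be flagged, which is what the Augustus warning about stopCodonExcludedFromCDS quoted in this thread hints at.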
>>> >>> P >>> >>> >>> >>> On Tue, Mar 24, 2015 at 1:06 PM, Xabier V?zquez Campos > wrote: >>> Hi Panos, >>> >>> Have you tried using webAugustus for the (re)training? I found it very convenient for generating the models for Augustus. >>> >>> Cheers, >>> >>> 2015-03-24 19:29 GMT+11:00 Panos Ioannidis >: >>> Hello All, >>> >>> I'm trying to retrain Augustus using EST data from the same species and realized that quite a few of the gene models I get based on EST data are incomplete (i.e. no start and/or stop codon). >>> >>> Now, when I get to the "etraining" step in Augustus retraining (right after the time-consuming "optimize_augustus.pl " step), I get a warning for each gene that doesn't contain a start or stop codon. >>> >>> ..... >>> gene maker-scaffold4|size2210279-exonerate_est2genome-gene-20.1-mRNA-1 transcr. 1 in sequence scaffold4|size2210279_2021791-2044735: Initial exon does not begin with start codon but with acg >>> gene maker-scaffold4|size2210279-exonerate_est2genome-gene-20.2-mRNA-1 transcr. 1 in sequence scaffold4|size2210279_2045713-2064983: Terminal exon doesn't end in stop codon. Variable stopCodonExcludedFromCDS set right? >>> .... >>> >>> Does anyone know whether training is compromised by such incomplete gene models? Do you usually exclude them from the training set? >>> >>> Oh, and by the way, the best guide to retraining Augustus is here . The official web page isn't bad, but doesn't explain in detail certain things. 
>>> >>> Thanks, >>> Panos >>> >>> >>> _______________________________________________ >>> maker-devel mailing list >>> maker-devel at box290.bluehost.com >>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org >>> >>> >>> >>> >>> -- >>> Xabier V?zquez Campos >>> PhD Candidate >>> Water Research Centre >>> School of Civil and Environmental Engineering >>> The University of New South Wales >>> Sydney NSW 2052 AUSTRALIA >>> >>> _______________________________________________ >>> maker-devel mailing list >>> maker-devel at box290.bluehost.com >>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org >> >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From alicebdennis at gmail.com Thu Mar 26 05:34:26 2015 From: alicebdennis at gmail.com (Alice Dennis) Date: Thu, 26 Mar 2015 11:34:26 +0100 Subject: [maker-devel] iterative Maker2 In-Reply-To: <1c9847450c4142d3b7e2b81f893773f0@EE-HUB1.ee.emp-eaw.ch> References: <1FD5809847938F44B92893606806BD53600D845F@EE-MBX1.ee.emp-eaw.ch> <1c9847450c4142d3b7e2b81f893773f0@EE-HUB1.ee.emp-eaw.ch> Message-ID: Hello again, I posted a while ago about a genome I'm running through the Maker2 pipeline. I was concerned because my results were still changing with 3 and 4 iterations. Following the very useful advice of Carson (below), I've made a few modifications (adding a RepeatModeler run, using a big protein database), but my gene predictions are still changing between the 3rd and 4th iterations. Perhaps this is ok, but these increasing gene lengths make me worry that I haven't built stable models. Here is the short version of what I've done. 1. Run RepeatModeler, but this only produced 47 sequences in the resulting .fasta... so that seemed a bit small. 2. Run Maker2 using: - RepeatModeler output + "model_org=all" and "softmask=1" in the Repeat Masking section. 
- protein evidence from 2 distantly related species AND all of UniProt
- ESTs from a different strain of my species (a parasitoid wasp)
- the .hmm from Nasonia, one of the 2 distantly related species whose proteome I also provided as protein evidence
- my assembled genome of 1,509 scaffolds.

3. After this, I did three subsequent rounds of Maker2 (cleverly named Rounds 2, 3 and 4). Each one used the same input, except the Nasonia .hmm was replaced by a SNAP-generated .hmm from the previous round. Also, est2genome and protein2genome were changed from 1 to 0 in all runs after the first.

Here are some results:
Round1: 14,647 genes, average length 2,491
Round2: 12,158 genes, average length 3,760
Round3: 13,515 genes, average length 3,090
Round4: 12,169 genes, average length 3,918

This is a bit confusing because the number of genes predicted goes up and down, as do their lengths. I've double-checked the dates of my files, and they are all labeled such that I don't think anything could be swapped.

So my questions are:
Is this an indication that my models are unstable and I shouldn't trust these predictions?
Is the decreasing number of genes, while they also get longer, perhaps a good thing?
How do I know when to stop if genes keep getting longer?

Thanks very much,
Alice

On Fri, Dec 12, 2014 at 4:41 PM, Carson Holt wrote:
> The gene models are actually produced by SNAP, Augustus, or whatever gene predictor you are using, so if you change the HMM every round, then the models will change too. But I have one concern. You are using a very sparse protein evidence dataset. The protein dataset is very important to MAKER's performance, and for iterative training of the ab initio predictors. Normally after the second iteration, additional training should not be beneficial, but if you are getting wildly different results on the 3rd and 4th rounds, then you probably aren't getting sufficient good models to train with.
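One simple way to track how much the models change from round to round is to compute the gene count and mean gene length from each round's GFF3. The Python sketch below is only an illustration: it assumes standard tab-separated GFF3 with the feature type in column 3, stops at a trailing ##FASTA section if one is present, and the optional source filter (column 2) and the file names in the usage note are assumptions, not something taken from this thread.

```python
def gene_stats(gff_lines, source=None):
    """Return (gene_count, mean_gene_length) for 'gene' features in GFF3 lines.

    gff_lines: any iterable of lines (an open file handle works).
    source: if given, only count features whose column 2 equals this value.
    """
    lengths = []
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # sequence section follows; no more features
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue
        if source is not None and fields[1] != source:
            continue
        start, end = int(fields[3]), int(fields[4])
        lengths.append(end - start + 1)  # GFF3 coordinates are 1-based, inclusive
    count = len(lengths)
    mean = sum(lengths) / count if count else 0.0
    return count, mean


if __name__ == "__main__":
    # Toy GFF3 lines, not real data.
    toy = [
        "##gff-version 3\n",
        "s1\tmaker\tgene\t1\t100\t.\t+\t.\tID=g1\n",
        "s1\tmaker\tmRNA\t1\t100\t.\t+\t.\tID=m1;Parent=g1\n",
        "s1\tmaker\tgene\t201\t500\t.\t-\t.\tID=g2\n",
        "##FASTA\n",
        ">s1\n",
        "ACGT\n",
    ]
    count, mean = gene_stats(toy)
    print(f"{count} genes, average length {mean:.0f}")
```

To compare rounds, the same function could be called on each round's file, for example `gene_stats(open("round2.all.gff"))` with a hypothetical file name, and the numbers tabulated side by side.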
> > For a protein dataset you should be using the entire a proteome from a > minimum of two related species and perhaps all of UniProt/Swiss-prot to get > a broad protein database. Don?t use the proteins extracted by CEGMA and > HaMSTr. CEGMA can be used to guide the first HMM creation (cegma2zff scrip > that comes with MAEKR), but don?t give the proteins to MAKER as evidence, > also the HaMSTr results will be redundant with the ESTs. You need proteins > from related species to look for homology not found in the EST dataset. > > Also repeat masking is important for any genome and has a huge effect on ab > initio predictor performance. Make sure you run something like > RepeatModeler to look for species specific repeats that will not already be > in RepBase. Then add those results to the rmlib= option in the maker > control files. > > Thanks, > Carson > > > > > On Dec 12, 2014, at 7:10 AM, Dennis, Alice wrote: > > Hi all, > > I am a relatively new user to Maker2, and I?m looking for advise on running > many iterations of the same dataset in Maker2. > > I have a relatively small genome (~124 MB) from a wasp that is assembled > into ~1,500 scaffold. I have run several iterations of Maker2 by > re-generating .hmms in SNAP and feeding them into the next round, and my > gene predictions keep increasing (in number and in size). The only thing > that changes at each round is the .hmm. > This is the evidence that I give is: > - de novo assembled ESTs from a different strain of the same > species (70,000 contigs? I am currently working on improving this assembly > with the hope that this will be helpful here) > - 610 proteins extracted from the genome scaffolds using CEGMA and > HaMSTr > > For my 1st iteration, I used the Nasonia .hmm from SNAP, and the > est2genome/protein2genome option. > > For the 2nd, 3rd and 4th rounds I have used .hmms generated from the > previous round, all without the est2genome/protein2genome option. 
All other files are the same as in the original run.
>
> As I understand it, after the second round, nothing should change in Maker2. But the differences are obvious between runs. Some entirely new exons are annotated. For example, just counting "exon" in the .gff file gives me 73,000 after the third iteration and 96,000 after the fourth! Actually, the biggest leap in this number is between the third and fourth rounds. I can also see that many features are longer when I look at the files in Geneious.
>
> Is this sort of change possible after the second round of Maker2? Is there something I have done wrong in my runs, or am I understanding this output incorrectly?
>
> Thank you,
> Alice
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

--
Alice Dennis
alicebdennis at gmail.com
Postdoctoral Researcher
Institute for Integrative Biology, ETH Zürich & EAWAG
Überlandstrasse 133
P.O. Box 611
8600 Dübendorf, Switzerland
https://adennis5.wordpress.com/

From michael.s.campbell1 at gmail.com Thu Mar 26 10:50:41 2015
From: michael.s.campbell1 at gmail.com (Michael Campbell)
Date: Thu, 26 Mar 2015 09:50:41 -0600
Subject: [maker-devel] iterative Maker2
In-Reply-To:
References: <1FD5809847938F44B92893606806BD53600D845F@EE-MBX1.ee.emp-eaw.ch> <1c9847450c4142d3b7e2b81f893773f0@EE-HUB1.ee.emp-eaw.ch>
Message-ID:

Hi Alice,

In my experience, fewer, longer genes are generally a good thing (and very normal), resulting from the merging of split models and the extension of incomplete models. I find it helpful to load the annotations and evidence into a browser to get a visual idea of what is happening.

Mike

On Thu, Mar 26, 2015 at 4:34 AM, Alice Dennis wrote:
> Hello again,
>
> I posted a while ago about a genome I'm running through the Maker2 pipeline. I was concerned because my results were still changing with 3 and 4 iterations.
> > Following the very useful advice of Carson (below), I've made a few > modifications (adding a RepeatModeler run, using a big protein > database), but my gene predictions are still changing between the 3rd > and 4th iterations. Perhaps this is ok, but these increasing gene > lengths make me worry that I haven't built stable models. > > Here is the short version of what I've done. > 1. Run RepeatModeler, but this only produced 47 sequences in the > resulting .fasta... so that seemed a bit small. > > 2. Run Maker2 using: > - RepeatModeler output + "model_org=all" and "softmask=1" in the > Repeat Masking section. > - protein evidence from 2 distantly related species AND all of Uniprot > - ests from a different strain of my species (a parasitoid wasp) > - the .hmm from Nasonia, one of the 2 distantly related species whose > proteome I also provided as protein evidence > - my assembled genome of 1,509 scaffolds. > > 3. After this, I did three subsequent rounds of Maker2 (cleverly named > Rounds 2, 3 and 4). Each one used the same input, except the Nasonia > .hmm was replaced by a SNAP generated .hmm from the previous round. > Also, the est2genome and protein2genome was changed from 1 to 0 in all > runs after the first. > > Here are some results: > Round1: 14,647 genes, average length 2,491 > Round2: 12,158 genes, average length 3,760 > Round3: 13,515 genes, average length 3,090 > Round4: 12,169 genes, average length 3,918 > > This is a bit confusing because the number of genes predicted goes up > and down, as does their lengths. I've doubly checked the dates of my > files, and they are all labeled such that I don't think anything could > be swapped. > > So my questions are: > Is this an indication that my models are unstable and I shouldn't > trust these predictions? > Is the decreasing number of genes, while also getting longer perhaps a > good thing? > How do I know when to stop if genes keep getting longer? 
> > > Thanks very much, > Alice > > > On Fri, Dec 12, 2014 at 4:41 PM, Carson Holt wrote: > > The gene models are actually produced by SNAP, Augustus, or whatever gene > > predictor you are using, so if you change the HMM every round, then the > > models will change too. But I have one concern. You are using a very > > sparse protein evidence dataset. The protein dataset is very important > to > > MAKER?s performance, and for itterative training of the ab initio > > predictors. Normally after the second iteration, additional training > should > > not be beneficial, but if you are getting wildly different results on 3rd > > and 4th round, then you probably aren?t getting sufficient good models to > > train with. > > > > For a protein dataset you should be using the entire a proteome from a > > minimum of two related species and perhaps all of UniProt/Swiss-prot to > get > > a broad protein database. Don?t use the proteins extracted by CEGMA and > > HaMSTr. CEGMA can be used to guide the first HMM creation (cegma2zff > scrip > > that comes with MAEKR), but don?t give the proteins to MAKER as evidence, > > also the HaMSTr results will be redundant with the ESTs. You need > proteins > > from related species to look for homology not found in the EST dataset. > > > > Also repeat masking is important for any genome and has a huge effect on > ab > > initio predictor performance. Make sure you run something like > > RepeatModeler to look for species specific repeats that will not already > be > > in RepBase. Then add those results to the rmlib= option in the maker > > control files. > > > > Thanks, > > Carson > > > > > > > > > > On Dec 12, 2014, at 7:10 AM, Dennis, Alice > wrote: > > > > Hi all, > > > > I am a relatively new user to Maker2, and I?m looking for advise on > running > > many iterations of the same dataset in Maker2. > > > > I have a relatively small genome (~124 MB) from a wasp that is assembled > > into ~1,500 scaffold. 
I have run several iterations of Maker2 by > > re-generating .hmms in SNAP and feeding them into the next round, and my > > gene predictions keep increasing (in number and in size). The only thing > > that changes at each round is the .hmm. > > This is the evidence that I give is: > > - de novo assembled ESTs from a different strain of the same > > species (70,000 contigs? I am currently working on improving this > assembly > > with the hope that this will be helpful here) > > - 610 proteins extracted from the genome scaffolds using CEGMA > and > > HaMSTr > > > > For my 1st iteration, I used the Nasonia .hmm from SNAP, and the > > est2genome/protein2genome option. > > > > For the 2nd, 3rd and 4th rounds I have used .hmms generated from the > > previous round, all without the est2genome/protein2genome option. All > other > > files are the same as in the original run. > > > > As I understand it, after the second round, nothing should change in > Maker2. > > But the differences are obvious between runs. Some entirely new exons are > > annotated. For example, just counting ?exon? in the .gff file gives me > > 73,000 after the third iteration and 96,000 after the fourth! Actually > the > > biggest leap in this number is between the third and fourth round. I can > > also see that many features are longer when I look at the files in > Geneious. > > > > Is this sort of change possible after the second round of Maker2? Is > there > > something I have done wrong in my runs, or am a understanding this output > > incorrectly? > > > > Thank you, > > Alice > > > > _______________________________________________ > > maker-devel mailing list > > maker-devel at box290.bluehost.com > > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > > > > > > > -- > > > Alice Dennis > alicebdennis at gmail.com > > Postdoctoral Researcher > Institute for Integrative Biology, ETH Z?rich & EAWAG > ?berlandstrasse 133 > P.O. 
Box 611
> 8600 Dübendorf, Switzerland
>
> https://adennis5.wordpress.com/
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

--
Michael Campbell MS, RD.
Doctoral Candidate
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330
ph:585-3543

From rens.holmer at wur.nl Mon Mar 30 01:12:20 2015
From: rens.holmer at wur.nl (Holmer, Rens)
Date: Mon, 30 Mar 2015 06:12:20 +0000
Subject: [maker-devel] Incorporating cufflinks in maker
Message-ID: <42E0168F-C672-4B7F-97D4-98442B825BF9@wur.nl>

Hi maker team,

I am currently working on a project where we want to incorporate quite a lot of RNA-seq into our annotation. Currently I see two options:
- Provide the cufflinks output as EST-gff
- Process cufflinks output with TransDecoder (find ORFs, annotate UTR, etc) and provide this as either pred_gff or model_gff

What would you suggest, and what would be the required formatting for both options?

Thanks in advance,
Rens Holmer

From goutham.atla at gmail.com Sat Mar 28 00:37:08 2015
From: goutham.atla at gmail.com (Goutham atla)
Date: Sat, 28 Mar 2015 11:07:08 +0530
Subject: [maker-devel] Annotating Cufflinks GTF with Maker
Message-ID:

Dear All,

I have a draft genome for my organism of interest and I have around 150G of 100bp paired-end RNA-Seq data from different conditions. This organism has Ensembl annotations, but very few. My goal is to do a differential splicing analysis between two conditions. For this I need good annotations in GTF format at the isoform level. I am interested in using the Splicing Analysis Kit.

For now, I have aligned one sample to the genome using tophat2 and then used cufflinks to generate a de novo GTF file. In neither case have I used the available GTF with very few annotations.
The GTF file generated by cufflinks should be annotated to know the function of each transcript. So I am interested in adding annotations to the GTF file generated by cufflinks. What is the best way of doing it? Or is there any better way of getting a GTF file, like that of Ensembl, from my data?

I have looked at Trinotate, but it's more about functional annotation and expression studies.

Regards,
--
Goutham Atla

From avhoeck at SCKCEN.BE Mon Mar 30 11:11:16 2015
From: avhoeck at SCKCEN.BE (Van Hoeck Arne)
Date: Mon, 30 Mar 2015 16:11:16 +0000
Subject: [maker-devel] comments on Incorporating cufflinks in maker
Message-ID: <9BCA01D5BDC2AF46822CA182B4FBD0DF11826E1A@MAILSRV3.sck.be>

Dear Rens and Carson,

I would like to comment on Rens' question on adding transcript data into your annotation. Since I have no access to Google Groups, I am trying to contact you both via mail with, hopefully, the correct mail addresses.

I have tested Maker-P with multiple parameters and optimized 2 gene prediction tools (SNAP and Augustus). Both give similar results, with Maker finding around 17000 gene annotations. I also have RNAseq samples, and as a test case I also used Cufflinks output, processed with TransDecoder, to select gene annotations. With this approach, I end up with 25000 gene annotations. More or less all the genes that Maker selected were also selected by the TransDecoder approach.

So is it wise to include the information on the 8000 missing genes, based on optimized gene models? Probably there will be some false positives among these 8000 missing genes, but with good criteria and models, it may be possible for Maker to find more gene annotations.

Best regards
Arne

Hi maker team,

I am currently working on a project where we want to incorporate quite a lot of RNA-seq into our annotation.
Currently I see two options:
- Provide the cufflinks output as EST-gff
- Process cufflinks output with TransDecoder (find ORFs, annotate UTR, etc) and provide this as either pred_gff or model_gff

What would you suggest, and what would be the required formatting for both options?

Thanks in advance,
Rens Holmer

From kai.kamm at ecolevol.de Thu Mar 5 09:47:02 2015
From: kai.kamm at ecolevol.de (Kai Kamm)
Date: Thu, 05 Mar 2015 17:47:02 +0100
Subject: [maker-devel] Better resolve conflicting gene models
Message-ID: <54F88886.9010004@ecolevol.de>

Hello,

thanks for your previous advice. (Btw, how can one reply to an existing thread such that the reply will be added to the same thread?)

I am trying to find the best parameters with Maker for the annotation of my genome. I have run Maker with several combinations of parameters and predictors on my three biggest scaffolds and looked at the results in JBrowse. Overall most predictions seem fine, but there are some genes with conflicts and I have no idea why.

I have:
- 100Mb assembled genome
- Trinity RNAseq assembly
- cufflinks data (in my case they don't seem to be messy as suggested, rather a good complement to the Trinity data)
- protein evidence (related and unrelated species)
- repeat library from RepeatModeler

Gene predictors used:
- Augustus trained with transcripts from related species: seems to perform fine
- SNAP: no convergence with Augustus even after second training. Dropped it because it predicted lots of additional low-quality transcripts and sometimes disrupted final Maker transcripts.
- Genemark: converged with Augustus after training (introns received from TopHat2 output).
Tends to predict some additional transcripts (compared to Augustus). Few (but some) of these are covered by evidence and thus become final Maker transcripts.

So the combination of Augustus and Genemark seems optimal. In general both perform well in Maker and tend to predict the same transcripts. However, I still observe some problems in the behavior of Maker which I don't understand:

Example 1: One of the predictors predicts a small additional exon at the start which is also covered by protein or EST data. But sometimes Maker chooses the other predictor's model for the final transcript. Mostly these are minor differences, but I don't understand this behavior.

Example 2: There are some extreme cases, like an Augustus prediction with 17 exons which are all covered by Trinity and cufflinks isoforms. Genemark instead predicts two separate small genes with 2 and 4 exons respectively. The resulting final transcript has 7 exons, and the additional evidence from the Trinity and cufflinks data is treated as UTR.

So I thought Augustus seems a little more accurate and ran Maker only with Augustus to resolve such conflicts, even though I would lose the few additional transcripts from Genemark. This is what happened:
- The gene in Example 2 now has all 17 exons. This is good!
- Sadly, another gene with several exons, which was formerly predicted by both Augustus and Genemark and is also covered by cufflinks and Trinity transcripts, now consists of only two small exons in the final transcript, even though Augustus still predicts the same exons and the same evidence is present. Only the Genemark prediction, which was almost identical to Augustus, is absent. This I completely don't understand.

I don't worry about the minor differences. The extreme cases are like two genes in a hundred and I don't understand the behavior. I was thinking that in case of conflicting models Maker will choose the one that best fits the evidence.
Obviously with most conflicts this is what happens, because the majority of the final models look OK. But not in the above-mentioned cases, and I don't understand why. Is there any parameter I missed to better resolve such conflicts?

Best

From bmoore at genetics.utah.edu Thu Mar 5 17:20:52 2015
From: bmoore at genetics.utah.edu (Barry Moore)
Date: Fri, 6 Mar 2015 00:20:52 +0000
Subject: [maker-devel] Maker Software Question
In-Reply-To: <6ECE70F32BE356478EC060940C2AA4D101175BBF02@CVMMB02.cvm.tamu.edu>
References: <6ECE70F32BE356478EC060940C2AA4D101175BBF02@CVMMB02.cvm.tamu.edu>
Message-ID:

Hi Chris,

I don't know if others have responded to you already, but I just found this e-mail unanswered in the depths of my inbox, so apologies for no reply. I'm going to forward your e-mail along to the full MAKER mailing list so that you'll get the benefit of a response from the full group of MAKER developers.

MAKER does include a couple of scripts that add functional annotation to MAKER GFF3 output. This process is described in the recent paper:

Campbell MS, Holt C, Moore B, Yandell M. Genome Annotation and Curation Using MAKER and MAKER-P. Curr Protoc Bioinformatics. http://www.ncbi.nlm.nih.gov/pubmed/25501943

Mike, do you have a PDF of the final print version of that which you could send directly to Christopher?

B

On Jan 16, 2015, at 8:38 AM, Seabury, Christopher wrote:

Dear Colleagues,

I would like to quickly ask about a specific routine/possible function in MAKER. Previously, we have essentially made home-made versions of MAKER by way of multi-step programming. At present we are exploring MAKER, but are wondering if MAKER has the ability to populate the GFF with gene/protein ID information. As an initial experiment, we gave MAKER cDNA RefSeqs, plus the corresponding protein RefSeqs, and a reference, but do not see the gene/protein ID in the GFF. Is there a subroutine for this, or an option we have missed?

Thanks and Kind Regards,

Christopher M.
Seabury PhD
Associate Professor
Department of Veterinary Pathobiology
College of Veterinary Medicine
Texas A&M University
College Station, TX 77843-4467
cseabury at cvm.tamu.edu
Mobile: 979-492-6400

From bmoore at genetics.utah.edu Mon Mar 9 12:12:10 2015
From: bmoore at genetics.utah.edu (Barry Moore)
Date: Mon, 9 Mar 2015 18:12:10 +0000
Subject: [maker-devel] Does the maker google forum works? -[Doubt] maker2zff line 109
In-Reply-To: <20150309190735.Horde.OZGtWxZkHnkFxArh6ONp6A2@webmail.um.es>
References: <20150309190735.Horde.OZGtWxZkHnkFxArh6ONp6A2@webmail.um.es>
Message-ID: <13826169-DFBB-4E32-AFB1-A01CB3EEE0A4@genetics.utah.edu>

Hi Javier,

The Google forum is only a mirror of the actual mailing list. It exists so that the list gets archived by Google (better Google search results for MAKER users), so posting is not allowed there. Please join the official MAKER mailing list at:

http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

Thanks,

B

On Mar 9, 2015, at 12:07 PM, FRANCISCO JAVIER SANCHEZ GARCIA wrote:

Good morning.

I am having a problem writing messages in the MAKER help forum on Google Groups. I don't know whether the problem is on my side or whether it is a problem with the permissions, but I can't see the red button for new threads.

Anyway, I will try to show my problem with maker2zff, which does not work. My version is v2.31.8.

../maker/marker_v2.31.8/maker/bin/maker2zff ../sequences.all.gff everything.ann everything.dna
[sudo] password for soba:
No such file or directory at ../maker/marker_v2.31.8/maker/bin/maker2zff line 109, line 1922870.

I read something about problematic characters in the ID, but I don't know whether that applies to my case.

http://gmod.827538.n3.nabble.com/Re-Error-when-running-maker2zff-script-td4041743.html

Regards. Thanks in advance.
Fco Javier Sánchez-García
PhD student Forest Entomology
Área de Biología Animal
Departamento de Zoología y Antropología Física
Facultad de Veterinaria
Universidad de Murcia
Campus de Espinardo
30100 Murcia (España-Spain)
Telf. +34 660 500 416 (mobile phone)
+34 868 888 031 (laboratory-work phone)
http://scholar.google.es/citations?user=AKoUT8cAAAAJ&hl
http://www.researchgate.net/profile/Francisco_Sanchez-Garcia
http://orcid.org/0000-0002-5442-0292
http://www.researcherid.com/rid/M-2407-2014

From javiersg at um.es Mon Mar 9 16:27:00 2015
From: javiersg at um.es (FRANCISCO JAVIER SANCHEZ GARCIA)
Date: Mon, 09 Mar 2015 23:27:00 +0100
Subject: [maker-devel] [Doubt] maker2zff line 109
Message-ID: <20150309232700.Horde.39Aj_iELfhGei1Bv5JG1fw6@webmail.um.es>

Good night everyone.

I will try to show my problem with maker2zff, which does not work. My version is v2.31.8. The line the error message points to is the last line of the GFF file; the message says that it cannot find the file or directory.

../maker/marker_v2.31.8/maker/bin/maker2zff
../sequences.all.gff everything.ann everything.dna
[sudo] password for soba:
No such file or directory at ../maker/marker_v2.31.8/maker/bin/maker2zff line 109, line 1922870.

I read something about problematic characters in the ID, but I don't know whether that applies to my case.

http://gmod.827538.n3.nabble.com/Re-Error-when-running-maker2zff-script-td4041743.html

Regards. Thanks in advance.

Fco Javier Sánchez-García
PhD student Forest Entomology
Área de Biología Animal
Departamento de Zoología y Antropología Física
Facultad de Veterinaria
Universidad de Murcia
Campus de Espinardo
30100 Murcia (España-Spain)
Telf. +34 660 500 416 (mobile phone)
+34 868 888 031 (laboratory-work phone)
http://scholar.google.es/citations?user=AKoUT8cAAAAJ&hl
http://www.researchgate.net/profile/Francisco_Sanchez-Garcia
http://orcid.org/0000-0002-5442-0292
http://www.researcherid.com/rid/M-2407-2014

From carsonhh at gmail.com Thu Mar 12 13:50:44 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 12 Mar 2015 13:50:44 -0600
Subject: [maker-devel] Better resolve conflicting gene models
In-Reply-To: <54F88886.9010004@ecolevol.de>
References: <54F88886.9010004@ecolevol.de>
Message-ID: <7B8D93E6-DEE7-43D7-BFBF-38E474311A6C@gmail.com>

Sorry for the slow reply.

> how can one reply to an existing thread such that the reply will be added to the same thread?

Threads show up under the Google archive, but that is really just an artifact of how things are archived. The mailing list itself doesn't have threads, just a subject line like any other e-mail. But because we are archiving to a message board, any messages with the same subject get turned into parts of the same thread.

> Example 1: One of the predictors predicts a small additional exon at the start which is also covered by protein or EST data. But sometimes Maker chooses the other predictors model for the final transcript.
> Mostly these are minor differences but I don't understand this behavior?

The gene chosen by MAKER is the one that best matches the evidence. The match is measured by a numeric value called AED (lower means a better match). If a model predicts an exon or a base pair that is not supported by an evidence alignment, then it will be penalized. If a model fails to predict a base pair that is supported by evidence, then it will also be penalized. The differences in your results are because you changed something between runs: either you added or removed a predictor (which gives MAKER a different set of models to choose from) or the evidence changed (which gives a different AED score). Whichever model has the lowest AED score will always be kept, so some change you made resulted either in a different set of available models to choose from or in a different AED score. Also be aware that models cannot overlap on the same strand, so if you made a change that results in a model being removed from consideration, you may have removed an overlap that was precluding another model with a slightly higher AED from being chosen.

> Example 2: there are some extreme cases like an Augustus prediction with 17 exons which are all covered by Trinity and cufflinks isoforms. Genemark instead predicts two separate small genes with 2 and 4 exons respectively. The resulting final transcript has 7 exons and the additional evidence from the trinity and cufflinks data is treated as UTR.
>
> - Sadly another gene with several exons, which was formerly predicted by both Augustus and Genemark and is also covered by cufflinks and trinity transcripts, now consists only of two small exons in the final transcript.
> Even though Augustus still predicts the same exons and the same evidence is present - only the Genemark prediction is absent which was almost identical to Augustus. This I completely don't understand.

The model chosen will always be the one with the lowest AED.
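As a rough illustration of the AED comparison described above, here is a minimal Python sketch. It assumes the commonly cited base-pair definition, AED = 1 - (sensitivity + specificity) / 2, and treats all evidence as one pooled set of intervals. The function names and the coordinates are hypothetical, and this is not MAKER's own code, which handles evidence types and overlaps in more detail.

```python
def covered_positions(intervals):
    """Return the set of positions covered by inclusive (start, end) intervals."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end + 1))
    return positions


def aed(model_exons, evidence_alignments):
    """Simplified base-pair AED: 0 = perfect agreement, 1 = no agreement."""
    model = covered_positions(model_exons)
    evidence = covered_positions(evidence_alignments)
    if not model or not evidence:
        return 1.0
    overlap = len(model & evidence)
    sensitivity = overlap / len(evidence)  # evidence bases the model recovers
    specificity = overlap / len(model)     # model bases the evidence supports
    return 1.0 - (sensitivity + specificity) / 2.0


if __name__ == "__main__":
    # Hypothetical locus: two evidence-supported exons.
    evidence = [(100, 200), (300, 400)]
    candidates = {
        "exact": [(100, 200), (300, 400)],
        "missing_exon": [(100, 200)],
        "extra_exon": [(100, 200), (300, 400), (500, 600)],
    }
    for name, exons in candidates.items():
        print(name, round(aed(exons, evidence), 3))
```

Both the model that misses a supported exon and the model that adds an unsupported one score worse than the exact model, which is the two-sided penalty the reply describes.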
The changes you describe all come from a model with a lower AED supplanting another, which may have opened up a region that was previously overlapped by a longer gene with a poor AED score.

I would also recommend not including cufflinks output if you have Trinity data. Cufflinks results often falsely bridge regions and will cause gene mergers by falsely extending evidence clusters. This makes it appear as if genes should extend into regions where they shouldn't. Given those hints, predictors will then try to predict at least something in those regions, and if they do, the models will get a good AED score since they are supported by evidence, even if it is false evidence.

--Carson

From carsonhh at gmail.com Thu Mar 12 14:03:11 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 12 Mar 2015 14:03:11 -0600
Subject: [maker-devel] maker-devel post from avhoeck@sckcen.be requires approval
In-Reply-To:
References:
Message-ID: <73D91049-1CB8-4C59-A9E5-58374FCB0789@gmail.com>

Hi Arne,

The genes found by CEGMA are very short genes, so while those genes might be identifiable, at least partially, on shorter contigs, other genes will be far longer. So the 87% value you get from CEGMA is probably what you are going to be able to find (even partially) for the entire genome. Remember that gene size, when you include the introns, can actually be quite large, and gene predictors need flanking signals from the sequence several hundred bp upstream and downstream to make a prediction, so having a gene only partially contained in a contig makes that gene un-annotatable for ab initio predictors. Unless the organism has a high gene density or very few introns, you usually won't be able to find a gene on contigs smaller than ~10kb.
--Carson

> On Mar 12, 2015, at 10:38 AM
>
> From: Van Hoeck Arne
> To: "maker-devel at yandell-lab.org"
> Subject: TACC lonestar and N50 value
> Date: March 12, 2015 at 10:38:42 AM MDT
>
> Dear MAKER developer,
>
> We have a plant genome of about 450 Mbp with an N50 value of 20 kbp, whereas only 3/4 (333 Mbp) are contigs longer than 10 kbp. CEGMA said that 87% of the genes were found, whereas 94% were partially identified. You said last time that contigs smaller than 10 kbp are not ideal for annotating and that it is preferable to throw them away. Does this mean that I lose all genes present in the small contigs? Or is there another way to annotate them? (Is concatenating all the small contigs together with 500 N's between each contig an option?)
>
> Besides, I could successfully run Maker via iPlant's Atmosphere. However, for my large genomes I registered myself at the TACC Lonestar cluster, but Dave C. replied that I won't be able to run on the TACC supercomputers without an allocation. He said that I need to contact my PI. With my login ID, I have no access to the cluster via ssh since my permission was denied. Therefore, is it possible to use the TACC supercomputers to run MAKER?
>
> Best regards
> Arne
>
> Consider the environment before you print
> Denk aan het milieu voor u deze e-mail print
> Pensez à l'environnement avant d'imprimer
>
> SCK•CEN Disclaimer: http://www.sckcen.be/en/e-mail_disclaimer
From avhoeck at SCKCEN.BE Thu Mar 12 10:38:42 2015
From: avhoeck at SCKCEN.BE (Van Hoeck Arne)
Date: Thu, 12 Mar 2015 16:38:42 +0000
Subject: [maker-devel] TACC lonestar and N50 value
Message-ID: <9BCA01D5BDC2AF46822CA182B4FBD0DF117AB267@MAILSRV3.sck.be>

Dear MAKER developer,

We have a plant genome of about 450 Mbp with an N50 value of 20 kbp, whereas only 3/4 (333 Mbp) are contigs longer than 10 kbp. CEGMA said that 87% of the genes were found, whereas 94% were partially identified. You said last time that contigs smaller than 10 kbp are not ideal for annotating and that it is preferable to throw them away. Does this mean that I lose all genes present in the small contigs? Or is there another way to annotate them? (Is concatenating all the small contigs together with 500 N's between each contig an option?)

Besides, I could successfully run Maker via iPlant's Atmosphere. However, for my large genomes I registered myself at the TACC Lonestar cluster, but Dave C. replied that I won't be able to run on the TACC supercomputers without an allocation. He said that I need to contact my PI. With my login ID, I have no access to the cluster via ssh since my permission was denied. Therefore, is it possible to use the TACC supercomputers to run MAKER?

Best regards
Arne

Consider the environment before you print
Denk aan het milieu voor u deze e-mail print
Pensez à l'environnement avant d'imprimer

SCK•CEN Disclaimer: http://www.sckcen.be/en/e-mail_disclaimer

From mtollis at asu.edu Fri Mar 13 14:50:33 2015
From: mtollis at asu.edu (Marc Tollis)
Date: Fri, 13 Mar 2015 13:50:33 -0700
Subject: [maker-devel] Question about pre-masked genome.
Message-ID:

Hello,

I have a genome that has been well-annotated for species-specific repeats (using RepeatModeler, trf, and RepeatMasker), and was wondering if I could simply use the soft-masked assembly for maker annotation, in order to skip the sometimes cumbersome repeatmasking step. Is this generally looked upon as doable, or should I just stick with using the unmasked assembly and the repeatmasking as bundled with my maker installation?

P.S. - as a spoiler, I already tried this (for snap training), but could not confirm that the masked assembly (i.e., lowercase letters) was maintained in the scaffolds as viewed in the maker output directory.

Thanks,
Marc

From mtollis at asu.edu Fri Mar 13 15:48:46 2015
From: mtollis at asu.edu (Marc Tollis)
Date: Fri, 13 Mar 2015 14:48:46 -0700
Subject: [maker-devel] Question about pre-masked genome.
Message-ID:

Hello,

I have a genome that has been well-annotated for species-specific repeats (using RepeatModeler, trf, and RepeatMasker), and was wondering if I could simply use the soft-masked assembly for maker annotation, in order to skip the sometimes cumbersome repeatmasking step. Is this generally looked upon as doable, or should I just stick with using the unmasked assembly and the repeatmasking as bundled with my maker installation?

P.S. - as a spoiler, I already tried this (for snap training), but could not confirm that the masked assembly (i.e., lowercase letters) was maintained in the scaffolds as viewed in the maker output directory.

Thanks,
Marc

--
*Marc Tollis, Ph.D.*
*Post-Doctoral Research Associate*
*Arizona State University*
*LSE 313*
*(480) 965-7456*
marc.tollis at asu.edu
*website: *https://sites.google.com/site/tollisresearch/
*blog: *anolistollis.wordpress.com
From dence at genetics.utah.edu Fri Mar 13 18:14:52 2015
From: dence at genetics.utah.edu (Daniel Ence)
Date: Sat, 14 Mar 2015 00:14:52 +0000
Subject: [maker-devel] Question about pre-masked genome.
In-Reply-To:
References:
Message-ID: <51DE3D97-10DB-4FF9-A793-EEA051C705BD@genetics.utah.edu>

Hi Marc,

Several groups have been doing what you described with fungal genomes for a while and it seems to be working for them. With the soft-masked genome, maker will still allow EST alignments to extend into the masked regions; if it were a hard-masked genome, this wouldn't be possible.

Let us know how it works out though!

Thanks,
Daniel

On Mar 13, 2015, at 3:48 PM, Marc Tollis wrote:

Hello,

I have a genome that has been well-annotated for species-specific repeats (using RepeatModeler, trf, and RepeatMasker), and was wondering if I could simply use the soft-masked assembly for maker annotation, in order to skip the sometimes cumbersome repeatmasking step. Is this generally looked upon as doable, or should I just stick with using the unmasked assembly and the repeatmasking as bundled with my maker installation?

P.S. - as a spoiler, I already tried this (for snap training), but could not confirm that the masked assembly (i.e., lowercase letters) was maintained in the scaffolds as viewed in the maker output directory.

Thanks,
Marc

--
Marc Tollis, Ph.D.
Post-Doctoral Research Associate
Arizona State University
LSE 313
(480) 965-7456
marc.tollis at asu.edu
website: https://sites.google.com/site/tollisresearch/
blog: anolistollis.wordpress.com

_______________________________________________
maker-devel mailing list
maker-devel at box290.bluehost.com
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
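One way to check whether soft-masking (lowercase letters) survives in a given FASTA file, as asked in the P.S. of the message above, is to count lowercase bases per sequence. The following is a minimal Python sketch, not part of MAKER; the function name and the file name in the usage comment are hypothetical.

```python
def softmask_fraction(fasta_lines):
    """Return {sequence_id: fraction of lowercase bases} for FASTA-formatted lines."""
    counts = {}  # sequence_id -> [lowercase_bases, total_bases]
    seq_id = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            counts[seq_id] = [0, 0]
        elif seq_id is not None:
            counts[seq_id][0] += sum(1 for base in line if base.islower())
            counts[seq_id][1] += len(line)
    return {
        name: (lower / total if total else 0.0)
        for name, (lower, total) in counts.items()
    }


if __name__ == "__main__":
    # Hypothetical usage on a real file:
    #   with open("scaffolds.fasta") as handle:
    #       print(softmask_fraction(handle))
    demo = [">s1 example", "ACGTacgt", "NNnn", ">s2", "ACGT"]
    print(softmask_fraction(demo))
```

A fraction of zero for every scaffold would suggest that the file being inspected holds an unmasked or hard-masked copy rather than the soft-masked input.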
From mtollis at asu.edu Sun Mar 15 08:19:37 2015
From: mtollis at asu.edu (Marc Tollis)
Date: Sun, 15 Mar 2015 07:19:37 -0700
Subject: [maker-devel] control file for SNAP training
Message-ID:

This is a question about process, to make sure I am doing things right (when time is of the essence, some mistakes can set you back weeks).

I have run maker on my de novo vertebrate genome, using only the predicted proteome from a congener (well-studied and available on Ensembl), and generated the HMM for the first round of SNAP training. As per the 2014 tutorial, I edited the control file for this step as follows: I added the path to the .hmm file, and set protein2genome to 0.

When I run maker, I notice that in addition to snap, it is still running blastx and exonerate. I noticed that this is because I did not remove (or "comment out") the path to the protein.fa in the control file (the output looks markedly different when I do comment out the protein file, and I can't even tell if it's running snap in this instance).

Is it simply using exonerate to place the ab initio predictions on the scaffolds (meaning that protein2genome=1 just tells maker to make evidence-based annotations)? Did I do this correctly, or should I also remove the protein.fa from the control file for SNAP training?

--
*Marc Tollis, Ph.D.*
*Post-Doctoral Research Associate*
*Arizona State University*
*LSE 313*
*(480) 965-7456*
marc.tollis at asu.edu
*website: *https://sites.google.com/site/tollisresearch/
*blog: *anolistollis.wordpress.com
From steinj at cshl.edu Mon Mar 16 07:29:36 2015
From: steinj at cshl.edu (Stein, Joshua)
Date: Mon, 16 Mar 2015 13:29:36 +0000
Subject: [maker-devel] TACC lonestar and N50 value
In-Reply-To: <9BCA01D5BDC2AF46822CA182B4FBD0DF117AB267@MAILSRV3.sck.be>
References: <9BCA01D5BDC2AF46822CA182B4FBD0DF117AB267@MAILSRV3.sck.be>
Message-ID: <0A0DBB88-EC95-43B7-AA75-B7858B67A8F4@cshl.edu>

Hi Arne,

I have experience with iPlant resources and with MAKER-P. I would encourage you to try the Atmosphere image (7888b8e1-c006-4794-82d9-4c940ddbf4c6). You can request a large instance (up to 16 CPUs and 128 GB memory) and run in MPI mode to distribute the work. Please see this tutorial, which includes information on running in MPI mode: https://pods.iplantcollaborative.org/wiki/display/sciplant/MAKER-P+Atmosphere+Tutorial.

You can also access the TACC Lonestar installation using the iPlant Discovery Environment. There is an app called "MAKER-P-Lonestar-Small-Genomes 2.3". Although it is advertised as appropriate for "small" genomes, I think there is a good chance that it will work for 450 Mb. This is a new resource and the iPlant team would value any feedback and benchmarks on how the system is working. Depending on how this goes, there are plans to roll out additional apps intended for larger genomes. Here is a tutorial: https://pods.iplantcollaborative.org/wiki/display/sciplant/Tutorial+for+running+MAKER-P+on+TACC-Lonestar+from+iPlant+Discovery+Environment

Regarding contig sizes, though not ideal, you can include contigs smaller than 10 kbp in your run. Plant genes tend to be more compact than vertebrate genes, so you ought to be able to recover annotations on the smaller contigs, though keep an eye out for truncated genes.

Best,
Josh

On Mar 13, 2015, at 6:06 PM, Van Hoeck Arne wrote:

Dear MAKER developer,

We have a plant genome of about 450 Mbp with an N50 value of 20 kbp, whereas only 3/4 (333 Mbp) are contigs longer than 10 kbp.
CEGMA said that 87% of the genes were found, whereas 94% were partially identified. You said last time that contigs smaller than 10 kbp are not ideal for annotating and that it is preferable to throw them away. Does this mean that I lose all genes present in the small contigs? Or is there another way to annotate them? (Is concatenating all the small contigs together with 500 N's between each contig an option?)

Besides, I could successfully run Maker via iPlant's Atmosphere. However, for my large genomes I registered myself at the TACC Lonestar cluster, but Dave C. replied that I won't be able to run on the TACC supercomputers without an allocation. He said that I need to contact my PI. With my login ID, I have no access to the cluster via ssh since my permission was denied. Therefore, is it possible to use the TACC supercomputers to run MAKER?

Best regards
Arne

Consider the environment before you print
Denk aan het milieu voor u deze e-mail print
Pensez à l'environnement avant d'imprimer

SCK•CEN Disclaimer: http://www.sckcen.be/en/e-mail_disclaimer

Joshua Stein, PhD
Manager, Sci. Informatics III
Cold Spring Harbor Laboratory
steinj at cshl.edu
http://ware.cshl.org/

From mtollis at asu.edu Tue Mar 17 15:26:44 2015
From: mtollis at asu.edu (Marc Tollis)
Date: Tue, 17 Mar 2015 14:26:44 -0700
Subject: [maker-devel] control file for SNAP training
In-Reply-To:
References:
Message-ID:

I answered my own question:

There is no need to re-align the proteins; that takes too long. So, I used the GFF file produced by running gff3_merge on the log file from the first run (the one with just protein2genome). Then, after generating the .hmm file, I put it in my control file, set protein2genome=0, removed the protein.fasta, pointed maker_gff at the merged GFF file, and set protein_pass=1.
The output now shows that only snap is running, and no blastx and exonerate - a relief because it is much faster!

On Sun, Mar 15, 2015 at 7:19 AM, Marc Tollis wrote:

> This is a question about process, and to make sure I am doing things right (when time is of the essence, some mistakes can set you back weeks).
>
> I have run maker on my de novo vertebrate genome, using only the predictive proteome from a congener (well-studied and available on Ensembl), and generated the HMM for the first round of SNAP training. As per the 2014 tutorial, I edited the control file for this step as follows: I added the path to the .hmm file, and set protein2genome to 0.
>
> When I run maker, I notice that in addition to snap, it is still running blastx and exonerate however. I noticed that this is because I did not remove (or "comment out") the path to the protein.fa in the control file (the output looks markedly different when I do comment out the protein file - and I can't even tell if it's running snap in this instance).
>
> Is it simply using exonerate to place the ab initio predictions on the scaffolds (meaning that having protein2genome=1 is to tell maker to make evidence annotations)? Did I do this correctly, or should I also remove the protein.fa out of the control file for SNAP training?
>
> --
> *Marc Tollis, Ph.D.*
> *Post-Doctoral Research Associate*
> *Arizona State University*
> *LSE 313*
> *(480) 965-7456*
> marc.tollis at asu.edu
>
> *website: *https://sites.google.com/site/tollisresearch/
> *blog: *anolistollis.wordpress.com

--
*Marc Tollis, Ph.D.*
*Post-Doctoral Research Associate*
*Arizona State University*
*LSE 313*
*(480) 965-7456*
marc.tollis at asu.edu
*website: *https://sites.google.com/site/tollisresearch/
*blog: *anolistollis.wordpress.com
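The control-file changes described in the message above can be applied by hand, but as a rough sketch here is a small Python helper that rewrites `key=value #comment` lines in a MAKER-style control file. The option names (`maker_gff`, `protein_pass`, `protein`, `protein2genome`, `snaphmm`) are the ones mentioned in this thread or standard in `maker_opts.ctl`; the helper itself and the file names `first_run.all.gff` and `first_run.hmm` are hypothetical placeholders.

```python
import re

# Matches lines such as "protein2genome=0 #infer predictions from protein homology".
CTL_LINE = re.compile(r"^(\w+)=([^#]*)(#.*)?$")


def set_ctl_options(ctl_text, updates):
    """Return ctl_text with the given options rewritten; trailing comments are kept."""
    remaining = dict(updates)
    new_lines = []
    for line in ctl_text.splitlines():
        match = CTL_LINE.match(line)
        if match and match.group(1) in remaining:
            key, comment = match.group(1), match.group(3)
            new_line = f"{key}={remaining.pop(key)}"
            if comment:
                new_line += f" {comment}"
            new_lines.append(new_line)
        else:
            new_lines.append(line)
    if remaining:
        raise KeyError(f"options not found in control file: {sorted(remaining)}")
    return "\n".join(new_lines) + "\n"


# Settings for the SNAP-only second pass described above (file names are placeholders).
SECOND_PASS = {
    "maker_gff": "first_run.all.gff",  # merged GFF3 from the first run
    "protein_pass": "1",               # reuse protein alignments stored in maker_gff
    "protein": "",                     # no protein FASTA, so no new blastx/exonerate
    "protein2genome": "0",             # stop inferring models directly from proteins
    "snaphmm": "first_run.hmm",        # SNAP HMM trained on the first-pass models
}
```

Reading `maker_opts.ctl`, passing its text through `set_ctl_options(text, SECOND_PASS)`, and writing the result back reproduces the edit; raising on unknown keys is a design choice to catch misspelled option names.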
From carsonhh at gmail.com Tue Mar 17 20:47:50 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 17 Mar 2015 20:47:50 -0600
Subject: [maker-devel] control file for SNAP training
In-Reply-To:
References:
Message-ID:

You can also add an additional protein file if you want more evidence than is already in the GFF3. This makes re-annotation and adding evidence to an already annotated genome really easy.

--Carson

> On Mar 17, 2015, at 3:26 PM, Marc Tollis wrote:
>
> I answered my own question:
> No need to re-align proteins again - takes too long.
> So, I used the gff file from the gff3_merge on the log file from the first run (the one with just protein2genome). Then, after generating the .hmm file, I put it in my control file, along with protein2genome=0, removed the protein.fasta, set maker_gff and protein_pass=1. The output now shows that only snap is running, and no blastx and exonerate - a relief because it is much faster!
>
> On Sun, Mar 15, 2015 at 7:19 AM, Marc Tollis wrote:
> This is a question about process, and to make sure I am doing things right (when time is of the essence, some mistakes can set you back weeks).
>
> I have run maker on my de novo vertebrate genome, using only the predictive proteome from a congener (well-studied and available on Ensembl), and generated the HMM for the first round of SNAP training. As per the 2014 tutorial, I edited the control file for this step as follows: I added the path to the .hmm file, and set protein2genome to 0.
>
> When I run maker, I notice that in addition to snap, it is still running blastx and exonerate however. I noticed that this is because I did not remove (or "comment out") the path to the protein.fa in the control file (the output looks markedly different when I do comment out the protein file - and I can't even tell if it's running snap in this instance).
>
> Is it simply using exonerate to place the ab initio predictions on the scaffolds (meaning that having protein2genome=1 is to tell maker to make evidence annotations)? Did I do this correctly, or should I also remove the protein.fa out of the control file for SNAP training?
>
> --
> Marc Tollis, Ph.D.
> Post-Doctoral Research Associate
> Arizona State University
> LSE 313
> (480) 965-7456
> marc.tollis at asu.edu
>
> website: https://sites.google.com/site/tollisresearch/
> blog: anolistollis.wordpress.com
>
> --
> Marc Tollis, Ph.D.
> Post-Doctoral Research Associate
> Arizona State University
> LSE 313
> (480) 965-7456
> marc.tollis at asu.edu
>
> website: https://sites.google.com/site/tollisresearch/
> blog: anolistollis.wordpress.com

From Brian.Mack at ARS.USDA.GOV Fri Mar 20 07:17:09 2015
From: Brian.Mack at ARS.USDA.GOV (Mack, Brian)
Date: Fri, 20 Mar 2015 13:17:09 +0000
Subject: [maker-devel] est2genome wrong strand
Message-ID:

Hi,

I've noticed what seems to be a flipping of the strand of some of my transcripts from est2genome. I assembled my directional RNA-seq reads using Trinity. I've copied an example below. The blastn run within maker shows the transcript aligning on the positive strand, as does blastn against my genome in SequenceServer. But the est2genome result shows the strand to be negative. I've noticed this for quite a few transcripts while examining the results in WebApollo. Any ideas what might be causing this?
Thanks,
Brian

Query= comp17103_c1_seq3 len=612 path=[8488307:0-177 8492039:178-496

>contig_69
Length=108040

 Score = 1043 bits (1156),  Expect = 0.0
 Identities = 589/592 (99%), Gaps = 3/592 (1%)
 Strand=Plus/Plus

Query  24      TCTTTATTCTTTTATTTCCACTTGAGCAATTATTTCCGGGTCAACCTATTCGGTCGTTCT  83
               ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  105546  TCTTTATTCTTTTATTTCCACTTGAGCAATTATTTCCGGGTCAACCTATTCGGTCGTTCT  105605

Query  84      CTCCGTTGAGCC-TTCCCTCCCCAAGTAATATTGGAAAGTCGTTCTCTCGCTCATAATTT  142
               |||||||||||| |||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  105606  CTCCGTTGAGCCCTTCCCTCCCCAAGTAATATTGGAAAGTCGTTCTCTCGCTCATAATTT  105665

69 blastn expressed_sequence_match 105546 106137 559 + . ID=69:hit:182380:3.2.0.0;Name=comp17103_c1_seq3
69 blastn match_part 105546 106137 559 + . ID=69:hsp:377369:3.2.0.0;Parent=69:hit:182380:3.2.0.0;Target=comp17103_c1_seq3 24 612 +;Gap=M70 D1 M85 D1 M75 D1 M359
69 est2genome expressed_sequence_match 105546 106137 2909 - . ID=69:hit:182644:3.2.0.0;Name=comp17103_c1_seq3
69 est2genome match_part 105546 106137 2909 - . ID=69:hsp:377775:3.2.0.0;Parent=69:hit:182644:3.2.0.0;Target=comp17103_c1_seq3 24 612 -;Gap=M70 D1 M85 D1 M75 D1 M359

From carsonhh at gmail.com Fri Mar 20 08:54:28 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 20 Mar 2015 08:54:28 -0600
Subject: [maker-devel] est2genome wrong strand
In-Reply-To:
References:
Message-ID:

Hi Brian,

Multi-exon ESTs are stranded by their splice sites, which can only work on one strand. Single exon ESTs on the other hand are stranded by their open reading frame. The standard chemistry used in most sequencing does not allow for strand-specific sequencing. There are technologies that do, but unless you used one of those, re-stranding single exon ESTs based on ORF is the best way to figure out where single exon alignments should go (not a completely reliable method but it works most of the time).
I believe Trinity tries to determine strand the same way (but it is unreliable there too). For example, since your alignment is not an exact match to the genomic sequence (there is a single bp deletion in the alignment), the best open reading frame in the transcript is not the same as the best open reading frame in the genomic sequence (so one of them likely contains an error). MAKER, since it is annotating the genome, logically re-strands the alignment according to the genome, and Trinity (being unaware of the genome) strands it according to the transcript.

Because single exon alignments are very unreliable, they are ignored in MAKER by default. They will not be used as hints for gene predictors and can only be used to support a gene if there is also protein evidence or a single exon ab initio prediction at the same location to support it (even then this will only happen if you set single_exon=1 in the control files).

--Carson

On Mar 20, 2015, at 7:17 AM, Mack, Brian wrote:

> Hi, I've noticed a what seems to be a flipping of the strand of some of my transcripts from est2genome. I assembled my directional rna-seq reads using Trinity. I've copied an example below. The blastn within maker shows the transcript aligning on the positive strand as does blastn against my genome in sequencserver. But the est2genome shows the strand to be negative. I've noticed this for quite a few transcripts while examining it in WebApollo. Any ideas what might be causing this?
>
> Thanks,
> Brian
>
> Query= comp17103_c1_seq3 len=612 path=[8488307:0-177 8492039:178-496
>
> >contig_69
> Length=108040
>
>  Score = 1043 bits (1156),  Expect = 0.0
>  Identities = 589/592 (99%), Gaps = 3/592 (1%)
>  Strand=Plus/Plus
>
> Query  24      TCTTTATTCTTTTATTTCCACTTGAGCAATTATTTCCGGGTCAACCTATTCGGTCGTTCT  83
>                ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
> Sbjct  105546  TCTTTATTCTTTTATTTCCACTTGAGCAATTATTTCCGGGTCAACCTATTCGGTCGTTCT  105605
>
> Query  84      CTCCGTTGAGCC-TTCCCTCCCCAAGTAATATTGGAAAGTCGTTCTCTCGCTCATAATTT  142
>                |||||||||||| |||||||||||||||||||||||||||||||||||||||||||||||
> Sbjct  105606  CTCCGTTGAGCCCTTCCCTCCCCAAGTAATATTGGAAAGTCGTTCTCTCGCTCATAATTT  105665
>
> 69 blastn expressed_sequence_match 105546 106137 559 + . ID=69:hit:182380:3.2.0.0;Name=comp17103_c1_seq3
> 69 blastn match_part 105546 106137 559 + . ID=69:hsp:377369:3.2.0.0;Parent=69:hit:182380:3.2.0.0;Target=comp17103_c1_seq3 24 612 +;Gap=M70 D1 M85 D1 M75 D1 M359
> 69 est2genome expressed_sequence_match 105546 106137 2909 - . ID=69:hit:182644:3.2.0.0;Name=comp17103_c1_seq3
> 69 est2genome match_part 105546 106137 2909 - . ID=69:hsp:377775:3.2.0.0;Parent=69:hit:182644:3.2.0.0;Target=comp17103_c1_seq3 24 612 -;Gap=M70 D1 M85 D1 M75 D1 M359
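The ORF-based stranding of single exon alignments described in the reply above can be illustrated with a deliberately simplified Python sketch. It assumes that "best open reading frame" means the longest stop-free run of codons in any of the three frames; the thread does not show how MAKER or Trinity actually score ORFs, so the function names and the scoring rule are assumptions for illustration only.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_open_frame(seq):
    """Length, in codons, of the longest stop-free run over the three forward frames."""
    seq = seq.upper()
    best = 0
    for frame in range(3):
        run = 0
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                run = 0
            else:
                run += 1
                best = max(best, run)
    return best


def guess_strand(seq):
    """Return '+', '-' or '.' depending on which strand has the longer open frame."""
    forward = longest_open_frame(seq)
    reverse = longest_open_frame(reverse_complement(seq))
    if forward == reverse:
        return "."
    return "+" if forward > reverse else "-"


if __name__ == "__main__":
    # Toy sequence: open in every forward frame, stops in every reverse frame.
    toy = "TTAC" * 9
    print(guess_strand(toy), guess_strand(reverse_complement(toy)))
```

Because a single inserted or deleted base shifts the frame, a one-base difference between a transcript and the genome can change which strand wins under a rule like this, which is consistent with the transcript and the genome disagreeing in the example above.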
From xvazquezc at gmail.com Sat Mar 21 21:27:27 2015
From: xvazquezc at gmail.com (Xabier Vázquez Campos)
Date: Sun, 22 Mar 2015 14:27:27 +1100
Subject: [maker-devel] annotation stats: repeats
Message-ID:

Hi all,

I was wondering how I can get data about the repeat content of the genome from MAKER, if possible, as well as for each type of repeat: RE, transposons, simple repeats, low-complexity repeats.

Thank you in advance,
Xabier

--
Xabier Vázquez Campos
*PhD Candidate*
Water Research Centre
School of Civil and Environmental Engineering
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

From dence at genetics.utah.edu Sat Mar 21 23:56:06 2015
From: dence at genetics.utah.edu (Daniel Ence)
Date: Sun, 22 Mar 2015 05:56:06 +0000
Subject: [maker-devel] annotation stats: repeats
In-Reply-To:
References:
Message-ID: <4FFB960C-CF1A-440F-B767-36EC369D2B58@genetics.utah.edu>

Hi Xabier,

MAKER offers several options for repeat masking. There's a file of repetitive element proteins that is run with RepeatRunner, and you can run RepeatMasker with a custom library and/or one of the default RepeatMasker libraries. The results from these programs are in the GFF3 files that MAKER generates. Depending on the library that you give RepeatMasker, the repeats might be classified by transposable element family and as simple vs. complex repeats. These are probably a good place to start, but you'll have to extract that data from the GFF3 file and compile it. Let us know whether that helps.
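The "extract and compile" step could look roughly like the Python sketch below. It is an assumption-laden illustration, not a MAKER utility: it assumes the repeat features carry `repeatmasker` or `repeatrunner` in the GFF3 source column, that top-level hits have type `match`, and that the repeat class can be read from the `Name` attribute. Check a few lines of your own GFF3 before relying on any of that. Overlapping hits are not merged, so the totals are an upper bound on masked bases.

```python
# Hypothetical helper: tally repeat bases per (source, Name) from GFF3 lines.
from collections import defaultdict

# Assumed source labels; adjust after inspecting column 2 of your own file.
REPEAT_SOURCES = {"repeatmasker", "repeatrunner"}


def tally_repeats(gff_lines, sources=REPEAT_SOURCES):
    """Sum feature lengths of top-level repeat hits, keyed by (source, Name)."""
    totals = defaultdict(int)
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break                      # sequence section, no more features
        if not line.strip() or line.startswith("#"):
            continue                   # blank lines and comments/directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue                   # not a well-formed feature line
        source, ftype, start, end = cols[1], cols[2], cols[3], cols[4]
        if source not in sources or ftype != "match":
            continue                   # skip match_part children and non-repeats
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        name = attrs.get("Name", "unknown")
        totals[(source, name)] += int(end) - int(start) + 1
    return dict(totals)


if __name__ == "__main__":
    example = [
        "##gff-version 3",
        "chr1\trepeatmasker\tmatch\t1\t100\t.\t+\t.\tID=r1;Name=Simple_repeat",
        "chr1\trepeatmasker\tmatch_part\t1\t100\t.\t+\t.\tID=r1p;Parent=r1",
        "chr1\tmaker\tgene\t500\t900\t.\t+\t.\tID=g1",
    ]
    for key, bases in sorted(tally_repeats(example).items()):
        print(key, bases)
```

For a real run you would pass an open file handle for the merged GFF3 instead of the in-memory list; dividing each total by the assembly length gives the fraction per repeat class.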
Thanks,
Daniel

From panos.ioannidis at gmail.com Tue Mar 24 02:29:14 2015
From: panos.ioannidis at gmail.com (Panos Ioannidis)
Date: Tue, 24 Mar 2015 09:29:14 +0100
Subject: [maker-devel] Augustus retraining
Message-ID:

Hello All,

I'm trying to retrain Augustus using EST data from the same species and realized that quite a few of the gene models I get based on EST data are incomplete (i.e. no start and/or stop codon).

Now, when I get to the "etraining" step in Augustus retraining (right after the time-consuming "optimize_augustus.pl" step), I get a warning for each gene that doesn't contain a start or stop codon.

.....
gene maker-scaffold4|size2210279-exonerate_est2genome-gene-20.1-mRNA-1 transcr. 1 in sequence scaffold4|size2210279_2021791-2044735: Initial exon does not begin with start codon but with acg
gene maker-scaffold4|size2210279-exonerate_est2genome-gene-20.2-mRNA-1 transcr. 1 in sequence scaffold4|size2210279_2045713-2064983: Terminal exon doesn't end in stop codon. Variable stopCodonExcludedFromCDS set right?
....

Does anyone know whether training is compromised by such incomplete gene models? Do you usually exclude them from the training set?

Oh, and by the way, the best guide to retraining Augustus is here .
The official web page isn't bad, but doesn't explain certain things in detail.

Thanks,
Panos

From xvazquezc at gmail.com Tue Mar 24 06:06:25 2015
From: xvazquezc at gmail.com (Xabier Vázquez Campos)
Date: Tue, 24 Mar 2015 23:06:25 +1100
Subject: [maker-devel] Augustus retraining
In-Reply-To:
References:
Message-ID:

Hi Panos,

Have you tried using WebAugustus for the (re)training? I found it very convenient for generating the models for Augustus.

Cheers,

--
Xabier Vázquez Campos
*PhD Candidate*
Water Research Centre
School of Civil and Environmental Engineering
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

From panos.ioannidis at gmail.com Tue Mar 24 06:24:45 2015
From: panos.ioannidis at gmail.com (Panos Ioannidis)
Date: Tue, 24 Mar 2015 13:24:45 +0100
Subject: [maker-devel] Augustus retraining
In-Reply-To:
References:
Message-ID:

Hi Xabier,

Thanks for your quick reply!

No, I haven't used WebAugustus, but I just checked it out and it looks like my training set is too big (~300 Mbp), so I can't even upload it!

Anyway, I prefer to train it locally because I have better control over each step. Also, I have done the entire training procedure with fewer genes, but didn't get a good gene-level sensitivity (~5%). So now I'm trying to replicate it using more of my scaffolds, but it appears I get a lot more incomplete models from exonerate (run through MAKER).

P

From carsonhh at gmail.com Tue Mar 24 08:14:51 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 24 Mar 2015 08:14:51 -0600
Subject: [maker-devel] Augustus retraining
In-Reply-To:
References:
Message-ID: <09C2F8D0-7014-45DD-B067-86ABB37F4A68@gmail.com>

Hi Panos,

ESTs and mRNA-seq assemblies will by their nature be partial. After a first round of training you can run MAKER together with protein and EST evidence and the newly trained Augustus species file. Because MAKER gives hints to Augustus as it runs, the models it produces will be improved over what it would get from just running Augustus on its own. Then take these gene models and use them to retrain Augustus.
This is the standard bootstrap retraining procedure, and can be repeated as needed.

More info on bootstrap training here (info is for SNAP but the procedure is similar for Augustus) --> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Training_ab_initio_Gene_Predictors
Here is an excellent explanation of Augustus training --> http://brie4.cshl.edu/pipermail/gmod-help/2012-June/001724.html
and here are tools to convert SNAP training files to Augustus training files (MAKER comes with a tool that converts GFF3 for SNAP training so just take that and convert it for Augustus) --> https://github.com/hyphaltip/genome-scripts/blob/master/gene_prediction/zff2augustus_gbk.pl

Finally you can also manually edit the GFF3 file in Apollo (easier to use the legacy standalone version), and then convert that file for bootstrap training.

--Carson
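Before deciding whether partial models matter for a given training round, it can help to simply count them. The Python sketch below is a hypothetical helper, not part of Augustus or MAKER: it classifies coding sequences by whether they begin with ATG and end with a stop codon. It assumes you have already extracted each model's CDS as a plain string and that the stop codon is included in the CDS; if your files exclude it (compare the stopCodonExcludedFromCDS hint in the warning above), every model would be reported as lacking a stop.

```python
# Hypothetical helper: count complete vs. partial CDS in a training set.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def cds_completeness(cds):
    """Classify one CDS string by presence of start and stop codons."""
    cds = cds.upper()
    has_start = cds.startswith("ATG")
    has_stop = len(cds) >= 3 and cds[-3:] in STOP_CODONS
    if has_start and has_stop:
        return "complete"
    if has_stop:
        return "missing_start"
    if has_start:
        return "missing_stop"
    return "missing_both"


def summarize(cds_by_id):
    """Count how many models fall into each completeness class."""
    counts = {}
    for seq in cds_by_id.values():
        label = cds_completeness(seq)
        counts[label] = counts.get(label, 0) + 1
    return counts


if __name__ == "__main__":
    # Toy sequences with made-up identifiers.
    models = {
        "gene-20.1": "acgAAATAA",   # starts with acg, like the warning above
        "gene-20.2": "ATGAAAGGG",   # no terminal stop codon
        "gene-20.3": "ATGAAATAA",   # complete
    }
    print(summarize(models))
```

The counts give a quick sense of how much of a first-round training set is partial and how that changes after a bootstrap round.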
From panos.ioannidis at gmail.com Tue Mar 24 08:31:04 2015
From: panos.ioannidis at gmail.com (Panos Ioannidis)
Date: Tue, 24 Mar 2015 15:31:04 +0100
Subject: [maker-devel] Augustus retraining
In-Reply-To: <09C2F8D0-7014-45DD-B067-86ABB37F4A68@gmail.com>
References: <09C2F8D0-7014-45DD-B067-86ABB37F4A68@gmail.com>
Message-ID:

Hi Carson,

So you think it's okay to include incomplete gene models when training Augustus?

I'll certainly try the bootstrap method you're suggesting. Even though I did it for SNAP, for some weird reason I forgot it for Augustus :p Do you think, however, that I can get a big improvement in gene-level sensitivity? Currently, I have only 6%...

Thanks,
Panos

From carsonhh at gmail.com Tue Mar 24 08:39:20 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 24 Mar 2015 08:39:20 -0600
Subject: [maker-devel] Augustus retraining
In-Reply-To:
References: <09C2F8D0-7014-45DD-B067-86ABB37F4A68@gmail.com>
Message-ID:

On your first round it is fine.
It gives the predictor enough to work with, then on the second round you use improved models. When you say 6% sensitivity, is that Augustus running on its own? If it's inside of MAKER, that means you are not providing sufficient protein evidence (you need the full proteome of at least two related species). Also, is that the gene-level, exon-level, or nucleotide-level sensitivity? If you are looking at the gene-level sensitivity measure, you only get a match when you perfectly match all transcripts in a gene (models that may not be correct in the first place). This value will rarely go above 10% for any predictor. You need to use the nucleotide-level sensitivity/specificity metrics. The gene- and exon-level metrics are basically meaningless (unless it's Drosophila, which is the only species annotated correctly enough to use them).

--Carson

From panos.ioannidis at gmail.com Tue Mar 24 09:05:54 2015
From: panos.ioannidis at gmail.com (Panos Ioannidis)
Date: Tue, 24 Mar 2015 16:05:54 +0100
Subject: [maker-devel] Augustus retraining
In-Reply-To:
References: <09C2F8D0-7014-45DD-B067-86ABB37F4A68@gmail.com>
Message-ID:

Yes, 6% is gene-level sensitivity. Exon-level is 62% and nucleotide-level is 88%.
I only mentioned gene-level, because that's the only metric mentioned in the Augustus web site. I got these numbers outside of Maker. Actually, I only used Maker to generate the gff files needed to start the training (ran it using only EST evidence and only on a subset of my assembly, using this as a guide). Now, I've started running the second round of training, as you suggested. Since, however, I don't have data from closely related species, I'm only using Uniref50 as protein evidence. P On Tue, Mar 24, 2015 at 3:39 PM, Carson Holt wrote: > On your first round it is fine. It gives the predictor enough to work > with, then on the second round you use improved models. When you say 6% > sensitivity is that Augustus running on it?s own? If it?s inside of MAKER > that means you are not providing sufficient protein evidence (you need the > full proteome of at least two related species). Also is that the gene > level, exon level, or nucleotide level sensitivity. If you are looking at > the gene level sensitivity measure, you only get a match when you perfectly > match all transcripts in a gene (models that may not be correct in the > first place). This value will rarely go above 10% for any predictor. You > need to use the nucleotide level sensitivity/specificity metrics. The gene > and exon level metrics are basically meaningless (unless it?s Drosophila > which is the only species annotated correctly enough to use them). > > ?Carson > > > On Mar 24, 2015, at 8:31 AM, Panos Ioannidis > wrote: > > Hi Carson, > > So you think it's okay to include incomplete gene models when training > Augustus? > > I'll certainly try the bootstrap method you're suggesting. Even though I > did it for SNAP, for some weird reason I forgot it for Augustus :p Do you > think, however, that I can get a big improvement in gene-level sensitivity? > Currently, I have only 6%... 
> > Thanks, > Panos > > > On Tue, Mar 24, 2015 at 3:14 PM, Carson Holt wrote: > >> Hi Panos, >> >> EST?s and mRNA-seq assemblies will bey their nature be partial. After a >> first round of training you can run MAKER together with protein and EST >> evidence and the newly trained Augustus species file. Because MAKER gives >> hints to Augustus as it runs, the models it produces will be improved over >> what it would get from just running Augustus on it?s own. Then take these >> gene models and use them to retrain Augustus. This is the standard >> bootstrap retraining procedure, and can be repeated as needed. >> >> More info on bootstrap training here (info is for SNAP but procedure is >> similar to Augustus) ?> http://weatherby.genetics. >> utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_ >> Online_Training_2014#Training_ab_initio_Gene_Predictors >> Here is an excellent explanation of Augustus training ?> >> http://brie4.cshl.edu/pipermail/gmod-help/2012-June/001724.html >> and here are tools to convert SNAP training files to Augustus training >> files (MAKER comes with a tool that converts GFF3 for SNAP training so just >> take that and convert it for Augustus)?> https://github.com/ >> hyphaltip/genome-scripts/blob/master/gene_prediction/zff2augustus_gbk.pl >> >> Finally you can also manually edit the GFF3 file in Apollo (easier to use >> the legacy stand alone version), and then convert that file for bootstrap >> training. >> >> ?Carson >> >> >> On Mar 24, 2015, at 6:24 AM, Panos Ioannidis >> wrote: >> >> Hi Xabier, >> >> Thanks for your quick reply! >> >> No, I haven't used WebAugustus, but I just checked it out and it looks >> like my training set is too big (~300 Mbp), so I can't even upload it! >> >> Anyway, I prefer to train it locally because I have better control over >> each step. Also, I have done the entire training procedure with less genes, >> but didn't get a good gene-level sensitivity (~5%). 
So now I'm trying to >> replicate it using more of my scaffolds, but as it appears I get a lot more >> incomplete models from exonerate (run through Maker). >> >> P >> >> >> >> On Tue, Mar 24, 2015 at 1:06 PM, Xabier V?zquez Campos < >> xvazquezc at gmail.com> wrote: >> >>> Hi Panos, >>> >>> Have you tried using webAugustus for the (re)training? I found it very >>> convenient for generating the models for Augustus. >>> >>> Cheers, >>> >>> 2015-03-24 19:29 GMT+11:00 Panos Ioannidis : >>> >>>> Hello All, >>>> >>>> I'm trying to retrain Augustus using EST data from the same species and >>>> realized that quite a few of the gene models I get based on EST data are >>>> incomplete (i.e. no start and/or stop codon). >>>> >>>> Now, when I get to the "etraining" step in Augustus retraining (right >>>> after the time-consuming "optimize_augustus.pl" step), I get a warning >>>> for each gene that doesn't contain a start or stop codon. >>>> >>>> ..... >>>> gene maker-scaffold4|size2210279-exonerate_est2genome-gene-20.1-mRNA-1 >>>> transcr. 1 in sequence scaffold4|size2210279_2021791-2044735: Initial exon >>>> does not begin with start codon but with acg >>>> gene maker-scaffold4|size2210279-exonerate_est2genome-gene-20.2-mRNA-1 >>>> transcr. 1 in sequence scaffold4|size2210279_2045713-2064983: Terminal exon >>>> doesn't end in stop codon. Variable stopCodonExcludedFromCDS set right? >>>> .... >>>> >>>> Does anyone know whether training is compromised by such incomplete >>>> gene models? Do you usually exclude them from the training set? >>>> >>>> Oh, and by the way, the best guide to retraining Augustus is here >>>> . >>>> The official >>>> >>>> web page isn't bad, but doesn't explain in detail certain things. 
>>>> >>>> Thanks, >>>> Panos >>>> >>>> >>>> _______________________________________________ >>>> maker-devel mailing list >>>> maker-devel at box290.bluehost.com >>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org >>>> >>>> >>> >>> >>> -- >>> Xabier V?zquez Campos >>> *PhD Candidate* >>> Water Research Centre >>> School of Civil and Environmental Engineering >>> The University of New South Wales >>> Sydney NSW 2052 AUSTRALIA >>> >> >> _______________________________________________ >> maker-devel mailing list >> maker-devel at box290.bluehost.com >> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org >> >> >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Tue Mar 24 09:38:08 2015 From: carsonhh at gmail.com (Carson Holt) Date: Tue, 24 Mar 2015 09:38:08 -0600 Subject: [maker-devel] Augustus retraining In-Reply-To: References: <09C2F8D0-7014-45DD-B067-86ABB37F4A68@gmail.com> Message-ID: <4EA5A1F6-2950-4D65-A59C-0F3848C86C02@gmail.com> I?d pick a couple of species that are as closely related as you can find. Proteins will still align over large evolutionary distances, but you need a breadth of protein data that UniRef and even databases like Swiss-Prot won?t have (those databases are usually a little too conservative). The gene level sensitivity/specificity values make certain assumptions about the models you are comparing with. Unfortunately the only species these assumptions hold true for is Drosophila and maybe C. elegans to a point. This is the reason you will rarely see species other than those two used for the gene and exon sensitivity/specificity metrics. Thanks, Carson > On Mar 24, 2015, at 9:05 AM, Panos Ioannidis wrote: > > Yes, 6% is gene-level sensitivity. Exon-level is 62% and nucleotide-level is 88%. I only mentioned gene-level, because that's the only metric mentioned in the Augustus web site. > > I got these numbers outside of Maker. 
Actually, I only used Maker to generate the gff files needed to start the training (ran it using only EST evidence and only on a subset of my assembly, using this as a guide). > > Now, I've started running the second round of training, as you suggested. Since, however, I don't have data from closely related species, I'm only using Uniref50 as protein evidence. > > P > > On Tue, Mar 24, 2015 at 3:39 PM, Carson Holt > wrote: > On your first round it is fine. It gives the predictor enough to work with, then on the second round you use improved models. When you say 6% sensitivity is that Augustus running on it?s own? If it?s inside of MAKER that means you are not providing sufficient protein evidence (you need the full proteome of at least two related species). Also is that the gene level, exon level, or nucleotide level sensitivity. If you are looking at the gene level sensitivity measure, you only get a match when you perfectly match all transcripts in a gene (models that may not be correct in the first place). This value will rarely go above 10% for any predictor. You need to use the nucleotide level sensitivity/specificity metrics. The gene and exon level metrics are basically meaningless (unless it?s Drosophila which is the only species annotated correctly enough to use them). > > ?Carson > > >> On Mar 24, 2015, at 8:31 AM, Panos Ioannidis > wrote: >> >> Hi Carson, >> >> So you think it's okay to include incomplete gene models when training Augustus? >> >> I'll certainly try the bootstrap method you're suggesting. Even though I did it for SNAP, for some weird reason I forgot it for Augustus :p Do you think, however, that I can get a big improvement in gene-level sensitivity? Currently, I have only 6%... >> >> Thanks, >> Panos >> >> >> On Tue, Mar 24, 2015 at 3:14 PM, Carson Holt > wrote: >> Hi Panos, >> >> EST?s and mRNA-seq assemblies will bey their nature be partial. 
>> After a first round of training you can run MAKER together with protein and EST evidence and the newly trained Augustus species file. Because MAKER gives hints to Augustus as it runs, the models it produces will be improved over what it would get from just running Augustus on its own. Then take these gene models and use them to retrain Augustus. This is the standard bootstrap retraining procedure, and can be repeated as needed.
>>
>> More info on bootstrap training here (info is for SNAP but procedure is similar to Augustus) -> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Training_ab_initio_Gene_Predictors
>> Here is an excellent explanation of Augustus training -> http://brie4.cshl.edu/pipermail/gmod-help/2012-June/001724.html
>> and here are tools to convert SNAP training files to Augustus training files (MAKER comes with a tool that converts GFF3 for SNAP training so just take that and convert it for Augustus) -> https://github.com/hyphaltip/genome-scripts/blob/master/gene_prediction/zff2augustus_gbk.pl
>>
>> Finally you can also manually edit the GFF3 file in Apollo (easier to use the legacy stand alone version), and then convert that file for bootstrap training.
>>
>> --Carson
>>
>>> On Mar 24, 2015, at 6:24 AM, Panos Ioannidis wrote:
>>>
>>> Hi Xabier,
>>>
>>> Thanks for your quick reply!
>>>
>>> No, I haven't used WebAugustus, but I just checked it out and it looks like my training set is too big (~300 Mbp), so I can't even upload it!
>>>
>>> Anyway, I prefer to train it locally because I have better control over each step. Also, I have done the entire training procedure with fewer genes, but didn't get a good gene-level sensitivity (~5%). So now I'm trying to replicate it using more of my scaffolds, but as it appears I get a lot more incomplete models from exonerate (run through Maker).
>>>
>>> P
>>>
>>> On Tue, Mar 24, 2015 at 1:06 PM, Xabier Vázquez Campos wrote:
>>> Hi Panos,
>>>
>>> Have you tried using webAugustus for the (re)training? I found it very convenient for generating the models for Augustus.
>>>
>>> Cheers,
>>>
>>> 2015-03-24 19:29 GMT+11:00 Panos Ioannidis:
>>> Hello All,
>>>
>>> I'm trying to retrain Augustus using EST data from the same species and realized that quite a few of the gene models I get based on EST data are incomplete (i.e. no start and/or stop codon).
>>>
>>> Now, when I get to the "etraining" step in Augustus retraining (right after the time-consuming "optimize_augustus.pl" step), I get a warning for each gene that doesn't contain a start or stop codon.
>>>
>>> .....
>>> gene maker-scaffold4|size2210279-exonerate_est2genome-gene-20.1-mRNA-1 transcr. 1 in sequence scaffold4|size2210279_2021791-2044735: Initial exon does not begin with start codon but with acg
>>> gene maker-scaffold4|size2210279-exonerate_est2genome-gene-20.2-mRNA-1 transcr. 1 in sequence scaffold4|size2210279_2045713-2064983: Terminal exon doesn't end in stop codon. Variable stopCodonExcludedFromCDS set right?
>>> ....
>>>
>>> Does anyone know whether training is compromised by such incomplete gene models? Do you usually exclude them from the training set?
>>>
>>> Oh, and by the way, the best guide to retraining Augustus is here . The official web page isn't bad, but doesn't explain in detail certain things.
>>>
>>> Thanks,
>>> Panos
>>>
>>> --
>>> Xabier Vázquez Campos
>>> PhD Candidate
>>> Water Research Centre
>>> School of Civil and Environmental Engineering
>>> The University of New South Wales
>>> Sydney NSW 2052 AUSTRALIA

From alicebdennis at gmail.com Thu Mar 26 04:34:26 2015
From: alicebdennis at gmail.com (Alice Dennis)
Date: Thu, 26 Mar 2015 11:34:26 +0100
Subject: [maker-devel] iterative Maker2
In-Reply-To: <1c9847450c4142d3b7e2b81f893773f0@EE-HUB1.ee.emp-eaw.ch>
References: <1FD5809847938F44B92893606806BD53600D845F@EE-MBX1.ee.emp-eaw.ch> <1c9847450c4142d3b7e2b81f893773f0@EE-HUB1.ee.emp-eaw.ch>
Message-ID:

Hello again,

I posted a while ago about a genome I'm running through the Maker2 pipeline. I was concerned because my results were still changing with 3 and 4 iterations.

Following the very useful advice of Carson (below), I've made a few modifications (adding a RepeatModeler run, using a big protein database), but my gene predictions are still changing between the 3rd and 4th iterations. Perhaps this is ok, but these increasing gene lengths make me worry that I haven't built stable models.

Here is the short version of what I've done.

1. Run RepeatModeler, but this only produced 47 sequences in the resulting .fasta... so that seemed a bit small.

2. Run Maker2 using:
- RepeatModeler output + "model_org=all" and "softmask=1" in the Repeat Masking section.
- protein evidence from 2 distantly related species AND all of Uniprot
- ests from a different strain of my species (a parasitoid wasp)
- the .hmm from Nasonia, one of the 2 distantly related species whose proteome I also provided as protein evidence
- my assembled genome of 1,509 scaffolds.

3. After this, I did three subsequent rounds of Maker2 (cleverly named Rounds 2, 3 and 4). Each one used the same input, except the Nasonia .hmm was replaced by a SNAP generated .hmm from the previous round. Also, est2genome and protein2genome were changed from 1 to 0 in all runs after the first.

Here are some results:
Round1: 14,647 genes, average length 2,491
Round2: 12,158 genes, average length 3,760
Round3: 13,515 genes, average length 3,090
Round4: 12,169 genes, average length 3,918

This is a bit confusing because the number of genes predicted goes up and down, as do their lengths. I've doubly checked the dates of my files, and they are all labeled such that I don't think anything could be swapped.

So my questions are:
Is this an indication that my models are unstable and I shouldn't trust these predictions?
Is the decreasing number of genes, while also getting longer, perhaps a good thing?
How do I know when to stop if genes keep getting longer?

Thanks very much,
Alice

On Fri, Dec 12, 2014 at 4:41 PM, Carson Holt wrote:
> The gene models are actually produced by SNAP, Augustus, or whatever gene predictor you are using, so if you change the HMM every round, then the models will change too. But I have one concern. You are using a very sparse protein evidence dataset. The protein dataset is very important to MAKER's performance, and for iterative training of the ab initio predictors. Normally after the second iteration, additional training should not be beneficial, but if you are getting wildly different results on 3rd and 4th round, then you probably aren't getting sufficient good models to train with.
>
> For a protein dataset you should be using the entire proteome from a minimum of two related species and perhaps all of UniProt/Swiss-prot to get a broad protein database. Don't use the proteins extracted by CEGMA and HaMSTr. CEGMA can be used to guide the first HMM creation (cegma2zff script that comes with MAKER), but don't give the proteins to MAKER as evidence, also the HaMSTr results will be redundant with the ESTs. You need proteins from related species to look for homology not found in the EST dataset.
>
> Also repeat masking is important for any genome and has a huge effect on ab initio predictor performance. Make sure you run something like RepeatModeler to look for species specific repeats that will not already be in RepBase. Then add those results to the rmlib= option in the maker control files.
>
> Thanks,
> Carson
>
> On Dec 12, 2014, at 7:10 AM, Dennis, Alice wrote:
>
> Hi all,
>
> I am a relatively new user to Maker2, and I'm looking for advice on running many iterations of the same dataset in Maker2.
>
> I have a relatively small genome (~124 MB) from a wasp that is assembled into ~1,500 scaffolds. I have run several iterations of Maker2 by re-generating .hmms in SNAP and feeding them into the next round, and my gene predictions keep increasing (in number and in size). The only thing that changes at each round is the .hmm.
> The evidence that I give is:
> - de novo assembled ESTs from a different strain of the same species (70,000 contigs - I am currently working on improving this assembly with the hope that this will be helpful here)
> - 610 proteins extracted from the genome scaffolds using CEGMA and HaMSTr
>
> For my 1st iteration, I used the Nasonia .hmm from SNAP, and the est2genome/protein2genome option.
>
> For the 2nd, 3rd and 4th rounds I have used .hmms generated from the previous round, all without the est2genome/protein2genome option.
> All other files are the same as in the original run.
>
> As I understand it, after the second round, nothing should change in Maker2. But the differences are obvious between runs. Some entirely new exons are annotated. For example, just counting "exon" in the .gff file gives me 73,000 after the third iteration and 96,000 after the fourth! Actually the biggest leap in this number is between the third and fourth round. I can also see that many features are longer when I look at the files in Geneious.
>
> Is this sort of change possible after the second round of Maker2? Is there something I have done wrong in my runs, or am I understanding this output incorrectly?
>
> Thank you,
> Alice

--
Alice Dennis
alicebdennis at gmail.com

Postdoctoral Researcher
Institute for Integrative Biology, ETH Zürich & EAWAG
Überlandstrasse 133
P.O. Box 611
8600 Dübendorf, Switzerland

https://adennis5.wordpress.com/

From michael.s.campbell1 at gmail.com Thu Mar 26 09:50:41 2015
From: michael.s.campbell1 at gmail.com (Michael Campbell)
Date: Thu, 26 Mar 2015 09:50:41 -0600
Subject: [maker-devel] iterative Maker2
In-Reply-To:
References: <1FD5809847938F44B92893606806BD53600D845F@EE-MBX1.ee.emp-eaw.ch> <1c9847450c4142d3b7e2b81f893773f0@EE-HUB1.ee.emp-eaw.ch>
Message-ID:

Hi Alice,

In my experience, having fewer, longer genes is generally a good thing (and very normal), resulting from the merging of split models and extension of incomplete models. I find it helpful to load the annotations and evidence into a browser to get a visual idea of what is happening.

Mike

On Thu, Mar 26, 2015 at 4:34 AM, Alice Dennis wrote:
> Hello again,
>
> I posted a while ago about a genome I'm running through the Maker2 pipeline. I was concerned because my results were still changing with 3 and 4 iterations.
--
Michael Campbell MS, RD.
Doctoral Candidate
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330
ph:585-3543

From rens.holmer at wur.nl Mon Mar 30 00:12:20 2015
From: rens.holmer at wur.nl (Holmer, Rens)
Date: Mon, 30 Mar 2015 06:12:20 +0000
Subject: [maker-devel] Incorporating cufflinks in maker
Message-ID: <42E0168F-C672-4B7F-97D4-98442B825BF9@wur.nl>

Hi maker team,

I am currently working on a project where we want to incorporate quite a lot of RNA-seq into our annotation. Currently I see two options:

- Provide the cufflinks output as EST-gff
- Process cufflinks output with TransDecoder (find ORFs, annotate UTR, etc) and provide this as either pred_gff or model_gff

What would you suggest, and what would be the required formatting for both options?

Thanks in advance,
Rens Holmer

From goutham.atla at gmail.com Fri Mar 27 23:37:08 2015
From: goutham.atla at gmail.com (Goutham atla)
Date: Sat, 28 Mar 2015 11:07:08 +0530
Subject: [maker-devel] Annotating Cufflinks GTF with Maker
Message-ID:

Dear All,

I have a draft genome for the organism of my interest and I have around 150G of 100bp paired-end RNA-Seq data from different conditions. This organism has Ensembl annotations, but very few. My goal is to look at differential splicing analysis between two conditions. For this I need good annotations in gtf format at isoform level. I am interested in using the Splicing Analysis Kit.

For now, I have aligned one sample to the genome using tophat2 and then used cufflinks to generate a de-novo GTF file. In either case I have not used the available GTF with very few annotations.
The GTF file generated by cufflinks should be annotated to know the function of each transcript. So I am interested in adding annotations to the gtf file generated from cufflinks. What is the best way of doing it? Or is there any better way of getting a gtf file, like that of Ensembl, from my data?

I have looked at trinotate, but it's more about functional annotation and expression studies.

Regards,
--
Goutham Atla

From avhoeck at SCKCEN.BE Mon Mar 30 10:11:16 2015
From: avhoeck at SCKCEN.BE (Van Hoeck Arne)
Date: Mon, 30 Mar 2015 16:11:16 +0000
Subject: [maker-devel] comments on Incorporating cufflinks in maker
Message-ID: <9BCA01D5BDC2AF46822CA182B4FBD0DF11826E1A@MAILSRV3.sck.be>

Dear Rens and Carson,

I would like to comment on Rens' question on adding transcript data into your annotation. Since I have no access to google groups, I try to contact you both via mail with, hopefully, the correct mail addresses.

I have tested Maker-P with multiple parameters and optimized 2 gene prediction tools (SNAP and Augustus). Both give similar results, with which Maker finds around 17000 gene annotations. I also have RNAseq samples, and as a test case I also used Cufflinks output, processed with TransDecoder, to select gene annotations. By using this approach, I end up with 25000 gene annotations. More or less all the genes that Maker selected were also selected by the TransDecoder approach.

So is it wise to include the information on the 8000 missing genes, based on optimized gene models? There will probably be some false positives among these 8000 missing genes, but with good criteria and models, it may be possible for Maker to find more gene annotations.

Best regards
Arne

> Hi maker team,
>
> I am currently working on a project where we want to incorporate quite a lot of RNA-seq into our annotation.
> Currently I see two options:
>
> - Provide the cufflinks output as EST-gff
> - Process cufflinks output with TransDecoder (find ORFs, annotate UTR, etc) and provide this as either pred_gff or model_gff
>
> What would you suggest, and what would be the required formatting for both options?
>
> Thanks in advance,
> Rens Holmer

From kai.kamm at ecolevol.de Thu Mar 5 09:47:02 2015
From: kai.kamm at ecolevol.de (Kai Kamm)
Date: Thu, 05 Mar 2015 17:47:02 +0100
Subject: [maker-devel] Better resolve conflicting gene models
Message-ID: <54F88886.9010004@ecolevol.de>

Hello,

thanks for your previous advice. (Btw, how can one reply to an existing thread such that the reply will be added to the same thread?)

I am trying to find the best parameters with Maker for the annotation of my genome. I have run Maker with several combinations of parameters and predictors on my three biggest scaffolds and looked at the results in Jbrowse. Overall most predictions seem fine, but there are some genes with conflicts and I have no idea why.

I have:
- 100Mb assembled genome
- Trinity RNAseq assembly
- cufflinks data (in my case don't seem to be messy as suggested, rather a good complement to the trinity data)
- protein evidence (related and unrelated species)
- repeat library from repeat modeler

Gene predictors used:
- Augustus trained with transcripts from related species: seems to perform fine
- SNAP: no convergence with Augustus even after second training. Dropped it because it predicted lots of additional low quality transcripts and sometimes disrupted final Maker transcripts.
- Genemark: converged with Augustus after training (introns received from TopHat2 output).
Tends to predict some additional transcripts (compared to Augustus). Few (but some) of these are covered by evidence and thus become final Maker transcripts.

So the combination of Augustus and Genemark seems optimal. In general both perform well in Maker and tend to predict the same transcripts. However, I still observe some problems in the behavior of Maker which I don't understand:

Example 1: One of the predictors predicts a small additional exon at the start which is also covered by protein or EST data. But sometimes Maker chooses the other predictor's model for the final transcript. Mostly these are minor differences but I don't understand this behavior.

Example 2: There are some extreme cases like an Augustus prediction with 17 exons which are all covered by Trinity and cufflinks isoforms. Genemark instead predicts two separate small genes with 2 and 4 exons respectively. The resulting final transcript has 7 exons and the additional evidence from the trinity and cufflinks data is treated as UTR.

So I thought Augustus seems a little more accurate and ran Maker only with Augustus to resolve such conflicts, even though I would lose the few additional transcripts from Genemark. This is what happened:
- The gene in Example 2 now has all the 17 exons. This is good!
- Sadly another gene with several exons, which was formerly predicted by both Augustus and Genemark and is also covered by cufflinks and trinity transcripts, now consists only of two small exons in the final transcript. Even though Augustus still predicts the same exons and the same evidence is present - only the Genemark prediction is absent, which was almost identical to Augustus. This I completely don't understand.

I don't worry about the minor differences. The extreme cases are like two genes in a hundred and I don't understand the behavior. I was thinking that in case of conflicting models Maker will choose the one that best fits the evidence.
Obviously with most conflicts this is what happens, because the majority of the final models look OK. But not in the above mentioned cases, and I don't understand why. Is there any parameter I missed to better resolve such conflicts?

Best

From bmoore at genetics.utah.edu Thu Mar 5 17:20:52 2015
From: bmoore at genetics.utah.edu (Barry Moore)
Date: Fri, 6 Mar 2015 00:20:52 +0000
Subject: [maker-devel] Maker Software Question
In-Reply-To: <6ECE70F32BE356478EC060940C2AA4D101175BBF02@CVMMB02.cvm.tamu.edu>
References: <6ECE70F32BE356478EC060940C2AA4D101175BBF02@CVMMB02.cvm.tamu.edu>
Message-ID:

Hi Chris,

I don't know if others have responded to you already, but I just found this e-mail unanswered in the depths of my e-mail inbox, so apologies for no reply. I'm going to forward your e-mail along to the full maker mailing list so that you'll get the benefit of response from the full group of MAKER developers.

MAKER does include a couple of scripts that add functional annotation to MAKER GFF3 output. This process is described in the recent paper:

Campbell MS, Holt C, Moore B, Yandell M. Genome Annotation and Curation Using MAKER and MAKER-P. Curr Protoc Bioinformatics.
http://www.ncbi.nlm.nih.gov/pubmed/25501943

Mike, do you have a PDF of the final print version of that you could send directly to Christopher?

B

On Jan 16, 2015, at 8:38 AM, Seabury, Christopher wrote:

Dear Colleagues,

I would like to quickly ask about a specific routine/possible function in MAKER. Previously, we have essentially made home-made versions of maker by way of multi-step programming. At present we are exploring MAKER but are wondering if MAKER has the ability to populate the GFF with GENE/Protein ID information? As an initial experiment, we gave MAKER cDNA refseqs, plus corresponding protein refseqs, and a reference, but do not see the GENE/Protein ID in the GFF. Is there a subroutine for this, or an option we have missed?

Thanks and Kind Regards,

Christopher M.
Seabury PhD
Associate Professor
Department of Veterinary Pathobiology
College of Veterinary Medicine
Texas A&M University
College Station, TX 77843-4467
cseabury at cvm.tamu.edu
Mobile: 979-492-6400

From bmoore at genetics.utah.edu Mon Mar 9 12:12:10 2015
From: bmoore at genetics.utah.edu (Barry Moore)
Date: Mon, 9 Mar 2015 18:12:10 +0000
Subject: [maker-devel] Does the maker google forum works? -[Doubt] maker2zff line 109
In-Reply-To: <20150309190735.Horde.OZGtWxZkHnkFxArh6ONp6A2@webmail.um.es>
References: <20150309190735.Horde.OZGtWxZkHnkFxArh6ONp6A2@webmail.um.es>
Message-ID: <13826169-DFBB-4E32-AFB1-A01CB3EEE0A4@genetics.utah.edu>

Hi Javier,

The Google forum is only a mirror of the actual mailing list that allows it to be archived by Google (better Google search results for MAKER users), so posting is not allowed there. Please join the official MAKER mailing list at:

http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

Thanks,

B

On Mar 9, 2015, at 12:07 PM, FRANCISCO JAVIER SANCHEZ GARCIA wrote:

Good morning.

I am having a problem writing messages in the maker help forum in google groups. I don't know if it is my problem or, on the contrary, a problem with the permissions, but I can't see the red button for new threads.

Anyway, I will try to show my problem with maker2zff, which does not work. My version is v2.31.8.

../maker/marker_v2.31.8/maker/bin/maker2zff ../sequences.all.gff everything.ann everything.dna
[sudo] password for soba:
No such file or directory at ../maker/marker_v2.31.8/maker/bin/maker2zff line 109, line 1922870.

I read something about problematic characters in the ID, but I don't know if that applies to my case.
http://gmod.827538.n3.nabble.com/Re-Error-when-running-maker2zff-script-td4041743.html

Regards. Thanks in advance.
Fco Javier Sánchez-García
PhD student Forest Entomology
Área de Biología Animal
Departamento de Zoología y Antropología Física
Facultad de Veterinaria
Universidad de Murcia
Campus de Espinardo
30100 Murcia (España-Spain)
Telf. +34 660 500 416 (mobile phone)
+34 868 888 031 (laboratory-work phone)
http://scholar.google.es/citations?user=AKoUT8cAAAAJ&hl
http://www.researchgate.net/profile/Francisco_Sanchez-Garcia
http://orcid.org/0000-0002-5442-0292
http://www.researcherid.com/rid/M-2407-2014

From javiersg at um.es Mon Mar 9 16:27:00 2015
From: javiersg at um.es (FRANCISCO JAVIER SANCHEZ GARCIA)
Date: Mon, 09 Mar 2015 23:27:00 +0100
Subject: [maker-devel] [Doubt] maker2zff line 109
Message-ID: <20150309232700.Horde.39Aj_iELfhGei1Bv5JG1fw6@webmail.um.es>

Good night everyone.

I will try to show my problem with maker2zff, which does not work. My version is v2.31.8. The last line of the gff file is the line that the error message refers to when it says that it can't find the file or directory.

../maker/marker_v2.31.8/maker/bin/maker2zff
../sequences.all.gff everything.ann everything.dna
[sudo] password for soba:
No such file or directory at ../maker/marker_v2.31.8/maker/bin/maker2zff line 109, line 1922870.

I read something about problematic characters in the ID, but I don't know if that applies to my case.
http://gmod.827538.n3.nabble.com/Re-Error-when-running-maker2zff-script-td4041743.html

Regards. Thanks in advance.

Fco Javier Sánchez-García
PhD student Forest Entomology
Área de Biología Animal
Departamento de Zoología y Antropología Física
Facultad de Veterinaria
Universidad de Murcia
Campus de Espinardo
30100 Murcia (España-Spain)
Telf. +34 660 500 416 (mobile phone)
+34 868 888 031 (laboratory-work phone)
http://scholar.google.es/citations?user=AKoUT8cAAAAJ&hl
http://www.researchgate.net/profile/Francisco_Sanchez-Garcia
http://orcid.org/0000-0002-5442-0292
http://www.researcherid.com/rid/M-2407-2014

From carsonhh at gmail.com Thu Mar 12 13:50:44 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 12 Mar 2015 13:50:44 -0600
Subject: [maker-devel] Better resolve conflicting gene models
In-Reply-To: <54F88886.9010004@ecolevol.de>
References: <54F88886.9010004@ecolevol.de>
Message-ID: <7B8D93E6-DEE7-43D7-BFBF-38E474311A6C@gmail.com>

Sorry for the slow reply.

> how can one reply to an existing thread such that the reply will be added to the same thread?

Threads show up under the Google archive, but that is really just an artifact of how things are archived. The mailing list itself doesn't have threads, just a subject line like any other e-mail. But because we are archiving to a message board, any messages with the same subject get turned into parts of the same thread.

> Example 1: One of the predictors predicts a small additional exon at the start which is also covered by protein or EST data. But sometimes Maker chooses the other predictor's model for the final transcript.
> Mostly these are minor differences but I don't understand this behavior?

The gene chosen by MAKER is the one that best matches the evidence. This is measured by a numeric value called AED (lower means a better match). If a model predicts an exon or a base pair that is not supported by an evidence alignment, then it will be penalized. If a model fails to predict a base pair that is supported by evidence, then it will also be penalized. The differences in your results are because you changed something between runs: either you added or removed a predictor (which gives MAKER a different set of models to choose from) or the evidence changed (which gives different AED scores). Whichever model has the lowest AED score will always be kept, so some change you made resulted either in a different set of available models to choose from or in a different AED score. Also be aware that models cannot overlap on the same strand, so if you made a change that results in a model being removed from consideration, you may have removed overlap that was precluding another model with a slightly higher AED from being chosen.

> Example 2: there are some extreme cases like an Augustus prediction with 17 exons which are all covered by Trinity and cufflinks isoforms. Genemark instead predicts two separate small genes with 2 and 4 exons respectively. The resulting final transcript has 7 exons and the additional evidence from the trinity and cufflinks data is treated as UTR.
>
> - Sadly another gene with several exons, which was formerly predicted by both Augustus and Genemark and is also covered by cufflinks and trinity transcripts, now consists only of two small exons in the final transcript.
> Even though Augustus still predicts the same exons and the same evidence is present - only the Genemark prediction is absent which was almost identical to Augustus. This I completely don't understand.

The model chosen will always be the one with the lowest AED.
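To make the penalties described here concrete, the following is a minimal Python sketch of a base-pair AED calculation. It assumes the commonly cited definition AED = 1 - (SN + SP)/2, where SN is the fraction of evidence bases covered by the model and SP is the fraction of model bases covered by evidence. The function names and coordinates are made up for illustration, and this is a simplification rather than MAKER's actual code.

```python
def covered_bases(intervals):
    """Return the set of 1-based positions covered by (start, end) intervals."""
    bases = set()
    for start, end in intervals:
        bases.update(range(start, end + 1))
    return bases


def aed(model_exons, evidence_alignments):
    """Simplified base-pair AED: 0 is perfect agreement, 1 is no agreement."""
    model = covered_bases(model_exons)
    evidence = covered_bases(evidence_alignments)
    if not model or not evidence:
        return 1.0
    shared = len(model & evidence)
    sensitivity = shared / len(evidence)  # evidence bases the model recovers
    specificity = shared / len(model)     # model bases backed by evidence
    return 1.0 - (sensitivity + specificity) / 2.0


# Hypothetical evidence covering two exons of 100 bp each.
evidence = [(100, 199), (300, 399)]

exact_model = [(100, 199), (300, 399)]               # matches the evidence
missing_exon = [(300, 399)]                          # fails to predict an exon
extra_exon = [(100, 199), (300, 399), (500, 599)]    # adds an unsupported exon

for name, model in [("exact", exact_model),
                    ("missing exon", missing_exon),
                    ("extra exon", extra_exon)]:
    print(name, round(aed(model, evidence), 3))
```

Both the model that misses a supported exon and the model that adds an unsupported exon score worse than the exact match, which is why a small exon at the start can tip the choice either way depending on which evidence overlaps it.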
The changes you describe are all based off of a model with lower AED supplanting another, which may have opened up a region that was previously overlapped by a longer gene with a poor AED score. I would also recommend not including cufflinks output if you have trinity data. Cufflinks results often falsely bridge regions and will cause gene mergers by falsely extending evidence clusters. This makes it appear as if genes should extend into regions where they shouldn't. Given those hints, predictors will then try to predict at least something in those regions, and if they do, the models will get a good AED score since they are supported by evidence, even if it is false evidence.

--Carson

From carsonhh at gmail.com Thu Mar 12 14:03:11 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 12 Mar 2015 14:03:11 -0600
Subject: [maker-devel] maker-devel post from avhoeck@sckcen.be requires approval
In-Reply-To:
References:
Message-ID: <73D91049-1CB8-4C59-A9E5-58374FCB0789@gmail.com>

Hi Arne, The genes found by CEGMA are very short genes, so while those genes may be at least partially identifiable on shorter contigs, other genes will be far longer. So the 87% value you get from CEGMA is probably what you are going to be able to find (even partially) for the entire genome. Remember that gene size, when you include the introns, can actually be quite large, and gene predictors need flanking signals from the sequence several hundred bp upstream and downstream to make a prediction, so having a gene only partially contained by a contig makes that gene un-annotatable for ab initio predictors. Unless the organism has a high gene density or very few introns, you usually won't be able to find a gene on contigs smaller than ~10 kb.
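For anyone who wants to see how much of an assembly falls under a length cutoff like the ~10 kb mentioned here, a small Python sketch follows. It only reads FASTA text and adds up record lengths; the function names, the demo records, and the tiny cutoff used in the demo are placeholders, not part of MAKER.

```python
import io


def read_fasta_lengths(handle):
    """Yield (name, length) for each record in an open FASTA handle."""
    name, length = None, 0
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, length
            parts = line[1:].split()
            name, length = (parts[0] if parts else ""), 0
        elif line and name is not None:
            length += len(line)
    if name is not None:
        yield name, length


def summarize_by_length(handle, min_len=10_000):
    """Return (bases kept, bases dropped, fraction kept) for a length cutoff."""
    kept = dropped = 0
    for _, length in read_fasta_lengths(handle):
        if length >= min_len:
            kept += length
        else:
            dropped += length
    total = kept + dropped
    return kept, dropped, (kept / total if total else 0.0)


# Toy input: one 6 bp record and one 10 bp record, with a 10 bp cutoff.
demo = io.StringIO(">a\nACGT\nAC\n>b some description\nACGTACGTAC\n")
print(summarize_by_length(demo, min_len=10))  # (10, 6, 0.625)
```

On a real assembly you would pass an open file instead of the `io.StringIO` demo and leave `min_len` at 10000 or whatever cutoff you are considering.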
--Carson
From avhoeck at SCKCEN.BE Thu Mar 12 10:38:42 2015
From: avhoeck at SCKCEN.BE (Van Hoeck Arne)
Date: Thu, 12 Mar 2015 16:38:42 +0000
Subject: [maker-devel] TACC lonestar and N50 value
Message-ID: <9BCA01D5BDC2AF46822CA182B4FBD0DF117AB267@MAILSRV3.sck.be>

Dear MAKER developer,

We have a plant genome of about 450 Mbp with an N50 value of 20 kbp, of which only about 3/4 (333 Mbp) is in contigs longer than 10 kbp. CEGMA reported that 87% of the genes were found, whereas 94% were partially identified. You said last time that contigs smaller than 10 kbp are not ideal for annotation and that it is preferable to discard them. Does this mean that I lose all genes present in the small contigs? Or is there another way to annotate them? (Is concatenating all the small contigs together, with 500 N's between each contig, an option?)

Besides, I could successfully run MAKER via iPlant's Atmosphere. However, for my large genomes I registered at the TACC Lonestar cluster, but Dave C. replied that I won't be able to run on the TACC supercomputers without an allocation. He said that I need to contact my PI. With my login ID, I have no access to the cluster via ssh since my permission was denied. Therefore, is it possible to use the TACC supercomputers to run MAKER?

Best regards
Arne

From mtollis at asu.edu Fri Mar 13 14:50:33 2015
From: mtollis at asu.edu (Marc Tollis)
Date: Fri, 13 Mar 2015 13:50:33 -0700
Subject: [maker-devel] Question about pre-masked genome.
Message-ID:

Hello, I have a genome that has been well-annotated for species-specific repeats (using RepeatModeler, trf, and RepeatMasker), and was wondering if I could simply use the soft-masked assembly for maker annotation, in order to skip the sometimes cumbersome repeatmasking step. Is this generally looked upon as doable, or should I just stick with using the unmasked assembly and the repeatmasking as bundled with my maker installation? P.S. - as a spoiler, I already tried this (for snap training), but could not confirm that the masked assembly (i.e., lowercase letters) was maintained in the scaffolds as viewed in the maker output directory. Thanks, Marc
From dence at genetics.utah.edu Fri Mar 13 18:14:52 2015
From: dence at genetics.utah.edu (Daniel Ence)
Date: Sat, 14 Mar 2015 00:14:52 +0000
Subject: [maker-devel] Question about pre-masked genome.
In-Reply-To:
References:
Message-ID: <51DE3D97-10DB-4FF9-A793-EEA051C705BD@genetics.utah.edu>

Hi Marc, Several groups have been doing what you described with fungal genomes for a while and it seems to be working for them. With the soft-masked genome, MAKER will still allow EST alignments to extend into the masked regions; if it were a hard-masked genome, this wouldn't be possible. Let us know how it works out though! Thanks, Daniel

_______________________________________________
maker-devel mailing list
maker-devel at box290.bluehost.com
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
From mtollis at asu.edu Sun Mar 15 08:19:37 2015
From: mtollis at asu.edu (Marc Tollis)
Date: Sun, 15 Mar 2015 07:19:37 -0700
Subject: [maker-devel] control file for SNAP training
Message-ID:

This is a question about process, and to make sure I am doing things right (when time is of the essence, some mistakes can set you back weeks).

I have run maker on my de novo vertebrate genome, using only the predicted proteome from a congener (well-studied and available on Ensembl), and generated the HMM for the first round of SNAP training. As per the 2014 tutorial, I edited the control file for this step as follows: I added the path to the .hmm file, and set protein2genome to 0.

When I run maker, I notice that in addition to snap, it is still running blastx and exonerate. I realized that this is because I did not remove (or "comment out") the path to the protein.fa in the control file (the output looks markedly different when I do comment out the protein file, and I can't even tell if it's running snap in this instance).

Is it simply using exonerate to place the ab initio predictions on the scaffolds (meaning that protein2genome=1 is what tells maker to make evidence-based annotations)? Did I do this correctly, or should I also remove the protein.fa from the control file for SNAP training?

--
Marc Tollis, Ph.D. Post-Doctoral Research Associate Arizona State University LSE 313 (480) 965-7456 marc.tollis at asu.edu website: https://sites.google.com/site/tollisresearch/ blog: anolistollis.wordpress.com
From steinj at cshl.edu Mon Mar 16 07:29:36 2015
From: steinj at cshl.edu (Stein, Joshua)
Date: Mon, 16 Mar 2015 13:29:36 +0000
Subject: [maker-devel] TACC lonestar and N50 value
In-Reply-To: <9BCA01D5BDC2AF46822CA182B4FBD0DF117AB267@MAILSRV3.sck.be>
References: <9BCA01D5BDC2AF46822CA182B4FBD0DF117AB267@MAILSRV3.sck.be>
Message-ID: <0A0DBB88-EC95-43B7-AA75-B7858B67A8F4@cshl.edu>

Hi Arne, I have experience with iPlant resources and with MAKER-P. I would encourage you to try the Atmosphere image (7888b8e1-c006-4794-82d9-4c940ddbf4c6). You can request a large instance (up to 16 CPUs and 128 GB memory) and run in MPI mode to distribute the work. Please see this tutorial, which includes information on running in MPI mode: https://pods.iplantcollaborative.org/wiki/display/sciplant/MAKER-P+Atmosphere+Tutorial

You can also access the TACC Lonestar installation using the iPlant Discovery Environment. There is an app called "MAKER-P-Lonestar-Small-Genomes 2.3". Although it is advertised as appropriate for "small" genomes, I think there is a good chance that it will work for 450 Mb. This is a new resource and the iPlant team would value any feedback and benchmarks on how the system is working. Depending on how this goes, there are plans to roll out additional apps intended for larger genomes. Here is a tutorial: https://pods.iplantcollaborative.org/wiki/display/sciplant/Tutorial+for+running+MAKER-P+on+TACC-Lonestar+from+iPlant+Discovery+Environment

Regarding contig sizes, though not ideal, you can include contigs smaller than 10 kbp in your run. Plant genes tend to be more compact than vertebrate genes, so you ought to be able to recover annotations on the smaller contigs, though keep an eye out for truncated genes.

Best, Josh

Joshua Stein, PhD Manager, Sci. Informatics III Cold Spring Harbor Laboratory steinj at cshl.edu http://ware.cshl.org/

From mtollis at asu.edu Tue Mar 17 15:26:44 2015
From: mtollis at asu.edu (Marc Tollis)
Date: Tue, 17 Mar 2015 14:26:44 -0700
Subject: [maker-devel] control file for SNAP training
In-Reply-To:
References:
Message-ID:

I answered my own question: No need to re-align proteins again - takes too long. So, I used the gff file from the gff_merge on the log file from the first run (the one with just protein2genome). Then, after generating the .hmm file, I put it in my control file, along with protein2genome=0, removed the protein.fasta, set maker_gff and protein_pass=1.
The output now shows that only snap is running, and no blastx and exonerate - a relief because it is much faster!

--
Marc Tollis, Ph.D. Post-Doctoral Research Associate Arizona State University LSE 313 (480) 965-7456 marc.tollis at asu.edu website: https://sites.google.com/site/tollisresearch/ blog: anolistollis.wordpress.com
From carsonhh at gmail.com Tue Mar 17 20:47:50 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 17 Mar 2015 20:47:50 -0600
Subject: [maker-devel] control file for SNAP training
In-Reply-To:
References:
Message-ID:

You can also add additional protein files if you want more evidence than is already in the GFF3. This makes re-annotation and adding evidence to an already annotated genome really easy.

--Carson

From Brian.Mack at ARS.USDA.GOV Fri Mar 20 07:17:09 2015
From: Brian.Mack at ARS.USDA.GOV (Mack, Brian)
Date: Fri, 20 Mar 2015 13:17:09 +0000
Subject: [maker-devel] est2genome wrong strand
Message-ID:

Hi, I've noticed what seems to be a flipping of the strand of some of my transcripts from est2genome. I assembled my directional RNA-seq reads using Trinity. I've copied an example below. The blastn within maker shows the transcript aligning on the positive strand, as does blastn against my genome in SequenceServer. But the est2genome shows the strand to be negative. I've noticed this for quite a few transcripts while examining it in WebApollo. Any ideas what might be causing this?
Thanks, Brian

Query= comp17103_c1_seq3 len=612 path=[8488307:0-177 8492039:178-496
>contig_69
Length=108040

 Score = 1043 bits (1156), Expect = 0.0
 Identities = 589/592 (99%), Gaps = 3/592 (1%)
 Strand=Plus/Plus

Query  24      TCTTTATTCTTTTATTTCCACTTGAGCAATTATTTCCGGGTCAACCTATTCGGTCGTTCT  83
               ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  105546  TCTTTATTCTTTTATTTCCACTTGAGCAATTATTTCCGGGTCAACCTATTCGGTCGTTCT  105605

Query  84      CTCCGTTGAGCC-TTCCCTCCCCAAGTAATATTGGAAAGTCGTTCTCTCGCTCATAATTT  142
               |||||||||||| |||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  105606  CTCCGTTGAGCCCTTCCCTCCCCAAGTAATATTGGAAAGTCGTTCTCTCGCTCATAATTT  105665

69 blastn expressed_sequence_match 105546 106137 559 + . ID=69:hit:182380:3.2.0.0;Name=comp17103_c1_seq3
69 blastn match_part 105546 106137 559 + . ID=69:hsp:377369:3.2.0.0;Parent=69:hit:182380:3.2.0.0;Target=comp17103_c1_seq3 24 612 +;Gap=M70 D1 M85 D1 M75 D1 M359
69 est2genome expressed_sequence_match 105546 106137 2909 - . ID=69:hit:182644:3.2.0.0;Name=comp17103_c1_seq3
69 est2genome match_part 105546 106137 2909 - . ID=69:hsp:377775:3.2.0.0;Parent=69:hit:182644:3.2.0.0;Target=comp17103_c1_seq3 24 612 -;Gap=M70 D1 M85 D1 M75 D1 M359

From carsonhh at gmail.com Fri Mar 20 08:54:28 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 20 Mar 2015 08:54:28 -0600
Subject: [maker-devel] est2genome wrong strand
In-Reply-To:
References:
Message-ID:

Hi Brian, Multi-exon ESTs are stranded by their splice sites, which can only work on one strand. Single-exon ESTs, on the other hand, are stranded by their open reading frame. The standard chemistry used in most sequencing does not allow for strand-specific sequence. There are technologies that do, but unless you used one of those, re-stranding single-exon ESTs based on ORF is the best way to figure out where single-exon alignments should go (not a completely reliable method, but it works most of the time).
I believe Trinity tries to determine strand the same way (but it is unreliable there too). For example, since your alignment is not an exact match to the genomic sequence (single bp deletion in the alignment), the best open reading frame in the transcript is not the same as the best open reading frame in the genomic sequence (so one of them likely contains an error). MAKER, since it is annotating the genome, logically re-strands it to the genome, and Trinity (being unaware of the genome) strands it to the transcript. Because single-exon alignments are very unreliable, they are ignored in MAKER by default. They will not be used as hints for gene predictors and can only be used to support a gene if there is also protein evidence or a single-exon ab initio prediction at the same location to support it (even then this will only happen if you set single_exon=1 in the control files).

--Carson
From xvazquezc at gmail.com Sat Mar 21 21:27:27 2015
From: xvazquezc at gmail.com (Xabier Vázquez Campos)
Date: Sun, 22 Mar 2015 14:27:27 +1100
Subject: [maker-devel] annotation stats: repeats
Message-ID:

Hi all, I was wondering how I can get data about the repeat content of the genome from MAKER, if possible, as well as for each type of repeat: RE, transposons, simple repeats, low-complexity repeats. Thank you in advance, Xabier

--
Xabier Vázquez Campos PhD Candidate Water Research Centre School of Civil and Environmental Engineering The University of New South Wales Sydney NSW 2052 AUSTRALIA

From dence at genetics.utah.edu Sat Mar 21 23:56:06 2015
From: dence at genetics.utah.edu (Daniel Ence)
Date: Sun, 22 Mar 2015 05:56:06 +0000
Subject: [maker-devel] annotation stats: repeats
In-Reply-To:
References:
Message-ID: <4FFB960C-CF1A-440F-B767-36EC369D2B58@genetics.utah.edu>

Hi Xabier, MAKER offers several options for repeat masking. There's a file of repetitive element proteins that is run with RepeatRunner, and you can run RepeatMasker with a custom library and/or one of the default RepeatMasker libraries. The results from these programs are in the GFF3 files that MAKER generates. Depending on the library that you give RepeatMasker, the repeats might be classified by transposable element family and as simple vs. complex repeats. These are probably a good place to start, but you'll have to extract that data from the GFF3 file and compile it. Let us know whether that helps.
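As a starting point for the extract-and-compile step, here is a rough Python sketch that totals masked bases per repeat class from a GFF3 file. It rests on assumptions you should check against your own output: that repeat features carry `repeatmasker` or `repeatrunner` in the source column, that top-level hits have type `match`, and that the `Name` attribute may contain a `genus:` label such as `genus:Simple_repeat`. The function names and the demo lines are invented for illustration.

```python
import io
from collections import defaultdict

# Assumed source labels for repeat features; adjust to match your GFF3.
REPEAT_SOURCES = {"repeatmasker", "repeatrunner"}


def repeat_class(attributes):
    """Pick a class label from the Name attribute (assumed 'a|genus:X' style)."""
    for field in attributes.split(";"):
        if field.startswith("Name="):
            name = field[len("Name="):]
            for part in name.split("|"):
                if part.startswith("genus:"):
                    return part[len("genus:"):]
            return name
    return "unknown"


def merged_length(intervals):
    """Total bases covered by (start, end) intervals, counting overlaps once."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end + 1:
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


def repeat_summary(handle):
    """Return {repeat class: masked bases} from an open GFF3 handle."""
    by_class = defaultdict(lambda: defaultdict(list))
    for line in handle:
        if line.startswith("##FASTA"):
            break  # embedded sequence section, no more features
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[1] not in REPEAT_SOURCES or cols[2] != "match":
            continue  # skip non-repeat features and match_part children
        by_class[repeat_class(cols[8])][cols[0]].append((int(cols[3]), int(cols[4])))
    return {cls: sum(merged_length(iv) for iv in seqs.values())
            for cls, seqs in by_class.items()}


# Made-up demo lines, not real MAKER output.
demo = io.StringIO(
    "##gff-version 3\n"
    "c1\trepeatmasker\tmatch\t1\t100\t.\t+\t.\tID=r1;Name=species:AT_rich|genus:Low_complexity\n"
    "c1\trepeatmasker\tmatch\t51\t150\t.\t+\t.\tID=r2;Name=species:AT_rich|genus:Low_complexity\n"
    "c1\trepeatrunner\tmatch\t200\t299\t.\t+\t.\tID=r3;Name=some_te_protein\n"
    "c1\tmaker\tgene\t400\t900\t.\t+\t.\tID=g1\n"
)
print(repeat_summary(demo))
```

Dividing each total by the assembly size gives the percentage of the genome in that class. Overlaps are merged within a class only, so bases hit by two different classes are counted in both.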
Thanks, Daniel

From panos.ioannidis at gmail.com Tue Mar 24 02:29:14 2015
From: panos.ioannidis at gmail.com (Panos Ioannidis)
Date: Tue, 24 Mar 2015 09:29:14 +0100
Subject: [maker-devel] Augustus retraining
Message-ID:

Hello All,

I'm trying to retrain Augustus using EST data from the same species and realized that quite a few of the gene models I get based on EST data are incomplete (i.e. no start and/or stop codon).

Now, when I get to the "etraining" step in Augustus retraining (right after the time-consuming "optimize_augustus.pl" step), I get a warning for each gene that doesn't contain a start or stop codon.

.....
gene maker-scaffold4|size2210279-exonerate_est2genome-gene-20.1-mRNA-1 transcr. 1 in sequence scaffold4|size2210279_2021791-2044735: Initial exon does not begin with start codon but with acg
gene maker-scaffold4|size2210279-exonerate_est2genome-gene-20.2-mRNA-1 transcr. 1 in sequence scaffold4|size2210279_2045713-2064983: Terminal exon doesn't end in stop codon. Variable stopCodonExcludedFromCDS set right?
....

Does anyone know whether training is compromised by such incomplete gene models? Do you usually exclude them from the training set?

Oh, and by the way, the best guide to retraining Augustus is here .
The official web page isn't bad, but doesn't explain in detail certain things. Thanks, Panos -------------- next part -------------- An HTML attachment was scrubbed... URL: From xvazquezc at gmail.com Tue Mar 24 06:06:25 2015 From: xvazquezc at gmail.com (=?UTF-8?Q?Xabier_V=C3=A1zquez_Campos?=) Date: Tue, 24 Mar 2015 23:06:25 +1100 Subject: [maker-devel] Augustus retraining In-Reply-To: References: Message-ID: Hi Panos, Have you tried using webAugustus for the (re)training? I found it very convenient for generating the models for Augustus. Cheers, 2015-03-24 19:29 GMT+11:00 Panos Ioannidis : > Hello All, > > I'm trying to retrain Augustus using EST data from the same species and > realized that quite a few of the gene models I get based on EST data are > incomplete (i.e. no start and/or stop codon). > > Now, when I get to the "etraining" step in Augustus retraining (right > after the time-consuming "optimize_augustus.pl" step), I get a warning > for each gene that doesn't contain a start or stop codon. > > ..... > gene maker-scaffold4|size2210279-exonerate_est2genome-gene-20.1-mRNA-1 > transcr. 1 in sequence scaffold4|size2210279_2021791-2044735: Initial exon > does not begin with start codon but with acg > gene maker-scaffold4|size2210279-exonerate_est2genome-gene-20.2-mRNA-1 > transcr. 1 in sequence scaffold4|size2210279_2045713-2064983: Terminal exon > doesn't end in stop codon. Variable stopCodonExcludedFromCDS set right? > .... > > Does anyone know whether training is compromised by such incomplete gene > models? Do you usually exclude them from the training set? > > Oh, and by the way, the best guide to retraining Augustus is here > . > The official > web > page isn't bad, but doesn't explain in detail certain things. 
--
Xabier Vázquez Campos
PhD Candidate
Water Research Centre
School of Civil and Environmental Engineering
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

From panos.ioannidis at gmail.com Tue Mar 24 06:24:45 2015
From: panos.ioannidis at gmail.com (Panos Ioannidis)
Date: Tue, 24 Mar 2015 13:24:45 +0100
Subject: [maker-devel] Augustus retraining
In-Reply-To:
References:
Message-ID:

Hi Xabier,

Thanks for your quick reply!

No, I haven't used WebAugustus, but I just checked it out and it looks like my training set is too big (~300 Mbp), so I can't even upload it!

Anyway, I prefer to train it locally because I have better control over each step. Also, I have done the entire training procedure with fewer genes, but didn't get a good gene-level sensitivity (~5%). So now I'm trying to replicate it using more of my scaffolds, but it appears that I get a lot more incomplete models from exonerate (run through Maker).

P
From carsonhh at gmail.com Tue Mar 24 08:14:51 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 24 Mar 2015 08:14:51 -0600
Subject: [maker-devel] Augustus retraining
In-Reply-To:
References:
Message-ID: <09C2F8D0-7014-45DD-B067-86ABB37F4A68@gmail.com>

Hi Panos,

ESTs and mRNA-seq assemblies will by their nature be partial. After a first round of training you can run MAKER together with protein and EST evidence and the newly trained Augustus species file. Because MAKER gives hints to Augustus as it runs, the models it produces will be improved over what it would get from just running Augustus on its own. Then take these gene models and use them to retrain Augustus.
This is the standard bootstrap retraining procedure, and it can be repeated as needed.

More info on bootstrap training is here (the info is for SNAP, but the procedure is similar for Augustus):
http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Training_ab_initio_Gene_Predictors

Here is an excellent explanation of Augustus training:
http://brie4.cshl.edu/pipermail/gmod-help/2012-June/001724.html

And here are tools to convert SNAP training files to Augustus training files (MAKER comes with a tool that converts GFF3 for SNAP training, so just take that output and convert it for Augustus):
https://github.com/hyphaltip/genome-scripts/blob/master/gene_prediction/zff2augustus_gbk.pl

Finally, you can also manually edit the GFF3 file in Apollo (it is easier to use the legacy stand-alone version), and then convert that file for bootstrap training.

--Carson
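The etraining warnings quoted in this thread flag models whose coding sequence lacks a start or stop codon. The advice given here is that such partial models are acceptable for a first training round, but it can still be useful to count them. The following is only a minimal sketch, assuming the standard genetic code (ATG start; TAA, TAG, TGA stops) and a spliced CDS given on the coding strand. The function name `classify_cds` and the `stop_in_cds` flag are hypothetical and are not part of Augustus or MAKER.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)


def classify_cds(cds, stop_in_cds=True):
    """Classify a spliced CDS as complete or as missing start and/or stop.

    stop_in_cds is a hypothetical switch: set it to False when the stop
    codon is not part of the CDS, in which case only the start is checked.
    """
    seq = cds.upper()
    has_start = seq.startswith("ATG")
    # Without the stop codon in the sequence it cannot be verified here.
    has_stop = (seq[-3:] in STOP_CODONS) if stop_in_cds else True
    if has_start and has_stop:
        return "complete"
    if not has_start and not has_stop:
        return "no_start_no_stop"
    return "no_start" if not has_start else "no_stop"
```

Tallying the returned labels over all training genes (for example with `collections.Counter`) gives a quick picture of how much of the set is partial before deciding whether to filter it in later rounds.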
From panos.ioannidis at gmail.com Tue Mar 24 08:31:04 2015
From: panos.ioannidis at gmail.com (Panos Ioannidis)
Date: Tue, 24 Mar 2015 15:31:04 +0100
Subject: [maker-devel] Augustus retraining
In-Reply-To: <09C2F8D0-7014-45DD-B067-86ABB37F4A68@gmail.com>
References: <09C2F8D0-7014-45DD-B067-86ABB37F4A68@gmail.com>
Message-ID:

Hi Carson,

So you think it's okay to include incomplete gene models when training Augustus?

I'll certainly try the bootstrap method you're suggesting. Even though I did it for SNAP, for some weird reason I forgot it for Augustus :p Do you think, however, that I can get a big improvement in gene-level sensitivity? Currently, I have only 6%...

Thanks,
Panos
From carsonhh at gmail.com Tue Mar 24 08:39:20 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 24 Mar 2015 08:39:20 -0600
Subject: [maker-devel] Augustus retraining
In-Reply-To:
References: <09C2F8D0-7014-45DD-B067-86ABB37F4A68@gmail.com>
Message-ID:

On your first round it is fine.
It gives the predictor enough to work with; then on the second round you use improved models. When you say 6% sensitivity, is that Augustus running on its own? If it's inside of MAKER, that means you are not providing sufficient protein evidence (you need the full proteome of at least two related species). Also, is that the gene-level, exon-level, or nucleotide-level sensitivity? If you are looking at the gene-level sensitivity measure, you only get a match when you perfectly match all transcripts in a gene (models that may not be correct in the first place). This value will rarely go above 10% for any predictor. You need to use the nucleotide-level sensitivity/specificity metrics. The gene- and exon-level metrics are basically meaningless (unless it's Drosophila, which is the only species annotated correctly enough to use them).

--Carson
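The distinction drawn in this thread between gene-level, exon-level, and nucleotide-level accuracy can be illustrated with a toy calculation. The sketch below is not how Augustus computes its report; it assumes a single-transcript gene, inclusive 1-based exon coordinates, and the gene-finding convention in which "specificity" means TP / (TP + FP). All names and coordinates are made up for illustration.

```python
def nucleotides(exons):
    """Expand inclusive (start, end) exon intervals into a set of positions."""
    return {pos for start, end in exons for pos in range(start, end + 1)}


def sn_sp(true_items, predicted_items):
    """Return (sensitivity, specificity) for two sets of items.

    Sensitivity = TP / (TP + FN); specificity here = TP / (TP + FP).
    """
    tp = len(true_items & predicted_items)
    sensitivity = tp / len(true_items) if true_items else 0.0
    specificity = tp / len(predicted_items) if predicted_items else 0.0
    return sensitivity, specificity


# Hypothetical two-exon gene; the prediction ends its last exon 10 nt early.
reference = [(1, 100), (201, 300)]
predicted = [(1, 100), (201, 290)]

nuc = sn_sp(nucleotides(reference), nucleotides(predicted))
exon = sn_sp(set(reference), set(predicted))
gene = sn_sp({tuple(reference)}, {tuple(predicted)})

print(nuc, exon, gene)  # (0.95, 1.0) (0.5, 0.5) (0.0, 0.0)
```

A single misplaced exon boundary leaves nucleotide-level sensitivity at 95% but drops the gene-level score to zero, which is consistent with the observation that gene-level values stay low even for reasonable predictions.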
From panos.ioannidis at gmail.com Tue Mar 24 09:05:54 2015
From: panos.ioannidis at gmail.com (Panos Ioannidis)
Date: Tue, 24 Mar 2015 16:05:54 +0100
Subject: [maker-devel] Augustus retraining
In-Reply-To:
References: <09C2F8D0-7014-45DD-B067-86ABB37F4A68@gmail.com>
Message-ID:

Yes, 6% is gene-level sensitivity. Exon-level is 62% and nucleotide-level is 88%.
I only mentioned gene-level because that's the only metric mentioned on the Augustus web site.

I got these numbers outside of Maker. Actually, I only used Maker to generate the GFF files needed to start the training (ran it using only EST evidence and only on a subset of my assembly, using this as a guide).

Now I've started running the second round of training, as you suggested. Since, however, I don't have data from closely related species, I'm only using UniRef50 as protein evidence.

P
From carsonhh at gmail.com Tue Mar 24 09:38:08 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 24 Mar 2015 09:38:08 -0600
Subject: [maker-devel] Augustus retraining
In-Reply-To:
References: <09C2F8D0-7014-45DD-B067-86ABB37F4A68@gmail.com>
Message-ID: <4EA5A1F6-2950-4D65-A59C-0F3848C86C02@gmail.com>

I'd pick a couple of species that are as closely related as you can find. Proteins will still align over large evolutionary distances, but you need a breadth of protein data that UniRef and even databases like Swiss-Prot won't have (those databases are usually a little too conservative).

The gene-level sensitivity/specificity values make certain assumptions about the models you are comparing with. Unfortunately, the only species these assumptions hold true for is Drosophila, and maybe C. elegans to a point. This is the reason you will rarely see species other than those two used for the gene and exon sensitivity/specificity metrics.

Thanks,
Carson
From alicebdennis at gmail.com Thu Mar 26 04:34:26 2015
From: alicebdennis at gmail.com (Alice Dennis)
Date: Thu, 26 Mar 2015 11:34:26 +0100
Subject: [maker-devel] iterative Maker2
In-Reply-To: <1c9847450c4142d3b7e2b81f893773f0@EE-HUB1.ee.emp-eaw.ch>
References: <1FD5809847938F44B92893606806BD53600D845F@EE-MBX1.ee.emp-eaw.ch> <1c9847450c4142d3b7e2b81f893773f0@EE-HUB1.ee.emp-eaw.ch>
Message-ID:

Hello again,

I posted a while ago about a genome I'm running through the Maker2 pipeline. I was concerned because my results were still changing with 3 and 4 iterations. Following the very useful advice of Carson (below), I've made a few modifications (adding a RepeatModeler run, using a big protein database), but my gene predictions are still changing between the 3rd and 4th iterations. Perhaps this is OK, but these increasing gene lengths make me worry that I haven't built stable models.

Here is the short version of what I've done.

1. Run RepeatModeler, but this only produced 47 sequences in the resulting .fasta, so that seemed a bit small.
2. Run Maker2 using:
   - RepeatModeler output + "model_org=all" and "softmask=1" in the Repeat Masking section.
- protein evidence from 2 distantly related species AND all of UniProt
- ESTs from a different strain of my species (a parasitoid wasp)
- the .hmm from Nasonia, one of the 2 distantly related species whose proteome I also provided as protein evidence
- my assembled genome of 1,509 scaffolds.

3. After this, I did three subsequent rounds of Maker2 (cleverly named Rounds 2, 3 and 4). Each one used the same input, except that the Nasonia .hmm was replaced by a SNAP-generated .hmm from the previous round. Also, est2genome and protein2genome were changed from 1 to 0 in all runs after the first.

Here are some results:
Round1: 14,647 genes, average length 2,491
Round2: 12,158 genes, average length 3,760
Round3: 13,515 genes, average length 3,090
Round4: 12,169 genes, average length 3,918

This is a bit confusing because the number of genes predicted goes up and down, as do their lengths. I've double-checked the dates of my files, and they are all labeled such that I don't think anything could have been swapped.

So my questions are:
Is this an indication that my models are unstable and I shouldn't trust these predictions?
Is the decreasing number of genes, combined with their increasing length, perhaps a good thing?
How do I know when to stop if genes keep getting longer?

Thanks very much,
Alice

On Fri, Dec 12, 2014 at 4:41 PM, Carson Holt wrote:
> The gene models are actually produced by SNAP, Augustus, or whatever gene
> predictor you are using, so if you change the HMM every round, then the
> models will change too. But I have one concern. You are using a very
> sparse protein evidence dataset. The protein dataset is very important to
> MAKER's performance, and for iterative training of the ab initio
> predictors. Normally after the second iteration, additional training should
> not be beneficial, but if you are getting wildly different results on 3rd
> and 4th round, then you probably aren't getting sufficient good models to
> train with.
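Per-round figures like the Round1 to Round4 numbers above (gene count and average gene length) can be recomputed from each round's GFF3 in a consistent way, which makes it easier to see whether successive rounds are settling down. The following is a minimal sketch, not part of MAKER: the function name `gene_stats` is hypothetical, and it assumes that final annotations carry `maker` in the source column and `gene` in the type column, and that gene length means end minus start plus one (introns included). The example records are made up.

```python
def gene_stats(gff_lines, source="maker"):
    """Return (gene_count, mean_gene_length) for 'gene' features of one source."""
    lengths = []
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # an embedded sequence section follows; no more features
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete nine-column GFF3 record
        if cols[1] == source and cols[2] == "gene":
            # GFF3 coordinates are 1-based and inclusive
            lengths.append(int(cols[4]) - int(cols[3]) + 1)
    mean = sum(lengths) / len(lengths) if lengths else 0.0
    return len(lengths), mean


# Made-up records: two maker genes plus features that must be ignored.
example = """##gff-version 3
scaffold1\tmaker\tgene\t1001\t3000\t.\t+\t.\tID=gene1
scaffold1\tmaker\tmRNA\t1001\t3000\t.\t+\t.\tID=mRNA1;Parent=gene1
scaffold1\tsnap_masked\tmatch\t1001\t2500\t.\t+\t.\tID=match1
scaffold2\tmaker\tgene\t500\t1499\t.\t-\t.\tID=gene2
"""

count, mean_length = gene_stats(example.splitlines())
print(count, mean_length)
```

For a real run one would pass an open file handle for each round's merged GFF3 instead of the example string and compare the resulting pairs across rounds.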
>
> For a protein dataset you should be using the entire proteome from a
> minimum of two related species and perhaps all of UniProt/Swiss-Prot to get
> a broad protein database. Don't use the proteins extracted by CEGMA and
> HaMSTr. CEGMA can be used to guide the first HMM creation (the cegma2zff script
> that comes with MAKER), but don't give the proteins to MAKER as evidence;
> also, the HaMSTr results will be redundant with the ESTs. You need proteins
> from related species to look for homology not found in the EST dataset.
>
> Also repeat masking is important for any genome and has a huge effect on ab
> initio predictor performance. Make sure you run something like
> RepeatModeler to look for species-specific repeats that will not already be
> in RepBase. Then add those results to the rmlib= option in the maker
> control files.
>
> Thanks,
> Carson
>
> On Dec 12, 2014, at 7:10 AM, Dennis, Alice wrote:
>
> Hi all,
>
> I am a relatively new user of Maker2, and I'm looking for advice on running
> many iterations of the same dataset in Maker2.
>
> I have a relatively small genome (~124 Mb) from a wasp that is assembled
> into ~1,500 scaffolds. I have run several iterations of Maker2 by
> re-generating .hmms in SNAP and feeding them into the next round, and my
> gene predictions keep increasing (in number and in size). The only thing
> that changes at each round is the .hmm.
> The evidence that I give is:
> - de novo assembled ESTs from a different strain of the same
> species (70,000 contigs; I am currently working on improving this assembly
> with the hope that this will be helpful here)
> - 610 proteins extracted from the genome scaffolds using CEGMA and
> HaMSTr
>
> For my 1st iteration, I used the Nasonia .hmm from SNAP, and the
> est2genome/protein2genome option.
>
> For the 2nd, 3rd and 4th rounds I have used .hmms generated from the
> previous round, all without the est2genome/protein2genome option.
> All other files are the same as in the original run.
>
> As I understand it, after the second round, nothing should change in Maker2.
> But the differences are obvious between runs. Some entirely new exons are
> annotated. For example, just counting "exon" in the .gff file gives me
> 73,000 after the third iteration and 96,000 after the fourth! Actually the
> biggest leap in this number is between the third and fourth round. I can
> also see that many features are longer when I look at the files in Geneious.
>
> Is this sort of change possible after the second round of Maker2? Is there
> something I have done wrong in my runs, or am I understanding this output
> incorrectly?
>
> Thank you,
> Alice

--
Alice Dennis
alicebdennis at gmail.com
Postdoctoral Researcher
Institute for Integrative Biology, ETH Zürich & EAWAG
Überlandstrasse 133
P.O. Box 611
8600 Dübendorf, Switzerland
https://adennis5.wordpress.com/

From michael.s.campbell1 at gmail.com Thu Mar 26 09:50:41 2015
From: michael.s.campbell1 at gmail.com (Michael Campbell)
Date: Thu, 26 Mar 2015 09:50:41 -0600
Subject: [maker-devel] iterative Maker2
In-Reply-To:
References: <1FD5809847938F44B92893606806BD53600D845F@EE-MBX1.ee.emp-eaw.ch> <1c9847450c4142d3b7e2b81f893773f0@EE-HUB1.ee.emp-eaw.ch>
Message-ID:

Hi Alice,

In my experience, fewer and longer genes are generally a good thing (and very normal), resulting from the merging of split models and the extension of incomplete models. I find it helpful to load the annotations and evidence into a browser to get a visual idea of what is happening.

Mike

On Thu, Mar 26, 2015 at 4:34 AM, Alice Dennis wrote:
> Hello again,
> [...]

--
Michael Campbell MS, RD.
Doctoral Candidate
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330
ph:585-3543

From rens.holmer at wur.nl Mon Mar 30 00:12:20 2015
From: rens.holmer at wur.nl (Holmer, Rens)
Date: Mon, 30 Mar 2015 06:12:20 +0000
Subject: [maker-devel] Incorporating cufflinks in maker
Message-ID: <42E0168F-C672-4B7F-97D4-98442B825BF9@wur.nl>

Hi maker team,

I am currently working on a project where we want to incorporate quite a lot of RNA-seq into our annotation. Currently I see two options:
- Provide the cufflinks output as est_gff
- Process cufflinks output with TransDecoder (find ORFs, annotate UTR, etc) and provide this as either pred_gff or model_gff

What would you suggest, and what would be the required formatting for both options?

Thanks in advance,
Rens Holmer

From goutham.atla at gmail.com Fri Mar 27 23:37:08 2015
From: goutham.atla at gmail.com (Goutham atla)
Date: Sat, 28 Mar 2015 11:07:08 +0530
Subject: [maker-devel] Annotating Cufflinks GTF with Maker
Message-ID:

Dear All,

I have a draft genome for my organism of interest and around 150G of 100bp paired-end RNA-Seq data from different conditions. This organism has Ensembl annotations, but very few. My goal is to look at differential splicing between two conditions. For this I need good annotations in GTF format at the isoform level. I am interested in using the Splicing Analysis Kit.

For now, I have aligned one sample to the genome using tophat2 and then used cufflinks to generate a de novo GTF file. In either case I have not used the available GTF with its very few annotations.
The GTF file generated by cufflinks should be annotated to know the function of each transcript. So I am interested in adding annotations to the gtf file generated from cufflinks. What is the best of doing it ? Or is there any better way of getting a gtf file, like that of ensemble, from my data ? I have looked at trinotate, but its more about functional annotation and expression studies. Regards, -- Goutham Atla -------------- next part -------------- An HTML attachment was scrubbed... URL: From avhoeck at SCKCEN.BE Mon Mar 30 10:11:16 2015 From: avhoeck at SCKCEN.BE (Van Hoeck Arne) Date: Mon, 30 Mar 2015 16:11:16 +0000 Subject: [maker-devel] comments on Incorporating cufflinks in maker Message-ID: <9BCA01D5BDC2AF46822CA182B4FBD0DF11826E1A@MAILSRV3.sck.be> Dear Rens and Carlson, I would like to comment on Rens' question on adding transcript data into your annotation. Since I have no access to google groups, I try to contact you both via mail with, hopefully, the correct mail addresses. I have tested Maker-P with multiple parameters, optimized 2 gene prediction tools (SNAP and augustus). Both give similarly results at which maker find around 17000 gene annotations. I also have RNAseq samples, and as a test case I also used Cufflinks output and processed with TransDecoder to select gene annotations. By using this approach, I end up with 25000 gene annotations. More or less, all the genes that Maker selected were also selected by the TransDecoder approach. So is it wise to include the information on the 8000 missing genes, based on optimized gene models? probably, there will some false positive in these 8000 missing genes, but with good criteria and models, it would maybe possible that maker can find more genes annotations. Best regards Arne Hi maker team, I am currently working on a project where we want to incorporate quite a lot of RNA-seq into our annotation. 
Currently I see two options: Provide the cufflinks output as EST-gff Process cufflinks output with TransDecoder (find ORFs, annotate UTR, etc) and provide this as either pred_gff or model_gff What would you suggest, and what would be the required formatting for both options? Thanks in advance, Rens Holmer [-] Consider the environment before you print Denk aan het milieu voor u deze e-mail print Pensez ? l'environnement avant d'imprimer [-] [-] SCK?CEN Disclaimer: http://www.sckcen.be/en/e-mail_disclaimer -------------- next part -------------- An HTML attachment was scrubbed... URL: From kai.kamm at ecolevol.de Thu Mar 5 09:47:02 2015 From: kai.kamm at ecolevol.de (Kai Kamm) Date: Thu, 05 Mar 2015 17:47:02 +0100 Subject: [maker-devel] Better resolve conflicting gene models Message-ID: <54F88886.9010004@ecolevol.de> Hello, thanks for your previous advice. (Btw, how can one reply to an existing thread such that the reply will be added to the same thread?) I am trying to find the best parameters with Maker for the annotation of my genome. I have run Maker with several combinations of parameters and predictors on my three biggest scaffolds and looked at the results in Jbrowse. Overall most predictions seem fine, but there are some genes with conflicts and I have no idea why. I have: - 100Mb assembled genome - Trinity RNAseq assembly - cufflinks data (in my case don't seem to be messy as suggested, rather a good complement to the trinity data)) - protein evidence (related and unrelated species) - repeat library from repeat modeler Gene predictors used: - Augustus trained with transcripts from related species: seems to perform fine - SNAP: no convergence with Augustus even after second training. Dropped it because it predicted lots of additional low quality transcripts and sometimes disrupted final Maker transcripts. - Genemark: converged with Augustus after training (introns received from TopHat2 output). 
Tends to predict some additional transcripts (compared to Augustus). Few (but some) of these are covered by evidence and thus become final Maker transcripts. So the combination of Augustus and Genemark seems optimal. In general both perform well in Maker and tend to predict the same transcripts. However, I still observe some problems in the behavior of Maker which I don't understand: Example 1: One of the predictors predicts a small additional exon at the start which is also covered by protein or EST data. But sometimes Maker chooses the other predictors model for the final transcript. Mostly these are minor differences but I don't understand this behavior? Example 2: there are some extreme cases like an Augustus prediction with 17 exons which are all covered by Trinity and cufflinks isoforms. Genemark instead predicts two separate small genes with 2 and 4 exons respectively. The resulting final transcript has 7 exons and the additional evidence from the trinity and cufflinks data is treated as UTR. So I thought Augustus seems a little more accurate and run Maker only with Augustus to resolve such conflicts, even though I would loose the few additional transcripts from Genemark. This is what happened: - The gene in Example 2 now has all the 17 exons. This is good! - Sadly another gene with several exons, which was formerly predicted by both Augustus and Genemark and is also covered by cufflinks and trinity transcripts, now consists only of two small exons in the final transcript. Even though Augustus still predicts the same exons and the same evidence is present - only the Genemark prediction is absent which was almost identical to Augustus. This I completely don't understand. I don't worry about the minor differences. The extreme cases are like two genes in a hundred and I don't understand the behavior. I was thinking that in case of conflicting models Maker will choose the one that best fits the evidence. 
Obviously with most conflicts this is what happens, because the majority of the final models look OK. But not the above mentioned cases and I don't understand why? Is there any parameter I missed to better resolve such conflicts? Best From bmoore at genetics.utah.edu Thu Mar 5 17:20:52 2015 From: bmoore at genetics.utah.edu (Barry Moore) Date: Fri, 6 Mar 2015 00:20:52 +0000 Subject: [maker-devel] Maker Software Question In-Reply-To: <6ECE70F32BE356478EC060940C2AA4D101175BBF02@CVMMB02.cvm.tamu.edu> References: <6ECE70F32BE356478EC060940C2AA4D101175BBF02@CVMMB02.cvm.tamu.edu> Message-ID: Hi Chris, I don?t know if others have responded to you already, but I just found this e-mail unanswered in the depths of my e-mail inbox, so apologies for no reply. I?m going to forward you e-mail along to the full maker mailing list so that you?ll get the benefit of response from the full group of MAKER developers. MAKER does include a couple of scripts that add functional annotation to MAKER GFF3 output. This process is described in the recent paper: Campbell MS, Holt C, Moore B, Yandell M. Genome Annotation and Curation Using MAKER and MAKER-P. Curr Protoc Bioinformatics. http://www.ncbi.nlm.nih.gov/pubmed/25501943 Mike do you have a PDF of the final print version of that you could send directly to Christopher? B On Jan 16, 2015, at 8:38 AM, Seabury, Christopher > wrote: Dear Colleagues, I would like to quickly ask about a specific routine/possible function in MAKER. Previously, we have essentially made home-made versions of maker by way of Multi-step programming. At present we are exploring MAKER but are wondering IF MAKER has the ability to populate the GFF with GENE/Protein ID information? As an initial experiment, we gave MAKER cDNA refseqs, plus corresponding Protein refseqs, And a reference, but do not see the GENE/Protein ID in the GFF. Is there a subroutine For this, or option we have missed? Thanks and Kind Regards, Christopher M. 
Seabury PhD Associate Professor Department of Veterinary Pathobiology College of Veterinary Medicine Texas A&M University College Station, TX 77843-4467 cseabury at cvm.tamu.edu Mobile: 979-492-6400 -------------- next part -------------- An HTML attachment was scrubbed... URL: From bmoore at genetics.utah.edu Mon Mar 9 12:12:10 2015 From: bmoore at genetics.utah.edu (Barry Moore) Date: Mon, 9 Mar 2015 18:12:10 +0000 Subject: [maker-devel] Does the maker google forum works? -[Doubt] maker2zff line 109 In-Reply-To: <20150309190735.Horde.OZGtWxZkHnkFxArh6ONp6A2@webmail.um.es> References: <20150309190735.Horde.OZGtWxZkHnkFxArh6ONp6A2@webmail.um.es> Message-ID: <13826169-DFBB-4E32-AFB1-A01CB3EEE0A4@genetics.utah.edu> Hi Javier, The Google forum is only a mirror of the actual mailing list that allows it gets archived by Google (better Google search results for MAKER users), so posting is not allowed there. Please join the official MAKER mailing list at: http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org Thanks, B On Mar 9, 2015, at 12:07 PM, FRANCISCO JAVIER SANCHEZ GARCIA > wrote: Good morning. I am trying problem to write messages in the help forum of maker in google groups. I dont know if my problem or contrary it might be a problem with the permissions. But i cant see the red button of new threads. Anyway, I will try to show my problem with maker2zff. Which does not work. My version is the v2.31.8. ../maker/marker_v2.31.8/maker/bin/maker2zff ../sequences.all.gff everything.ann everything.dna [sudo] password for soba: No such file or directory at ../maker/marker_v2.31.8/maker/bin/maker2zff line 109, line 1922870. I read something about the problematic characters in the ID . But i dont know if it is my example. http://gmod.827538.n3.nabble.com/Re-Error-when-running-maker2zff-script-td4041743.html Regards. Thanks in advance. 
Fco Javier Sánchez-García
PhD student
Forest Entomology
Área de Biología Animal
Departamento de Zoología y Antropología Física
Facultad de Veterinaria
Universidad de Murcia
Campus de Espinardo
30100 Murcia (España-Spain)
Telf. +34 660 500 416 (mobile phone)
+34 868 888 031 (laboratory-work phone)
http://scholar.google.es/citations?user=AKoUT8cAAAAJ&hl
http://www.researchgate.net/profile/Francisco_Sanchez-Garcia
http://orcid.org/0000-0002-5442-0292
http://www.researcherid.com/rid/M-2407-2014

From javiersg at um.es Mon Mar 9 16:27:00 2015
From: javiersg at um.es (FRANCISCO JAVIER SANCHEZ GARCIA)
Date: Mon, 09 Mar 2015 23:27:00 +0100
Subject: [maker-devel] [Doubt] maker2zff line 109
Message-ID: <20150309232700.Horde.39Aj_iELfhGei1Bv5JG1fw6@webmail.um.es>

Good night everyone. I will try to show my problem with maker2zff, which does not work. My version is v2.31.8. The line named in the error message ("No such file or directory") is the last line of the GFF file.

../maker/marker_v2.31.8/maker/bin/maker2zff
../sequences.all.gff everything.ann everything.dna [sudo] password for soba: No such file?or directory?at?../maker/marker_v2.31.8/maker/bin/MAKER2ZFF LINE 109, line 1922870. I read something about the problematic characters in the ID . But i dont know if it is my example. http://gmod.827538.n3.nabble.com/Re-Error-when-running-maker2zff-script-td4041743.html Regards. Thanks in advance. Fco Javier S?nchez-Garc?a PhD student Forest Entomology ?rea de Biolog?a Animal Departamento de Zoolog?a y Antropolog?a F?sica Facultad de Veterinaria Universidad de Murcia Campus de Espinardo 30100 Murcia (Espa?a-Spain) Telf. +34 660 500 416 (mobile phone) +34 868 888 031 (laboratory-work phone) http://scholar.google.es/citations?user=AKoUT8cAAAAJ&hl http://www.researchgate.net/profile/Francisco_Sanchez-Garcia http://orcid.org/0000-0002-5442-0292 http://www.researcherid.com/rid/M-2407-2014 -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Thu Mar 12 13:50:44 2015 From: carsonhh at gmail.com (Carson Holt) Date: Thu, 12 Mar 2015 13:50:44 -0600 Subject: [maker-devel] Better resolve conflicting gene models In-Reply-To: <54F88886.9010004@ecolevol.de> References: <54F88886.9010004@ecolevol.de> Message-ID: <7B8D93E6-DEE7-43D7-BFBF-38E474311A6C@gmail.com> Sorry for the slow reply. > how can one reply to an existing thread such that the reply will be added to the same thread? Threads show up under the Google archive, but that is really just an artifact of how things are archived. The mailing list itself doesn?t have threads, just a subject line like any other e-mail. But because we are archiving to a message board any messages with the same subject get turned into parts of the same thread. > Example 1: One of the predictors predicts a small additional exon at the start which is also covered by protein or EST data. But sometimes Maker chooses the other predictors model for the final transcript. 
> Mostly these are minor differences but I don't understand this behavior?

The gene chosen by MAKER is the one that best matches the evidence. The match is measured by a numeric value called AED (lower means a better match). If a model predicts an exon or a base pair that is not supported by an evidence alignment, then it will be penalized. If a model fails to predict a base pair that is supported by evidence, then it will also be penalized. The differences in your results arise because you changed something between runs: either you added or removed a predictor (which gives MAKER a different set of models to choose from) or the evidence changed (which gives a different AED score). Whichever model has the lowest AED score will always be kept, so some change you made resulted in either a different set of available models to choose from or a different AED score. Also be aware that models cannot overlap on the same strand, so if you made a change that results in a model being removed from consideration, you may have removed an overlap that was precluding another model with a slightly higher AED from being chosen.

> Example 2: there are some extreme cases like an Augustus prediction with 17 exons which are all covered by Trinity and cufflinks isoforms. Genemark instead predicts two separate small genes with 2 and 4 exons respectively. The resulting final transcript has 7 exons and the additional evidence from the trinity and cufflinks data is treated as UTR.
>
> - Sadly another gene with several exons, which was formerly predicted by both Augustus and Genemark and is also covered by cufflinks and trinity transcripts, now consists only of two small exons in the final transcript.
> Even though Augustus still predicts the same exons and the same evidence is present - only the Genemark prediction is absent which was almost identical to Augustus. This I completely don't understand.

The model chosen will always be the one with the lowest AED.
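The two penalties described above can be illustrated with a small sketch. It assumes the commonly cited definition AED = 1 - (SN + SP) / 2, where SN is the fraction of evidence-supported bases that the model covers and SP is the fraction of model bases that the evidence supports. The formula, the function names and the toy coordinates are assumptions for illustration and are not taken from this thread; MAKER's own calculation deals with strands, evidence types and other details that are ignored here.

```python
def covered_bases(intervals):
    """Return the set of positions covered by 1-based, inclusive intervals."""
    bases = set()
    for start, end in intervals:
        bases.update(range(start, end + 1))
    return bases


def aed(model_exons, evidence_alignments):
    """Annotation edit distance between a gene model and its evidence.

    0.0 means perfect agreement, 1.0 means no agreement at all.
    """
    model = covered_bases(model_exons)
    evidence = covered_bases(evidence_alignments)
    if not model or not evidence:
        return 1.0  # nothing to compare: treat as completely unsupported
    shared = len(model & evidence)
    sensitivity = shared / len(evidence)  # evidence bases the model recovers
    specificity = shared / len(model)     # model bases backed by evidence
    return 1.0 - (sensitivity + specificity) / 2.0


# Toy evidence: two aligned blocks of 100 bp each.
evidence = [(100, 199), (300, 399)]

exact_model = [(100, 199), (300, 399)]              # matches the evidence
extra_exon = [(100, 199), (300, 399), (500, 599)]   # one unsupported exon
missing_exon = [(300, 399)]                         # misses a supported exon

print(round(aed(exact_model, evidence), 3))   # 0.0
print(round(aed(extra_exon, evidence), 3))    # 0.167
print(round(aed(missing_exon, evidence), 3))  # 0.25
```

In this toy case the exact model would be kept. If it were not available (for example because the predictor that produced it was dropped), the choice would fall to whichever remaining model scores lowest, which is one way a change in predictors can change the final annotation even when the evidence is unchanged.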
The changes you describe all come from a model with a lower AED supplanting another, which may have opened up a region that was previously overlapped by a longer gene with a poor AED score.

I would also recommend not including cufflinks output if you have trinity data. Cufflinks results often falsely bridge regions and will cause gene mergers by falsely extending evidence clusters. This makes it appear as if genes should extend into regions where they shouldn't. Given those hints, predictors will then try to predict at least something in those regions, and if they do, the models will get a good AED score since they are supported by evidence, even if it is false evidence.

--Carson

From carsonhh at gmail.com Thu Mar 12 14:03:11 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 12 Mar 2015 14:03:11 -0600
Subject: [maker-devel] maker-devel post from avhoeck@sckcen.be requires approval
In-Reply-To:
References:
Message-ID: <73D91049-1CB8-4C59-A9E5-58374FCB0789@gmail.com>

Hi Arne,

The genes found by CEGMA are very short genes, so while those genes might be identifiable at least partially on shorter contigs, other genes will be far longer. So the 87% value you get from CEGMA is probably what you are going to be able to find (even partially) for the entire genome. Remember that gene size, when you include the introns, can actually be quite large, and gene predictors need flanking signals from the sequence several hundred bp upstream and downstream to make a prediction, so having a gene only partially contained in a contig makes that gene un-annotatable for ab initio predictors. Unless the organism has a high gene density or very few introns, you usually won't be able to find a gene on contigs smaller than ~10kb.
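To judge how much of an assembly sits in contigs below a cutoff such as the ~10 kb mentioned above, one can compute the N50 together with the fraction of bases held in contigs at or above the cutoff. This is a minimal sketch under stated assumptions: the function names and the toy contig lengths are hypothetical, and in practice the lengths would be read from the assembly FASTA.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input


def fraction_at_least(lengths, cutoff):
    """Fraction of all assembled bases found in contigs of length >= cutoff."""
    total = sum(lengths)
    if total == 0:
        return 0.0
    return sum(length for length in lengths if length >= cutoff) / total


# Toy assembly of six contigs (lengths in bp).
contig_lengths = [40000, 20000, 20000, 10000, 5000, 5000]

print(n50(contig_lengths))
print(fraction_at_least(contig_lengths, 10000))
```

The second number is the share of the assembly on which ab initio prediction has a realistic chance under the ~10 kb rule of thumb; the rest depends largely on evidence alignments.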
--Carson

> On Mar 12, 2015, at 10:38 AM
>
> From: Van Hoeck Arne
> To: "maker-devel at yandell-lab.org"
> Subject: TACC lonestar and N50 value
> Date: March 12, 2015 at 10:38:42 AM MDT
>
> Dear MAKER developer,
>
> We have a plant genome of about 450 Mbp with an N50 value of 20 kbp whereas only 3/4 (333 Mbp) are contigs longer than 10 kbp. CEGMA said that 87% of the genes were found, whereas 94 % were partial identified. You said last time that contigs smaller than 10kbp are not ideal for annotating and preferable to throw them away. Does this mean that I lose all genes present in the small contigs? Or is there another way to annotate them? (is concatenating all the small contigs together with 500 N's between each contig an option?)
>
> Besides, i could run succesfully Maker via iplant's atmoshpere. However, for my large genomes i registred myself at the TACC lonestar cluster but Dave C. replied that i won't be able to run on the TACC supercomputers without an allocation. He said that I need to contact my PI. With my loginID, i haven't any acces to the cluser via ssh since my permission was denied. Therefore, is it possible to use the TACC supercomputers to run MAKER?
>
> Best regards
> Arne
From avhoeck at SCKCEN.BE Thu Mar 12 10:38:42 2015
From: avhoeck at SCKCEN.BE (Van Hoeck Arne)
Date: Thu, 12 Mar 2015 16:38:42 +0000
Subject: [maker-devel] TACC lonestar and N50 value
Message-ID: <9BCA01D5BDC2AF46822CA182B4FBD0DF117AB267@MAILSRV3.sck.be>

Dear MAKER developer,

We have a plant genome of about 450 Mbp with an N50 value of 20 kbp, whereas only 3/4 (333 Mbp) is in contigs longer than 10 kbp. CEGMA said that 87% of the genes were found, whereas 94% were partially identified. You said last time that contigs smaller than 10 kbp are not ideal for annotating and that it is preferable to throw them away. Does this mean that I lose all genes present in the small contigs? Or is there another way to annotate them? (Is concatenating all the small contigs together with 500 N's between each contig an option?)

Besides, I could successfully run MAKER via iPlant's Atmosphere. However, for my large genomes I registered at the TACC Lonestar cluster, but Dave C. replied that I won't be able to run on the TACC supercomputers without an allocation. He said that I need to contact my PI. With my login ID, I don't have any access to the cluster via ssh, since my permission was denied. Therefore, is it possible to use the TACC supercomputers to run MAKER?

Best regards
Arne

From mtollis at asu.edu Fri Mar 13 14:50:33 2015
From: mtollis at asu.edu (Marc Tollis)
Date: Fri, 13 Mar 2015 13:50:33 -0700
Subject: [maker-devel] Question about pre-masked genome.
Message-ID:

Hello,

I have a genome that has been well-annotated for species-specific repeats (using RepeatModeler, trf, and RepeatMasker), and was wondering if I could simply use the soft-masked assembly for maker annotation, in order to skip the sometimes cumbersome repeatmasking step. Is this generally looked upon as doable, or should I just stick with using the unmasked assembly and the repeatmasking as bundled with my maker installation?

P.S. - as a spoiler, I already tried this (for snap training), but could not confirm that the masked assembly (i.e., lowercase letters) was maintained in the scaffolds as viewed in the maker output directory.

Thanks,
Marc

--
Marc Tollis, Ph.D.
Post-Doctoral Research Associate
Arizona State University
LSE 313
(480) 965-7456
marc.tollis at asu.edu
website: https://sites.google.com/site/tollisresearch/
blog: anolistollis.wordpress.com
From dence at genetics.utah.edu Fri Mar 13 18:14:52 2015
From: dence at genetics.utah.edu (Daniel Ence)
Date: Sat, 14 Mar 2015 00:14:52 +0000
Subject: [maker-devel] Question about pre-masked genome.
In-Reply-To:
References:
Message-ID: <51DE3D97-10DB-4FF9-A793-EEA051C705BD@genetics.utah.edu>

Hi Marc,

Several groups have been doing what you described with fungal genomes for a while and it seems to be working for them. With the soft-masked genome, MAKER will still allow EST alignments to extend into the masked regions; if it were a hard-masked genome, this wouldn't be possible. Let us know how it works out though!

Thanks,
Daniel
From mtollis at asu.edu Sun Mar 15 08:19:37 2015
From: mtollis at asu.edu (Marc Tollis)
Date: Sun, 15 Mar 2015 07:19:37 -0700
Subject: [maker-devel] control file for SNAP training
Message-ID:

This is a question about process, and to make sure I am doing things right (when time is of the essence, some mistakes can set you back weeks).

I have run MAKER on my de novo vertebrate genome, using only the predicted proteome from a congener (well-studied and available on Ensembl), and generated the HMM for the first round of SNAP training. As per the 2014 tutorial, I edited the control file for this step as follows: I added the path to the .hmm file, and set protein2genome to 0.

When I run MAKER, I notice that, in addition to SNAP, it is still running blastx and exonerate. I noticed that this is because I did not remove (or "comment out") the path to the protein.fa in the control file (the output looks markedly different when I do comment out the protein file, and I can't even tell if it's running SNAP in this instance).

Is it simply using exonerate to place the ab initio predictions on the scaffolds (meaning that protein2genome=1 only tells MAKER to make evidence-based annotations)? Did I do this correctly, or should I also remove the protein.fa from the control file for SNAP training?
From steinj at cshl.edu Mon Mar 16 07:29:36 2015
From: steinj at cshl.edu (Stein, Joshua)
Date: Mon, 16 Mar 2015 13:29:36 +0000
Subject: [maker-devel] TACC lonestar and N50 value
In-Reply-To: <9BCA01D5BDC2AF46822CA182B4FBD0DF117AB267@MAILSRV3.sck.be>
References: <9BCA01D5BDC2AF46822CA182B4FBD0DF117AB267@MAILSRV3.sck.be>
Message-ID: <0A0DBB88-EC95-43B7-AA75-B7858B67A8F4@cshl.edu>

Hi Arne,

I have experience with iPlant resources and with MAKER-P. I would encourage you to try the Atmosphere image (7888b8e1-c006-4794-82d9-4c940ddbf4c6). You can request a large instance (up to 16 CPUs and 128 GB memory) and run in MPI mode to distribute the work. Please see this tutorial, which includes information on running in MPI mode: https://pods.iplantcollaborative.org/wiki/display/sciplant/MAKER-P+Atmosphere+Tutorial.

You can also access the TACC Lonestar installation using the iPlant Discovery Environment. There is an app called "MAKER-P-Lonestar-Small-Genomes 2.3". Although it is advertised as appropriate for "small" genomes, I think there is a good chance that it will work for 450 Mb. This is a new resource and the iPlant team would value any feedback and benchmarks on how the system is working. Depending on how this goes, there are plans to roll out additional apps intended for larger genomes. Here is a tutorial: https://pods.iplantcollaborative.org/wiki/display/sciplant/Tutorial+for+running+MAKER-P+on+TACC-Lonestar+from+iPlant+Discovery+Environment

Regarding contig sizes, though not ideal, you can include contigs smaller than 10 kbp in your run. Plant genes tend to be more compact than vertebrate genes, so you ought to be able to recover annotations on the smaller contigs, though keep an eye out for truncated genes.

Best,
Josh

Joshua Stein, PhD
Manager, Sci. Informatics III
Cold Spring Harbor Laboratory
steinj at cshl.edu
http://ware.cshl.org/

From mtollis at asu.edu Tue Mar 17 15:26:44 2015
From: mtollis at asu.edu (Marc Tollis)
Date: Tue, 17 Mar 2015 14:26:44 -0700
Subject: [maker-devel] control file for SNAP training
In-Reply-To:
References:
Message-ID:

I answered my own question: there is no need to re-align the proteins again, since that takes too long. So, I used the GFF file from the gff_merge on the log file from the first run (the one with just protein2genome). Then, after generating the .hmm file, I put it in my control file, along with protein2genome=0, removed the protein.fasta, set maker_gff and protein_pass=1.
The output now shows that only SNAP is running, and no blastx or exonerate, which is a relief because it is much faster!
From carsonhh at gmail.com Tue Mar 17 20:47:50 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 17 Mar 2015 20:47:50 -0600
Subject: [maker-devel] control file for SNAP training
In-Reply-To:
References:
Message-ID:

You can also add an additional protein file if you want more evidence than that already in the GFF3. It makes re-annotation and adding evidence to an already annotated genome file really easy.

--Carson
Thanks, Brian Query= comp17103_c1_seq3 len=612 path=[8488307:0-177 8492039:178-496 >contig_69 Length=108040 Score = 1043 bits (1156), Expect = 0.0 Identities = 589/592 (99%), Gaps = 3/592 (1%) Strand=Plus/Plus Query 24 TCTTTATTCTTTTATTTCCACTTGAGCAATTATTTCCGGGTCAACCTATTCGGTCGTTCT 83 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 105546 TCTTTATTCTTTTATTTCCACTTGAGCAATTATTTCCGGGTCAACCTATTCGGTCGTTCT 105605 Query 84 CTCCGTTGAGCC-TTCCCTCCCCAAGTAATATTGGAAAGTCGTTCTCTCGCTCATAATTT 142 |||||||||||| ||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 105606 CTCCGTTGAGCCCTTCCCTCCCCAAGTAATATTGGAAAGTCGTTCTCTCGCTCATAATTT 105665 69 blastn expressed_sequence_match 105546 106137 559 + . ID=69:hit:182380:3.2.0.0;Name=comp17103_c1_seq3 69 blastn match_part 105546 106137 559 + . ID=69:hsp:377369:3.2.0.0;Parent=69:hit:182380:3.2.0.0;Target=comp17103_c1_seq3 24 612 +;Gap=M70 D1 M85 D1 M75 D1 M359 69 est2genome expressed_sequence_match 105546 106137 2909 - . ID=69:hit:182644:3.2.0.0;Name=comp17103_c1_seq3 69 est2genome match_part 105546 106137 2909 - . ID=69:hsp:377775:3.2.0.0;Parent=69:hit:182644:3.2.0.0;Target=comp17103_c1_seq3 24 612 -;Gap=M70 D1 M85 D1 M75 D1 M359 -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Fri Mar 20 08:54:28 2015 From: carsonhh at gmail.com (Carson Holt) Date: Fri, 20 Mar 2015 08:54:28 -0600 Subject: [maker-devel] est2genome wrong strand In-Reply-To: References: Message-ID: Hi Brian, Multi-exon ESTs are stranded by their splice sites which can only work on one strand. Single exon ESTs on the other hand are stranded by their open reading frame. The standard chemistry used in most sequencing does not allow for strand specific sequence. There are technologies that do, but unless you used one of those, re-stranding single exon ESTs based on ORF is the best way to figure out where single exon alignments should go (not a completely reliable method but it works most of the time). 
I believe Trinity tries to determine strand the same way (but it is unreliable there too). For example, since your alignment is not an exact match to the genomic sequence (single bp deletion in the alignment), the best open reading frame in the transcript is not the same as the best open reading frame in the genomic sequence (so one of them likely contains an error). MAKER, since it is annotating the genome, logically re-strands it to the genome, and Trinity, being unaware of the genome, strands it to the transcript.

Because single-exon alignments are very unreliable, they are ignored in MAKER by default. They will not be used as hints for gene predictors and can only be used to support a gene if there is also protein evidence or a single-exon ab initio prediction at the same location to support it (even then this will only happen if you set single_exon=1 in the control files).

--Carson
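[Editor's sketch] The idea of stranding a single-exon alignment by its open reading frame, as described in this thread, can be illustrated roughly as follows. This is not MAKER's or Trinity's implementation; it is a simplified assumption that compares the longest stop-free run of codons over the three frames of each strand, and the function names are invented for the example.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Reverse-complement a DNA string (A/C/G/T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_orf(seq):
    """Length in nt of the longest stop-free run of codons in any of 3 frames."""
    seq = seq.upper()
    best = 0
    for frame in range(3):
        run = 0
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOPS:
                run = 0  # a stop codon ends the current open stretch
            else:
                run += 3
                best = max(best, run)
    return best


def guess_strand(seq):
    """Return '+', '-' or '.' (tie) depending on which strand has the longer ORF."""
    fwd = longest_orf(seq)
    rev = longest_orf(reverse_complement(seq))
    if fwd == rev:
        return "."
    return "+" if fwd > rev else "-"
```

Because a single inserted or deleted base shifts the reading frame, the transcript and the genomic sequence it aligns to can yield different answers from a heuristic like this, which is the kind of disagreement described above.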
From xvazquezc at gmail.com Sat Mar 21 21:27:27 2015
From: xvazquezc at gmail.com (Xabier Vázquez Campos)
Date: Sun, 22 Mar 2015 14:27:27 +1100
Subject: [maker-devel] annotation stats: repeats
Message-ID:

Hi all,

I was wondering how I can get data about the repeat content of the genome from MAKER, if possible, as well as about each type of repeat: RE, transposons, simple repeats, low-complexity repeats.

Thank you in advance,
Xabier

--
Xabier Vázquez Campos
PhD Candidate
Water Research Centre
School of Civil and Environmental Engineering
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

From dence at genetics.utah.edu Sat Mar 21 23:56:06 2015
From: dence at genetics.utah.edu (Daniel Ence)
Date: Sun, 22 Mar 2015 05:56:06 +0000
Subject: [maker-devel] annotation stats: repeats
In-Reply-To:
References:
Message-ID: <4FFB960C-CF1A-440F-B767-36EC369D2B58@genetics.utah.edu>

Hi Xabier,

MAKER offers several options for repeat masking. There's a file of repetitive element proteins that is run with RepeatRunner, and you can run RepeatMasker with a custom library and/or one of the default RepeatMasker libraries. The results from these programs are in the GFF3 files that MAKER generates. Depending on the library that you give RepeatMasker, the repeats might be classified as to transposable element family and simple vs. complex repeats. These are probably a good place to start, but you'll have to extract that data from the GFF3 file and compile it. Let us know whether that helps.
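[Editor's sketch] Extracting and compiling the repeat data from a GFF3 file, as suggested above, might look something like this. It is an assumption-laden illustration: the source-column values (`repeatmasker`, `repeatrunner`), the `match` feature type, and grouping by the `Name` attribute are guesses about how the repeat features are labelled and should be checked against your own file. Overlapping features are counted twice, so the totals are an upper bound.

```python
from collections import defaultdict

# Assumed values of the GFF3 source column for repeat features.
REPEAT_SOURCES = {"repeatmasker", "repeatrunner"}


def repeat_summary(gff_lines, sources=REPEAT_SOURCES, feature_type="match"):
    """Sum feature lengths (bp) per (source, Name) for repeat features."""
    totals = defaultdict(int)
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # the sequence section follows; no more features
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        _seqid, source, ftype, start, end, _score, _strand, _phase, attrs = cols
        if source.lower() not in sources or ftype != feature_type:
            continue
        name = "unknown"
        for field in attrs.split(";"):
            if field.startswith("Name="):
                name = field[len("Name="):]
        # GFF3 coordinates are 1-based and inclusive.
        totals[(source, name)] += int(end) - int(start) + 1
    return dict(totals)
```

Passing an open file handle (for example `open("genome.all.gff")`, a hypothetical file name) as `gff_lines` gives per-repeat totals that can be divided by the assembly size to get percentages.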
Thanks,
Daniel

From panos.ioannidis at gmail.com Tue Mar 24 02:29:14 2015
From: panos.ioannidis at gmail.com (Panos Ioannidis)
Date: Tue, 24 Mar 2015 09:29:14 +0100
Subject: [maker-devel] Augustus retraining
Message-ID:

Hello All,

I'm trying to retrain Augustus using EST data from the same species and realized that quite a few of the gene models I get based on EST data are incomplete (i.e. no start and/or stop codon).

Now, when I get to the "etraining" step in Augustus retraining (right after the time-consuming "optimize_augustus.pl" step), I get a warning for each gene that doesn't contain a start or stop codon.

.....
gene maker-scaffold4|size2210279-exonerate_est2genome-gene-20.1-mRNA-1 transcr. 1 in sequence scaffold4|size2210279_2021791-2044735: Initial exon does not begin with start codon but with acg
gene maker-scaffold4|size2210279-exonerate_est2genome-gene-20.2-mRNA-1 transcr. 1 in sequence scaffold4|size2210279_2045713-2064983: Terminal exon doesn't end in stop codon. Variable stopCodonExcludedFromCDS set right?
....

Does anyone know whether training is compromised by such incomplete gene models? Do you usually exclude them from the training set?

Oh, and by the way, the best guide to retraining Augustus is here .
The official web page isn't bad, but doesn't explain certain things in detail.

Thanks,
Panos

From xvazquezc at gmail.com Tue Mar 24 06:06:25 2015
From: xvazquezc at gmail.com (Xabier Vázquez Campos)
Date: Tue, 24 Mar 2015 23:06:25 +1100
Subject: [maker-devel] Augustus retraining
In-Reply-To:
References:
Message-ID:

Hi Panos,

Have you tried using webAugustus for the (re)training? I found it very convenient for generating the models for Augustus.

Cheers,
Xabier Vázquez Campos

From panos.ioannidis at gmail.com Tue Mar 24 06:24:45 2015
From: panos.ioannidis at gmail.com (Panos Ioannidis)
Date: Tue, 24 Mar 2015 13:24:45 +0100
Subject: [maker-devel] Augustus retraining
In-Reply-To:
References:
Message-ID:

Hi Xabier,

Thanks for your quick reply!

No, I haven't used WebAugustus, but I just checked it out and it looks like my training set is too big (~300 Mbp), so I can't even upload it!

Anyway, I prefer to train it locally because I have better control over each step. Also, I have done the entire training procedure with fewer genes, but didn't get a good gene-level sensitivity (~5%). So now I'm trying to replicate it using more of my scaffolds, but it appears that I get a lot more incomplete models from exonerate (run through MAKER).

P

From carsonhh at gmail.com Tue Mar 24 08:14:51 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 24 Mar 2015 08:14:51 -0600
Subject: [maker-devel] Augustus retraining
In-Reply-To:
References:
Message-ID: <09C2F8D0-7014-45DD-B067-86ABB37F4A68@gmail.com>

Hi Panos,

ESTs and mRNA-seq assemblies will by their nature be partial. After a first round of training you can run MAKER together with protein and EST evidence and the newly trained Augustus species file. Because MAKER gives hints to Augustus as it runs, the models it produces will be improved over what it would get from just running Augustus on its own. Then take these gene models and use them to retrain Augustus.
This is the standard bootstrap retraining procedure, and can be repeated as needed.

More info on bootstrap training here (info is for SNAP but procedure is similar to Augustus) -> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Training_ab_initio_Gene_Predictors
Here is an excellent explanation of Augustus training -> http://brie4.cshl.edu/pipermail/gmod-help/2012-June/001724.html
and here are tools to convert SNAP training files to Augustus training files (MAKER comes with a tool that converts GFF3 for SNAP training so just take that and convert it for Augustus) -> https://github.com/hyphaltip/genome-scripts/blob/master/gene_prediction/zff2augustus_gbk.pl

Finally you can also manually edit the GFF3 file in Apollo (easier to use the legacy stand alone version), and then convert that file for bootstrap training.

--Carson

> On Mar 24, 2015, at 6:24 AM, Panos Ioannidis wrote:
>
> Hi Xabier,
>
> Thanks for your quick reply!
>
> No, I haven't used WebAugustus, but I just checked it out and it looks like my training set is too big (~300 Mbp), so I can't even upload it!
>
> Anyway, I prefer to train it locally because I have better control over each step. Also, I have done the entire training procedure with less genes, but didn't get a good gene-level sensitivity (~5%). So now I'm trying to replicate it using more of my scaffolds, but as it appears I get a lot more incomplete models from exonerate (run through Maker).
>
> P
>
> On Tue, Mar 24, 2015 at 1:06 PM, Xabier Vázquez Campos wrote:
> Hi Panos,
>
> Have you tried using webAugustus for the (re)training? I found it very convenient for generating the models for Augustus.
>
> Cheers,
>
> 2015-03-24 19:29 GMT+11:00 Panos Ioannidis:
> Hello All,
>
> I'm trying to retrain Augustus using EST data from the same species and realized that quite a few of the gene models I get based on EST data are incomplete (i.e. no start and/or stop codon).
>
> Now, when I get to the "etraining" step in Augustus retraining (right after the time-consuming "optimize_augustus.pl" step), I get a warning for each gene that doesn't contain a start or stop codon.
>
> .....
> gene maker-scaffold4|size2210279-exonerate_est2genome-gene-20.1-mRNA-1 transcr. 1 in sequence scaffold4|size2210279_2021791-2044735: Initial exon does not begin with start codon but with acg
> gene maker-scaffold4|size2210279-exonerate_est2genome-gene-20.2-mRNA-1 transcr. 1 in sequence scaffold4|size2210279_2045713-2064983: Terminal exon doesn't end in stop codon. Variable stopCodonExcludedFromCDS set right?
> ....
>
> Does anyone know whether training is compromised by such incomplete gene models? Do you usually exclude them from the training set?
>
> Oh, and by the way, the best guide to retraining Augustus is here . The official web page isn't bad, but doesn't explain in detail certain things.
>
> Thanks,
> Panos
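The etraining warnings quoted above flag models whose coding sequence lacks a start or a stop codon. Whether such models should be dropped is exactly the question being asked, so the following is only a rough sketch of how they could be identified before training. It assumes the spliced CDS of each model has already been extracted as a plain string on the coding strand, that the standard genetic code applies, and that the stop codon is included in the CDS (the point raised by the `stopCodonExcludedFromCDS` hint in the warning). The function names are hypothetical and are not part of Augustus or MAKER.

```python
from __future__ import annotations

# Hypothetical helpers, not part of Augustus or MAKER.
# Assumption: standard genetic code, stop codon included in the CDS.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def cds_completeness(cds: str) -> str:
    """Return 'complete', 'no_start', 'no_stop', or 'no_start_no_stop'."""
    seq = cds.upper()
    has_start = seq.startswith("ATG")
    has_stop = len(seq) >= 3 and seq[-3:] in STOP_CODONS
    if has_start and has_stop:
        return "complete"
    if has_start:
        return "no_stop"
    if has_stop:
        return "no_start"
    return "no_start_no_stop"


def split_training_set(models: dict[str, str]) -> tuple[list[str], list[str]]:
    """Partition model IDs into (complete, incomplete), keeping input order."""
    complete: list[str] = []
    incomplete: list[str] = []
    for model_id, cds in models.items():
        if cds_completeness(cds) == "complete":
            complete.append(model_id)
        else:
            incomplete.append(model_id)
    return complete, incomplete


if __name__ == "__main__":
    # Toy sequences, invented for illustration only.
    toy = {
        "gene-20.1": "ACGGCTTGA",  # begins with acg, as in the first warning
        "gene-20.2": "ATGGCTGCA",  # no terminal stop codon
        "gene-20.3": "ATGGCTTAA",  # start and stop both present
    }
    complete_ids, incomplete_ids = split_training_set(toy)
    print("complete:", complete_ids)
    print("incomplete:", incomplete_ids)
```

Keeping the two lists separate, rather than deleting the incomplete models outright, makes it easy to train once with and once without them and compare the reported accuracy.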
From panos.ioannidis at gmail.com Tue Mar 24 08:31:04 2015
From: panos.ioannidis at gmail.com (Panos Ioannidis)
Date: Tue, 24 Mar 2015 15:31:04 +0100
Subject: [maker-devel] Augustus retraining
In-Reply-To: <09C2F8D0-7014-45DD-B067-86ABB37F4A68@gmail.com>
References: <09C2F8D0-7014-45DD-B067-86ABB37F4A68@gmail.com>
Message-ID:

Hi Carson,

So you think it's okay to include incomplete gene models when training Augustus?

I'll certainly try the bootstrap method you're suggesting. Even though I did it for SNAP, for some weird reason I forgot it for Augustus :p Do you think, however, that I can get a big improvement in gene-level sensitivity? Currently, I have only 6%...

Thanks,
Panos

From carsonhh at gmail.com Tue Mar 24 08:39:20 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 24 Mar 2015 08:39:20 -0600
Subject: [maker-devel] Augustus retraining
In-Reply-To:
References: <09C2F8D0-7014-45DD-B067-86ABB37F4A68@gmail.com>
Message-ID:

On your first round it is fine.
It gives the predictor enough to work with, then on the second round you use improved models. When you say 6% sensitivity, is that Augustus running on its own? If it's inside of MAKER, that means you are not providing sufficient protein evidence (you need the full proteome of at least two related species). Also, is that the gene-level, exon-level, or nucleotide-level sensitivity? If you are looking at the gene-level sensitivity measure, you only get a match when you perfectly match all transcripts in a gene (models that may not be correct in the first place). This value will rarely go above 10% for any predictor. You need to use the nucleotide-level sensitivity/specificity metrics. The gene- and exon-level metrics are basically meaningless (unless it's Drosophila, which is the only species annotated correctly enough to use them).

--Carson

From panos.ioannidis at gmail.com Tue Mar 24 09:05:54 2015
From: panos.ioannidis at gmail.com (Panos Ioannidis)
Date: Tue, 24 Mar 2015 16:05:54 +0100
Subject: [maker-devel] Augustus retraining
In-Reply-To:
References: <09C2F8D0-7014-45DD-B067-86ABB37F4A68@gmail.com>
Message-ID:

Yes, 6% is gene-level sensitivity. Exon-level is 62% and nucleotide-level is 88%.
I only mentioned gene-level because that's the only metric mentioned on the Augustus web site.

I got these numbers outside of Maker. Actually, I only used Maker to generate the gff files needed to start the training (ran it using only EST evidence and only on a subset of my assembly, using this as a guide).

Now, I've started running the second round of training, as you suggested. Since, however, I don't have data from closely related species, I'm only using Uniref50 as protein evidence.

P

From carsonhh at gmail.com Tue Mar 24 09:38:08 2015
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 24 Mar 2015 09:38:08 -0600
Subject: [maker-devel] Augustus retraining
In-Reply-To:
References: <09C2F8D0-7014-45DD-B067-86ABB37F4A68@gmail.com>
Message-ID: <4EA5A1F6-2950-4D65-A59C-0F3848C86C02@gmail.com>

I'd pick a couple of species that are as closely related as you can find. Proteins will still align over large evolutionary distances, but you need a breadth of protein data that UniRef and even databases like Swiss-Prot won't have (those databases are usually a little too conservative).

The gene-level sensitivity/specificity values make certain assumptions about the models you are comparing with. Unfortunately the only species these assumptions hold true for is Drosophila, and maybe C. elegans to a point. This is the reason you will rarely see species other than those two used for the gene and exon sensitivity/specificity metrics.

Thanks,
Carson

From alicebdennis at gmail.com Thu Mar 26 04:34:26 2015
From: alicebdennis at gmail.com (Alice Dennis)
Date: Thu, 26 Mar 2015 11:34:26 +0100
Subject: [maker-devel] iterative Maker2
In-Reply-To: <1c9847450c4142d3b7e2b81f893773f0@EE-HUB1.ee.emp-eaw.ch>
References: <1FD5809847938F44B92893606806BD53600D845F@EE-MBX1.ee.emp-eaw.ch> <1c9847450c4142d3b7e2b81f893773f0@EE-HUB1.ee.emp-eaw.ch>
Message-ID:

Hello again,

I posted a while ago about a genome I'm running through the Maker2 pipeline. I was concerned because my results were still changing with 3 and 4 iterations.

Following the very useful advice of Carson (below), I've made a few modifications (adding a RepeatModeler run, using a big protein database), but my gene predictions are still changing between the 3rd and 4th iterations. Perhaps this is ok, but these increasing gene lengths make me worry that I haven't built stable models.

Here is the short version of what I've done.

1. Run RepeatModeler, but this only produced 47 sequences in the resulting .fasta... so that seemed a bit small.

2. Run Maker2 using:
- RepeatModeler output + "model_org=all" and "softmask=1" in the Repeat Masking section.
- protein evidence from 2 distantly related species AND all of Uniprot
- ESTs from a different strain of my species (a parasitoid wasp)
- the .hmm from Nasonia, one of the 2 distantly related species whose proteome I also provided as protein evidence
- my assembled genome of 1,509 scaffolds.

3. After this, I did three subsequent rounds of Maker2 (cleverly named Rounds 2, 3 and 4). Each one used the same input, except the Nasonia .hmm was replaced by a SNAP generated .hmm from the previous round. Also, est2genome and protein2genome were changed from 1 to 0 in all runs after the first.

Here are some results:
Round1: 14,647 genes, average length 2,491
Round2: 12,158 genes, average length 3,760
Round3: 13,515 genes, average length 3,090
Round4: 12,169 genes, average length 3,918

This is a bit confusing because the number of genes predicted goes up and down, as do their lengths. I've double-checked the dates of my files, and they are all labeled such that I don't think anything could be swapped.

So my questions are:
Is this an indication that my models are unstable and I shouldn't trust these predictions?
Is the decreasing number of genes, which are also getting longer, perhaps a good thing?
How do I know when to stop if genes keep getting longer?

Thanks very much,
Alice

On Fri, Dec 12, 2014 at 4:41 PM, Carson Holt wrote:
> The gene models are actually produced by SNAP, Augustus, or whatever gene predictor you are using, so if you change the HMM every round, then the models will change too. But I have one concern. You are using a very sparse protein evidence dataset. The protein dataset is very important to MAKER's performance, and for iterative training of the ab initio predictors. Normally after the second iteration, additional training should not be beneficial, but if you are getting wildly different results on 3rd and 4th round, then you probably aren't getting sufficient good models to train with.
>
> For a protein dataset you should be using the entire proteome from a minimum of two related species and perhaps all of UniProt/Swiss-prot to get a broad protein database. Don't use the proteins extracted by CEGMA and HaMSTr. CEGMA can be used to guide the first HMM creation (cegma2zff script that comes with MAKER), but don't give the proteins to MAKER as evidence; also the HaMSTr results will be redundant with the ESTs. You need proteins from related species to look for homology not found in the EST dataset.
>
> Also repeat masking is important for any genome and has a huge effect on ab initio predictor performance. Make sure you run something like RepeatModeler to look for species specific repeats that will not already be in RepBase. Then add those results to the rmlib= option in the maker control files.
>
> Thanks,
> Carson
>
> On Dec 12, 2014, at 7:10 AM, Dennis, Alice wrote:
>
> Hi all,
>
> I am a relatively new user to Maker2, and I'm looking for advice on running many iterations of the same dataset in Maker2.
>
> I have a relatively small genome (~124 MB) from a wasp that is assembled into ~1,500 scaffolds. I have run several iterations of Maker2 by re-generating .hmms in SNAP and feeding them into the next round, and my gene predictions keep increasing (in number and in size). The only thing that changes at each round is the .hmm.
> This is the evidence that I give:
> - de novo assembled ESTs from a different strain of the same species (70,000 contigs; I am currently working on improving this assembly with the hope that this will be helpful here)
> - 610 proteins extracted from the genome scaffolds using CEGMA and HaMSTr
>
> For my 1st iteration, I used the Nasonia .hmm from SNAP, and the est2genome/protein2genome option.
>
> For the 2nd, 3rd and 4th rounds I have used .hmms generated from the previous round, all without the est2genome/protein2genome option. All other files are the same as in the original run.
>
> As I understand it, after the second round, nothing should change in Maker2. But the differences are obvious between runs. Some entirely new exons are annotated. For example, just counting "exon" in the .gff file gives me 73,000 after the third iteration and 96,000 after the fourth! Actually the biggest leap in this number is between the third and fourth round. I can also see that many features are longer when I look at the files in Geneious.
>
> Is this sort of change possible after the second round of Maker2? Is there something I have done wrong in my runs, or am I understanding this output incorrectly?
>
> Thank you,
> Alice

--
Alice Dennis
alicebdennis at gmail.com
Postdoctoral Researcher
Institute for Integrative Biology, ETH Zürich & EAWAG
Überlandstrasse 133
P.O. Box 611
8600 Dübendorf, Switzerland
https://adennis5.wordpress.com/

From michael.s.campbell1 at gmail.com Thu Mar 26 09:50:41 2015
From: michael.s.campbell1 at gmail.com (Michael Campbell)
Date: Thu, 26 Mar 2015 09:50:41 -0600
Subject: [maker-devel] iterative Maker2
In-Reply-To:
References: <1FD5809847938F44B92893606806BD53600D845F@EE-MBX1.ee.emp-eaw.ch> <1c9847450c4142d3b7e2b81f893773f0@EE-HUB1.ee.emp-eaw.ch>
Message-ID:

Hi Alice,

In my experience, fewer, longer genes is generally a good thing (and very normal), resulting from the merging of split models and the extension of incomplete models. I find it helpful to load the annotations and evidence into a browser to get a visual idea of what is happening.

Mike

On Thu, Mar 26, 2015 at 4:34 AM, Alice Dennis wrote:
> Hello again,
>
> I posted a while ago about a genome I'm running through the Maker2 pipeline. I was concerned because my results were still changing with 3 and 4 iterations.
>
> Following the very useful advice of Carson (below), I've made a few
> modifications (adding a RepeatModeler run, using a big protein
> database), but my gene predictions are still changing between the 3rd
> and 4th iterations. Perhaps this is OK, but these increasing gene
> lengths make me worry that I haven't built stable models.
>
> Here is the short version of what I've done.
> 1. Run RepeatModeler, but this only produced 47 sequences in the
> resulting .fasta... so that seemed a bit small.
>
> 2. Run Maker2 using:
> - RepeatModeler output + "model_org=all" and "softmask=1" in the
> Repeat Masking section.
> - protein evidence from 2 distantly related species AND all of UniProt
> - ESTs from a different strain of my species (a parasitoid wasp)
> - the .hmm from Nasonia, one of the 2 distantly related species whose
> proteome I also provided as protein evidence
> - my assembled genome of 1,509 scaffolds.
>
> 3. After this, I did three subsequent rounds of Maker2 (cleverly named
> Rounds 2, 3 and 4). Each one used the same input, except the Nasonia
> .hmm was replaced by a SNAP-generated .hmm from the previous round.
> Also, est2genome and protein2genome were changed from 1 to 0 in all
> runs after the first.
>
> Here are some results:
> Round1: 14,647 genes, average length 2,491
> Round2: 12,158 genes, average length 3,760
> Round3: 13,515 genes, average length 3,090
> Round4: 12,169 genes, average length 3,918
>
> This is a bit confusing because the number of genes predicted goes up
> and down, as do their lengths. I've double-checked the dates of my
> files, and they are all labeled such that I don't think anything could
> be swapped.
>
> So my questions are:
> Is this an indication that my models are unstable and I shouldn't
> trust these predictions?
> Is a decreasing number of genes, while they also get longer, perhaps a
> good thing?
> How do I know when to stop if genes keep getting longer?
>
>
> Thanks very much,
> Alice
>

--
Michael Campbell MS, RD.
Doctoral Candidate
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330
ph:585-3543

From rens.holmer at wur.nl Mon Mar 30 00:12:20 2015
From: rens.holmer at wur.nl (Holmer, Rens)
Date: Mon, 30 Mar 2015 06:12:20 +0000
Subject: [maker-devel] Incorporating cufflinks in maker
Message-ID: <42E0168F-C672-4B7F-97D4-98442B825BF9@wur.nl>

Hi maker team,

I am currently working on a project where we want to incorporate quite
a lot of RNA-seq into our annotation. Currently I see two options:
- Provide the cufflinks output as EST-gff
- Process cufflinks output with TransDecoder (find ORFs, annotate UTR,
  etc) and provide this as either pred_gff or model_gff

What would you suggest, and what would be the required formatting for
both options?

Thanks in advance,
Rens Holmer

From goutham.atla at gmail.com Fri Mar 27 23:37:08 2015
From: goutham.atla at gmail.com (Goutham atla)
Date: Sat, 28 Mar 2015 11:07:08 +0530
Subject: [maker-devel] Annotating Cufflinks GTF with Maker
Message-ID:

Dear All,

I have a draft genome for my organism of interest and around 150G of
100 bp paired-end RNA-Seq data from different conditions. This organism
has Ensembl annotations, but very few. My goal is differential splicing
analysis between two conditions. For this I need good annotations in
GTF format at the isoform level. I am interested in using the Splicing
Analysis Kit.

For now, I have aligned one sample to the genome using TopHat2 and then
used Cufflinks to generate a de novo GTF file. In either case I have
not used the available GTF with its very few annotations.
The GTF file generated by Cufflinks should be annotated to know the
function of each transcript. So I am interested in adding annotations
to the GTF file generated by Cufflinks. What is the best way of doing
it? Or is there any better way of getting a GTF file, like that of
Ensembl, from my data? I have looked at Trinotate, but it's more about
functional annotation and expression studies.

Regards,
--
Goutham Atla

From avhoeck at SCKCEN.BE Mon Mar 30 10:11:16 2015
From: avhoeck at SCKCEN.BE (Van Hoeck Arne)
Date: Mon, 30 Mar 2015 16:11:16 +0000
Subject: [maker-devel] comments on Incorporating cufflinks in maker
Message-ID: <9BCA01D5BDC2AF46822CA182B4FBD0DF11826E1A@MAILSRV3.sck.be>

Dear Rens and Carson,

I would like to comment on Rens' question about adding transcript data
to the annotation. Since I have no access to Google Groups, I am trying
to contact you both by mail with, hopefully, the correct addresses.

I have tested MAKER-P with multiple parameters and optimized two gene
prediction tools (SNAP and Augustus). Both give similar results, with
which MAKER finds around 17,000 gene annotations. I also have RNA-seq
samples, and as a test case I used Cufflinks output processed with
TransDecoder to select gene annotations. With this approach, I end up
with 25,000 gene annotations. More or less all the genes that MAKER
selected were also selected by the TransDecoder approach.

So is it wise to include the information on the 8,000 missing genes,
based on optimized gene models? There will probably be some false
positives among these 8,000 missing genes, but with good criteria and
models, it may be possible for MAKER to find more gene annotations.

Best regards
Arne
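
Arne's comparison above (about 17,000 MAKER genes versus 25,000 from
Cufflinks plus TransDecoder) comes down to asking which genes in one
GFF3 file overlap nothing in the other. A minimal Python sketch of that
check follows. It is not part of MAKER or TransDecoder: the function
names read_genes and unmatched are hypothetical, and it assumes that
both files carry "gene" features in column 3 with an ID attribute in
column 9. Strand is ignored, so a gene on the opposite strand still
counts as an overlap.

```python
from collections import defaultdict


def read_genes(gff_lines):
    """Collect (start, end, ID) per sequence for 'gene' rows of GFF3 text.

    Hypothetical helper: assumes tab-separated GFF3 with 9 columns.
    """
    genes = defaultdict(list)
    for line in gff_lines:
        # An embedded FASTA section ends the annotation part of the file.
        if line.startswith("##FASTA"):
            break
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        genes[cols[0]].append((int(cols[3]), int(cols[4]), attrs.get("ID", "?")))
    return genes


def unmatched(reference, candidates):
    """Return IDs of candidate genes that overlap no reference gene
    on the same sequence (strand is ignored in this sketch)."""
    missing = []
    for seqid, cand in candidates.items():
        ref = reference.get(seqid, [])
        for start, end, gene_id in cand:
            # Two closed intervals overlap when each starts before the
            # other ends.
            if not any(start <= r_end and r_start <= end
                       for r_start, r_end, _ in ref):
                missing.append(gene_id)
    return missing


# Example use with two files (paths are placeholders):
#   with open("maker.gff") as a, open("transdecoder.gff3") as b:
#       extra = unmatched(read_genes(a), read_genes(b))
```

The linear scan per gene is quadratic in the worst case, which is
usually tolerable at this scale; an interval tree or sorted sweep would
be the natural replacement for larger gene sets. The resulting ID list
is only a starting point for inspecting the extra models in a browser,
not a judgment on whether they are real genes.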