From xvazquezc at gmail.com  Mon Oct  1 19:00:43 2018
From: xvazquezc at gmail.com (Xabier Vázquez-Campos)
Date: Tue, 2 Oct 2018 10:00:43 +1000
Subject: [maker-devel] Help debugging a MAKER result
In-Reply-To:
References:
Message-ID:

Hi Lior,

Without going into a lot of detail: a good model covering the repeats in
your genome is extremely important, especially in genomes with a lot of
repeats. If the repeat library does not have appropriate coverage,
anything based on the masked genome will be affected.

The evidence you pass into Augustus to generate the gene model can have a
huge impact. Aside from the repeats, BUSCO-generated gene models can
under-predict:
https://groups.google.com/forum/?hl=en-GB#!topic/maker-devel/ocnDG4nq1A8
We have also seen in our lab that the gene models generated by Augustus can
be very different if you provide a haploid assembly vs haploid + alternate
contigs vs diploid. In general, a purely haploid assembly generates a less
biased model because it contains fewer duplicated conserved genes, which
would otherwise skew the gene model towards them (at least in BUSCO-based
models, but this should extend to any Augustus model).

Note that in the end the generated annotation is just a model/hypothesis
and may require more than a bit of curation, usually more as genome
complexity increases.

Cheers,
Xabi

On Tue, 2 Oct 2018 at 05:23, Lior Glick wrote:

> Hi MAKER users,
> I am new to Maker and have just finished running my first annotations.
> Although the results make sense in general, I have reasons to suspect some
> gene models are wrong and would like your help in understanding and
> optimizing the results.
> My research project involves the annotation of multiple tomato varieties
> (individuals) which are a bit different from the published reference
> genome. To this end, I created de novo assemblies of these genomes and also
> generated an evidence set to be used as input for Maker. The evidence
> consists of a large set of transcripts from various tomato varieties and
> conditions, as well as full protein sets from 6 plant species, including
> the proteins derived from the annotation of the reference, called ITAG.
> For an initial QA, I tried annotating the reference genome using my
> evidence data and Augustus as gene predictor. This should allow me to
> compare my result to the ITAG annotation, which I assume to be the
> "correct" answer, and see how well I'm doing. I should mention that the
> ITAG annotation was also created using Maker, followed by manual curation.
> I started by comparing the protein sets from my result and the ITAG set.
> Specifically, I ran an all-vs-all blast and took the top hits. I discovered
> that only about 70% of the ITAG proteins are covered by a protein from my
> result with a high quality alignment (e-value < 10e-5, coverage > 90%). I
> further investigated by running BUSCO on both protein sets and looking at
> BUSCOs found in ITAG but missing in my result. Attached is a screenshot
> from a genome browser where you can see such a case. The top track is the
> ITAG gene model; below it is my result. The third track is the protein
> evidence alignments (i.e. blastx and protein2genome features), and the
> bottom track is the masked repeats.
> As you can see, there seem to be two issues with my result:
> 1. The two genes in ITAG were fused into one. I guess this is a difficult
> case as the genes are really close together.
> 2. The last (3') CDS of the ITAG gene was predicted to be the 3' UTR in my
> result. This is in fact the reason I ended up with a truncated protein and
> a missing BUSCO.
> This is a bit surprising to me, since there seems to be quite a lot of
> protein evidence supporting this region as a CDS. Can you help me figure
> out why this is the result? Could it be due to the small repeats detected
> in this region?
> Any ideas on how my result can be improved without manual curation?
>
> Many thanks!
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

-- 
Xabier Vázquez-Campos, PhD
Research Associate
NSW Systems Biology Initiative
School of Biotechnology and Biomolecular Sciences
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

From liorglic at mail.tau.ac.il  Tue Oct  2 01:50:32 2018
From: liorglic at mail.tau.ac.il (Lior Glick)
Date: Tue, 2 Oct 2018 09:50:32 +0300
Subject: [maker-devel] Help debugging a MAKER result
In-Reply-To:
References:
Message-ID:

Hi Xabier, and thanks for your reply.
I forgot to mention it, but I used the annotated repeats derived from the
ITAG annotation as the repeat library, so I expect these to be quite
appropriate. I guess my question is about the way Maker makes decisions:
is the fact that some repeats (simple repeats in this case) were predicted
enough to change a CDS into a UTR, despite sufficient protein evidence?
I did not train Augustus myself; rather, I used the species (tomato)
profile that comes with the Augustus release. Does that make sense?
As for the haploid/diploid issue: fortunately I don't have to deal with
that, since cultivated tomato varieties are repeatedly selfed, so they are
(almost) completely homozygous.

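The coverage check described in the original question (all-vs-all BLAST, top hit per ITAG protein, e-value and coverage thresholds) could be sketched roughly as follows. This is only an illustration, not the script used in the thread: the function name, the example IDs, and the tabular column layout (`-outfmt "6 qseqid sseqid evalue bitscore qlen qstart qend"`) are assumptions, and "top hit" is taken here to mean the hit with the highest bit score.

```python
import csv
import io


def covered_queries(blast_tsv, max_evalue=10e-5, min_coverage=0.9):
    """Return the query IDs whose best hit passes both thresholds.

    Assumed columns: qseqid sseqid evalue bitscore qlen qstart qend
    (hypothetical layout chosen for this sketch).
    """
    best = {}  # query id -> (bitscore, evalue, query coverage)
    for row in csv.reader(blast_tsv, delimiter="\t"):
        if not row:
            continue
        qseqid, _sseqid, evalue, bitscore, qlen, qstart, qend = row
        # Fraction of the query protein spanned by this alignment.
        coverage = (int(qend) - int(qstart) + 1) / int(qlen)
        hit = (float(bitscore), float(evalue), coverage)
        # Keep only the highest-scoring hit for each query.
        if qseqid not in best or hit[0] > best[qseqid][0]:
            best[qseqid] = hit
    return {
        q for q, (_score, ev, cov) in best.items()
        if ev <= max_evalue and cov >= min_coverage
    }


# Made-up example: two ITAG proteins searched against the new protein set.
example = io.StringIO(
    "itag1\tmine1\t1e-50\t300\t200\t1\t195\n"
    "itag1\tmine2\t1e-10\t80\t200\t1\t60\n"
    "itag2\tmine3\t1e-20\t120\t400\t1\t150\n"
)
covered = covered_queries(example)
fraction_covered = len(covered) / 2  # 2 = total number of ITAG proteins here
```

Dividing by the total number of reference proteins, rather than by the number of proteins with any hit, keeps proteins with no hit at all in the denominator.
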
From xvazquezc at gmail.com  Tue Oct  2 23:39:40 2018
From: xvazquezc at gmail.com (Xabier Vázquez-Campos)
Date: Wed, 3 Oct 2018 14:39:40 +1000
Subject: [maker-devel] Help debugging a MAKER result
In-Reply-To:
References:
Message-ID:

Yeah, tomato should be rather well annotated.

I would double-check how good the tomato genome was at the time the gene
model was created. Also, creating a new Augustus model based on the first
prediction run might improve things.

Tomato is in Repbase. To be sure you are not missing anything, I would
still run the advanced repeat library protocol, if it isn't computationally
prohibitive.

I don't know how good SNAP is for plant genomes, so it could be worth
trying on top of the Augustus predictions.

On top of this, I'd take a look at reference-based annotation tools like
RATT. This would annotate all the regions shared with the reference, and
you would then curate only the regions that cannot be annotated from the
reference, using your Maker annotation.

From carsonhh at gmail.com  Thu Oct  4 18:52:47 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 4 Oct 2018 17:52:47 -0600
Subject: [maker-devel] Help debugging a MAKER result
In-Reply-To:
References:
Message-ID: <597F5B29-71BF-409D-B5E2-E1D4611953C3@gmail.com>

I'd just like to add info on how MAKER builds predictions. MAKER itself
does not generate models. In your case, Augustus produces the models.
Augustus will run twice: once on its own (this will be on a repeat-masked
version of the assembly), and once again where MAKER provides it with a
hints file as part of the command line used to run Augustus. The hints file
is generated from the evidence alignments you provided to MAKER. The hints
usually get Augustus to perform a little better than it does with training
alone on a masked assembly.

Under-masking or over-masking the assembly can both confound Augustus.
MAKER hard-masks complex repeats in the assembly (turns them from ATCG into
N's), and soft-masks simple repeats (turns ATCG into lower case atcg). The
lower-case "soft-masking" affects BLAST alignment but not Augustus
predictions (Augustus ignores it).
MAKER also removes the hard-masking when it runs Augustus with the hints
file. This is done because we've constrained Augustus to a smaller padded
evidence cluster at the locus, and Augustus can no longer see the whole
assembly.

If you want to explore how masking affects the models, you can set
unmask=0. Then Augustus will run 3 times (one extra run on the unmasked
assembly). You can then look at contigs in a browser to see how the masked
vs unmasked models compare to each other.

--Carson

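The difference between hard-masking and soft-masking described in Carson's message can be shown with a toy sketch. This is not MAKER's own code: the function name and the 0-based, half-open interval convention are assumptions made only for this example.

```python
def mask_sequence(seq, complex_repeats, simple_repeats):
    """Hard-mask complex repeats to N and soft-mask simple repeats to lower case.

    Intervals are (start, end) pairs, 0-based and half-open (a convention
    assumed for this sketch).
    """
    bases = list(seq.upper())
    # Soft-masking: keep the bases but lower-case them.
    for start, end in simple_repeats:
        for i in range(start, end):
            bases[i] = bases[i].lower()
    # Hard-masking: replace the bases with N, so the sequence is lost.
    for start, end in complex_repeats:
        for i in range(start, end):
            bases[i] = "N"
    return "".join(bases)


masked = mask_sequence(
    "ATCGATCGATCGATCG",
    complex_repeats=[(0, 4)],
    simple_repeats=[(8, 12)],
)
print(masked)  # NNNNATCGatcgATCG
```

A tool that ignores case still sees the soft-masked bases as ordinary sequence, whereas the hard-masked stretch carries no sequence information at all.
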
> _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > > -- > Xabier V?zquez-Campos, PhD > Research Associate > NSW Systems Biology Initiative > School of Biotechnology and Biomolecular Sciences > The University of New South Wales > Sydney NSW 2052 AUSTRALIA > > > -- > Xabier V?zquez-Campos, PhD > Research Associate > NSW Systems Biology Initiative > School of Biotechnology and Biomolecular Sciences > The University of New South Wales > Sydney NSW 2052 AUSTRALIA > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From myandell at genetics.utah.edu Thu Oct 4 19:05:04 2018 From: myandell at genetics.utah.edu (Mark Yandell) Date: Fri, 5 Oct 2018 00:05:04 +0000 Subject: [maker-devel] Help debugging a MAKER result In-Reply-To: <597F5B29-71BF-409D-B5E2-E1D4611953C3@gmail.com> References: <597F5B29-71BF-409D-B5E2-E1D4611953C3@gmail.com> Message-ID: Cheers! From: maker-devel on behalf of Carson Holt Date: Thursday, October 4, 2018 at 5:52 PM To: Lior Glick Cc: Maker Mailing List Subject: Re: [maker-devel] Help debugging a MAKER result I?d just like to add info on how MAKER builds predictions. MAKER itself does not generate models. In your case, Augustus produces the models. Augustus will run twice. Once on it?s own (this will be on a repeat masked version of the assembly), and once again where MAKER provides it with a hints file as part of the command line used to run Augustus. The hints file is generated from the evidence alignments you provided to MAKER. The hints usually get Augustus to perform a little better than it does with training alone on a masked assembly. 
Under-masking or overmasking the assembly can both confound Augustus. MAKER hard masks complex repeats in the assembly (turns them from ATCG into N?s), and soft-masks simple repeats (turns ATCG into lower case actg). The lower case ?soft-masking? affects BLAST alignment but not Augustus predictions (Augustus ignores it). MAKER also removes the hard-masking when it runs Augustus with the hints file. This is done because we?ve constrained Augustus to a smaller padded evidence cluster at the locus, and Augustus can no longer see the whole assembly. If you want to explore how masking affects the models, you can set unmask=0. Then Augustus will run 3 times (one extra run on the unmasked assembly). You can then look at contigs in a browser to see how the masked vs unmasked models compare to each other. ?Carson On Oct 2, 2018, at 10:39 PM, Xabier V?zquez-Campos > wrote: Yeah, tomato should be rather well annotated. I would double check how good was the tomato genome at the time of the creation of the gene model. Also, creating a new Augustus model based on the first prediction run might improve things You have tomato on repbase. To be sure you are not missing anything, I would still run the advanced repeat library protocol, if it isn't computationally prohibitive. I don't know how good is SNAP for plant genomes, so it could be worth to try on top of the Augustus predictions. On top of this, I'd take a look into reference-based annotation tools like RATT. This would annotate all the common regions with the reference and then curate only on the regions that cannot be annotated from the reference using your Maker annotation On Tue, 2 Oct 2018 at 16:50, Lior Glick > wrote: Hi Xabier, and thanks for your reply. I forgot to mention it, but I used the annotated repeats derived from the ITAG annotation as repeats library, so I expect these to be quite appropriate. 
I guess my question is regarding the way Maker makes decisions: Is the fact that some repeats (simple repeats in this case) were predicted is enough to change a CDS into a UTR, despite sufficient protein evidence? I did not train Augustus myself, rather I used the species (tomato) profile that comes with the Augustus release. Does that make sense? As for the haploid/diploid issue - fortunately I don't have to deal with that since cultivated tomato varieties are repeatedly selfed, so they are (almost) completely homozygous. ??????? ??? ??, 2 ????? 2018 ?-3:01 ??? ?Xabier V?zquez-Campos? ?>: Hi Lior, without getting in a lot of detail a good model covering the repeats in your genome is extremely important, specially in genomes with a lot of repeats. If the repeat library does not have an appropriate coverage, anything based on the masked genome will be affected The evidence you pass into Augustus to generate the gene model can have a huge impact. Aside of the repeats, BUSCO-generated gene models can under-predict https://groups.google.com/forum/?hl=en-GB#!topic/maker-devel/ocnDG4nq1A8 And we have seen in our lab that the gene models generated by Augustus can be very different if you provide an haploid assembly vs haploid + alternate contigs vs diploid. In general, a purely haploid assembly generates a less biased model as it has lower number of duplicated conserved genes present, that will unbalance the gene model towards them. (at least in BUSCO-based models, but it should be extensible to any Augustus model) Note that in the end the generated annotation is just a model/hypothesis and may require more than a bit of curation... usually increasing with more complex genomes. Cheers, Xabi On Tue, 2 Oct 2018 at 05:23, Lior Glick > wrote: Hi MAKER users, I am new to Maker and had just finished running my first annotations. 
Although the results make sense in general, I have reasons to suspect some gene models are wrong and would like your help in understanding and optimizing the results. My research project involves the annotation of multiple tomato varieties (individuals) which are a bit different from the published reference genome. To this end, I created de-novo assemblies of these genomes and also generated an evidence set to be used as input for Maker. Evidence consist of a large set of transcripts from various tomato varieties and conditions, as well as full protein sets from 6 plant species, including the proteins derived from the annotation of the reference - called ITAG. For an initial QA, I tried annotating the reference genome using my evidence data and Augustus as gene predictor. This should allow me to compare my result to the ITAG annotation, which I assume to be the "correct" answer, and see how well I'm doing. I should mention that ITAG annotation was also created using Maker, followed by manual curation. I started by comparing the protein sets from my result and the ITAT set. Specifically, I ran an all-vs-all blast and took the top hits. I discovered that only about 70% of the ITAG proteins are covered by a protein from my result with a high quality alignment (evalue > 10e-5, coverage > 90%). I further investigated by running BUSCO on both protein sets and looking at BUSCOs found in ITAG but missing in my result. Attached is a screenshot from a genome browser where you can see such a case. Top track is the ITAG gene model, below is my result. Third track is the protein evidence alignments (i.e blastx and protein2genome features), and bottom track are masked repeats. As you can see, there seems to be two issues with my result: 1. The two genes in ITAG were fused into one. I guess this is a difficult case as the genes are really close together. 2. The last (3') CDS of the ITAG gene was predicted to be the 3' UTR in my result. 
This is in fact the reason I ended up with a truncated protein and a missing BUSCO. This is a bit surprising to me, since there seems to be quite a lot of protein evidence supporting this region as a CDS. Can you help me figure out why is the result so? Could it be due to the small repeats detected in this region? Any ideas on how my result can be improved without manual curation? Many thanks! _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.com http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -- Xabier V?zquez-Campos, PhD Research Associate NSW Systems Biology Initiative School of Biotechnology and Biomolecular Sciences The University of New South Wales Sydney NSW 2052 AUSTRALIA -- Xabier V?zquez-Campos, PhD Research Associate NSW Systems Biology Initiative School of Biotechnology and Biomolecular Sciences The University of New South Wales Sydney NSW 2052 AUSTRALIA _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.com http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Thu Oct 4 19:09:58 2018 From: carsonhh at gmail.com (Carson Holt) Date: Thu, 4 Oct 2018 18:09:58 -0600 Subject: [maker-devel] Help debugging a MAKER result In-Reply-To: <597F5B29-71BF-409D-B5E2-E1D4611953C3@gmail.com> References: <597F5B29-71BF-409D-B5E2-E1D4611953C3@gmail.com> Message-ID: One correction. I meant to say set unmask=1. ?Carson > On Oct 4, 2018, at 5:52 PM, Carson Holt wrote: > > I?d just like to add info on how MAKER builds predictions. MAKER itself does not generate models. In your case, Augustus produces the models. Augustus will run twice. Once on it?s own (this will be on a repeat masked version of the assembly), and once again where MAKER provides it with a hints file as part of the command line used to run Augustus. 
> The hints file is generated from the evidence alignments you provided to MAKER. The hints usually get Augustus to perform a little better than it does with training alone on a masked assembly.
>
> Under-masking or over-masking the assembly can both confound Augustus. MAKER hard-masks complex repeats in the assembly (turns them from ATCG into Ns), and soft-masks simple repeats (turns ATCG into lower case atcg). The lower case "soft-masking" affects BLAST alignment but not Augustus predictions (Augustus ignores it). MAKER also removes the hard-masking when it runs Augustus with the hints file. This is done because we've constrained Augustus to a smaller padded evidence cluster at the locus, and Augustus can no longer see the whole assembly.
>
> If you want to explore how masking affects the models, you can set unmask=0. Then Augustus will run 3 times (one extra run on the unmasked assembly). You can then look at contigs in a browser to see how the masked vs unmasked models compare to each other.
>
> --Carson
>
>> On Oct 2, 2018, at 10:39 PM, Xabier Vázquez-Campos wrote:
>>
>> Yeah, tomato should be rather well annotated.
>>
>> I would double check how good the tomato genome was at the time the gene model was created. Also, creating a new Augustus model based on the first prediction run might improve things.
>>
>> You have tomato in Repbase. To be sure you are not missing anything, I would still run the advanced repeat library protocol, if it isn't computationally prohibitive.
>>
>> I don't know how good SNAP is for plant genomes, so it could be worth trying on top of the Augustus predictions.
>>
>> On top of this, I'd take a look at reference-based annotation tools like RATT. This would annotate all the regions in common with the reference, and then you would curate only the regions that cannot be annotated from the reference, using your MAKER annotation.
>>
>> On Tue, 2 Oct 2018 at 16:50, Lior Glick wrote:
>> Hi Xabier, and thanks for your reply.
>> I forgot to mention it, but I used the annotated repeats derived from the ITAG annotation as the repeat library, so I expect these to be quite appropriate. I guess my question is about the way MAKER makes decisions: is the fact that some repeats (simple repeats in this case) were predicted enough to change a CDS into a UTR, despite sufficient protein evidence?
>> I did not train Augustus myself; rather, I used the species (tomato) profile that comes with the Augustus release. Does that make sense?
>> As for the haploid/diploid issue, fortunately I don't have to deal with that, since cultivated tomato varieties are repeatedly selfed, so they are (almost) completely homozygous.

From liorglic at mail.tau.ac.il Fri Oct 5 01:51:41 2018
From: liorglic at mail.tau.ac.il (Lior Glick)
Date: Fri, 5 Oct 2018 09:51:41 +0300
Subject: [maker-devel] Help debugging a MAKER result
References: <597F5B29-71BF-409D-B5E2-E1D4611953C3@gmail.com>

Thank you both for your helpful ideas. I'm going to give them a try and see how this affects my results. Will update when I have them.
Cheers indeed.
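The hard- and soft-masking that Carson describes earlier in this thread can be inspected directly on a masked assembly. The following is a minimal Python sketch, not part of MAKER: the function names `read_fasta` and `masking_summary` are made up for illustration, and it assumes hard-masked bases appear as N and soft-masked bases as lower case letters, as described above.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            name, chunks = (parts[0] if parts else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def masking_summary(sequence):
    """Count hard-masked (N), soft-masked (lower case) and unmasked bases.

    Assumption: a lower case 'n' is counted as hard-masked, not soft-masked.
    """
    hard = sum(1 for base in sequence if base in "Nn")
    soft = sum(1 for base in sequence if base.islower() and base != "n")
    total = len(sequence)
    return {"hard": hard, "soft": soft,
            "unmasked": total - hard - soft, "total": total}


if __name__ == "__main__":
    example = [">contig1 example", "ATCGNNNNatcgAT"]
    for contig, seq in read_fasta(example):
        print(contig, masking_summary(seq))
```

Running this on the region around a suspicious gene model, rather than on the whole assembly, may help judge whether the locus is heavily masked before comparing masked and unmasked Augustus runs in a browser.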
From carsonhh at gmail.com Fri Oct 5 15:37:34 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 5 Oct 2018 14:37:34 -0600
Subject: [maker-devel] Segfault with OpenMPI
Message-ID: <8D151E3B-353F-4FD5-94DB-95C1125A8176@gmail.com>

I tried setting this up but I ran into a number of issues.

First, RepeatMasker is not being installed correctly. The configuration step (the ./configure script run during RepeatMasker setup) should create these files ->

RepeatMasker.lib
RepeatMasker.lib.nhr
RepeatMasker.lib.nin
RepeatMasker.lib.nsq
RepeatMaskerLib.embl

But they do not exist in the share directory.

Also, MAKER needs access to the te_proteins file in .../maker/data, and because you have rearranged MAKER's structure it can't find it.

Then for the segmentation fault: I have seen this a handful of times in the past where users install their own version of perl, rather than using the system perl, together with their own install of OpenMPI. The issue is some series of flags either in OpenMPI or perl (I'm not sure which).
But one way around it is to disable the interpreter threads option when compiling and installing perl for yourself. Most system perl installs have interpreter threads enabled, so I'm not sure why some self-installs generate this segfault and never the system perl. Interestingly, interpreter threads are turned off by default when you install perl manually, as they are "officially discouraged". You actually have to enable them during the self-install process, and conda is enabling them on the manual install to match most system perls.

Another workaround is to not use OpenMPI. Try MPICH3.

--Carson

> On Sep 25, 2018, at 6:10 AM, Anthony Bretaudeau wrote:
>
> Hi,
>
> I've worked on the Bioconda recipe for Maker (https://github.com/bioconda/bioconda-recipes/tree/master/recipes/maker/). It works well, except when using it in MPI mode. I get this segfault error:
>
> STATUS: Processing and indexing input FASTA files...
> [cl1n022:06306] *** Process received signal ***
> [cl1n022:06306] Signal: Segmentation fault (11)
> [cl1n022:06306] Signal code: Address not mapped (1)
> [cl1n022:06306] Failing at address: 0x514
> [cl1n022:06306] [ 0] /lib64/libpthread.so.0(+0xf6d0)[0x2b9ce51026d0]
> [cl1n022:06306] [ 1] /local/miniconda3/envs/maker-2.31.10/bin/perl(Perl_csighandler+0x1e)[0x4aad4e]
> [cl1n022:06306] [ 2] /lib64/libpthread.so.0(+0xf6d0)[0x2b9ce51026d0]
> [cl1n022:06306] [ 3] /lib64/libc.so.6(__poll+0x2d)[0x2b9ce5f5cf0d]
> [cl1n022:06306] [ 4] /local/miniconda3/envs/maker-2.31.10/perl/lib/auto/Parallel/Application/MPI/../../../../../../lib/./libopen-pal.so.40(+0x869e5)[0x2b9cf05859e5]
> [cl1n022:06306] [ 5] /local/miniconda3/envs/maker-2.31.10/perl/lib/auto/Parallel/Application/MPI/../../../../../../lib/./libopen-pal.so.40(opal_libevent2022_event_base_loop+0x242)[0x2b9cf057a73a]
> [cl1n022:06306] [ 6] /local/miniconda3/envs/maker-2.31.10/perl/lib/auto/Parallel/Application/MPI/../../../../../../lib/./libopen-pal.so.40(+0x384de)[0x2b9cf05374de]
> [cl1n022:06306] [ 7] /lib64/libpthread.so.0(+0x7e25)[0x2b9ce50fae25]
> [cl1n022:06306] [ 8] /lib64/libc.so.6(clone+0x6d)[0x2b9ce5f67bad]
> [cl1n022:06306] *** End of error message ***
> SIGTERM received
> SIGTERM received
>
> As mentioned in older posts, I've tried adding the LD_PRELOAD variable, or running mpirun with the "-mca btl ^openib" option, but it didn't help.
>
> As this happens with the Bioconda package, I guess it should be pretty reproducible on other setups.
>
> Bioconda's Maker package uses version 5.26.2 of Perl and version 3.1.2 of OpenMPI, and the OpenMPI recipe is on https://github.com/conda-forge/openmpi-feedstock/tree/master/recipe
>
> Any help would be highly appreciated!
>
> Anthony Bretaudeau

From carson.holt at genetics.utah.edu Mon Oct 8 11:34:22 2018
From: carson.holt at genetics.utah.edu (Carson Hinton Holt)
Date: Mon, 8 Oct 2018 16:34:22 +0000
Subject: [maker-devel] maker problem
References: <189553EC-9D1F-4C2F-8672-3562C8A4A088@oregonstate.edu> <177AD833-BA97-40CD-B500-1AD4531DE41A@gmail.com>
Message-ID: <258E6D0D-6A34-42E2-91F3-F7693ED42E7C@genetics..utah.edu>

GFF3 should have the assembly fasta at the bottom. That is part of the format. Please familiarize yourself with GFF3 here -> https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
Particularly look at the different kinds of expected features (for example, gene/mRNA/exon/CDS gene models vs match/match_part evidence alignments).

Also you need to familiarize yourself with the MAKER documentation, and perhaps follow one of the step-by-step tutorials in the MAKER wiki (http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Main_Page). The 2014 tutorial has a video you can follow along with.
Output files are described in the documentation and the wiki. Particularly look at the necessary gff3_merge and fasta_merge scripts described in the wiki with multiple examples. Individual contigs will have results like so ->

contig-dpp-500-500.gff
contig-dpp-500-500.maker.proteins.fasta
contig-dpp-500-500.maker.transcripts.fasta

The merge scripts will collect all the individual contig results into merged files.

Example datasets for all of the wiki tutorials are included in the .../maker/data directory as well as the .../maker/MWAS/data/ directory (you can use them to follow along with the wiki pages).

If you follow the tutorial steps for training SNAP on a new genome and you get empty training files, then the issue is the evidence training sets you gave (example from the e-mail list archive) -> https://groups.google.com/forum/#!searchin/maker-devel/maker2zff%7Csort:date/maker-devel/TculOM5oxl4/UWENIGN7EQAJ

You can also browse through the archive for more info on training SNAP and Augustus.

--Carson

> On Oct 8, 2018, at 10:12 AM, Gupta, Parul wrote:
>
> Hi Carson,
>
> As per your suggestion, I turned on est2genome=1 and protein2genome=1, but similar results are generated. The gff of each scaffold has fasta (transcripts) sequence at the end, instead of transcripts.fasta and protein.fasta being generated separately. I don't know how to use such gffs for further processing such as training SNAP (for gene prediction). I need your suggestion.
>
> Is there an option to provide trained data from Augustus (generated from standalone Augustus rather than from maker) instead of an Augustus species in maker_opts.ctl?
>
> Thanks,
> Parul
>
>> On Oct 4, 2018, at 6:43 PM, Gupta, Parul wrote:
>>
>> Thank you Carson.
>>
>>> On Oct 4, 2018, at 3:11 PM, Carson Holt wrote:
>>>
>>> You must turn on at least 1 prediction method. It can be est2genome=1, protein2genome=1, or a species file to run SNAP/Augustus. The first two options are for building models to train with.
>>> If you don't provide a prediction method, MAKER will align evidence, but you won't get any gene models.
>>>
>>> Example: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_WGS_Assembly_and_Annotation_Winter_School_2018#Training_ab_initio_Gene_Predictors
>>>
>>> --Carson
>>>
>>>> On Oct 1, 2018, at 1:05 PM, Gupta, Parul wrote:
>>>>
>>>> Hi Carson,
>>>>
>>>> I am a new user of the maker pipeline and wanted to get gene predictions for a new plant genome. I used the following options in the maker_opts.ctl file for the first round:
>>>>
>>>> genome=masked_genome.fasta
>>>> est=transcripts.fasta (from same species for which genome fasta is provided)
>>>> altest=transcripts.fasta (from alternative organism)
>>>> protein=proteins.fasta
>>>>
>>>> The output files are only gff (no fasta); however, the gff for each scaffold has fasta sequences at the bottom. I wonder, is that the correct output? In order to train SNAP, I used gff3_merge to concatenate all gffs from datastore_index.log to get all.gff (which also has fasta sequences). Then all.gff was used for maker2zff, and it generated zero-size files (genome.ann and genome.dna). I am wondering whether I made a mistake or did not provide all input files. For repeat masking I used RepeatMasker separately from the maker pipeline. My datastore_index.log file shows many "RETRY" and "FAILED" scaffolds.
>>>>
>>>> FYI, I subscribed to the "maker-devel" google group but the "new topic" button is greyed out.
>>>>
>>>> Your suggestions?
>>>>
>>>> Thanks in advance.
>>>> Parul
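The RETRY and FAILED entries mentioned above can be tallied per scaffold with a short script. This is a hedged sketch in Python, not a MAKER tool: it assumes each line of datastore_index.log holds three tab-separated fields (sequence name, output directory, status) and that the last status logged for a sequence is its final one. The function names `final_statuses` and `status_counts` are hypothetical.

```python
from collections import Counter


def final_statuses(lines):
    """Map each sequence name to the last status logged for it.

    Assumption: fields are tab-separated as (name, directory, status).
    Lines with fewer than three fields are skipped.
    """
    last = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        last[fields[0]] = fields[2].strip()
    return last


def status_counts(lines):
    """Count how many sequences ended in each final status."""
    return Counter(final_statuses(lines).values())


if __name__ == "__main__":
    example = [
        "scaffold_a\tout_datastore/00/01/scaffold_a/\tSTARTED",
        "scaffold_a\tout_datastore/00/01/scaffold_a/\tFINISHED",
        "scaffold_b\tout_datastore/00/02/scaffold_b/\tSTARTED",
        "scaffold_b\tout_datastore/00/02/scaffold_b/\tRETRY",
        "scaffold_b\tout_datastore/00/02/scaffold_b/\tFAILED",
    ]
    for status, n in sorted(status_counts(example).items()):
        print(status, n)
```

To use it on a real run, pass an open file handle for the log instead of the example list. Scaffolds whose final status is not FINISHED are the ones worth checking in their run.log.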
From Parul.Gupta at oregonstate.edu Mon Oct 8 12:31:04 2018
From: Parul.Gupta at oregonstate.edu (Gupta, Parul)
Date: Mon, 8 Oct 2018 17:31:04 +0000
Subject: [maker-devel] maker problem
In-Reply-To: <258E6D0D-6A34-42E2-91F3-F7693ED42E7C@genetics..utah.edu>
References: <189553EC-9D1F-4C2F-8672-3562C8A4A088@oregonstate.edu> <177AD833-BA97-40CD-B500-1AD4531DE41A@gmail.com> <258E6D0D-6A34-42E2-91F3-F7693ED42E7C@genetics..utah.edu>
Message-ID: <90D768D2-F911-4BA3-A8C5-1DAE79566114@oregonstate.edu>

Alright, I had gone through all those tutorials. But my question is: why is maker generating only gff as output? There is neither transcripts.fasta nor proteins.fasta in the output directory. So I can only use gff3_merge but not fasta_merge, because there are no fasta files. This happened for all scaffolds.

Below is an example from my datastore_index.log file for one scaffold:

ScJhAqd_1;HRSCAF=2 Sh_masked_rd2_datastore/18/62/ScJhAqd_1%3BHRSCAF=2/ STARTED
ScJhAqd_1;HRSCAF=2 Sh_masked_rd2_datastore/18/62/ScJhAqd_1%3BHRSCAF=2/ FINISHED

The output directory of that scaffold looks like:

[Linux at waterman ScJhAqd_1%3BHRSCAF=2]$ ll
total 160
drwxr-xr-x 3 guptapa pi 3 Oct 5 15:51 ../
-rw-r--r-- 1 guptapa pi 27740 Oct 5 15:51 run.log
-rw-r--r-- 1 guptapa pi 34268 Oct 5 15:51 ScJhAqd_1%3BHRSCAF=2.gff
drwxr-xr-x 2 guptapa pi 75 Oct 5 15:51 theVoid.ScJhAqd_1%3BHRSCAF=2/
drwxr-xr-x 3 guptapa pi 5 Oct 5 15:51 ./

The gff looks like:

[Linux at waterman ScJhAqd_1%3BHRSCAF=2]$ head ScJhAqd_1%3BHRSCAF=2.gff
##gff-version 3
ScJhAqd_1%3BHRSCAF%3D2 . contig 1 2578 . . . ID=ScJhAqd_1%3BHRSCAF%3D2;Name=ScJhAqd_1%3BHRSCAF%3D2;
ScJhAqd_1%3BHRSCAF%3D2 blastx protein_match 1782 2024 175 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hit:0;Name=Mlong585_29391-RA;
ScJhAqd_1%3BHRSCAF%3D2 blastx match_part 1782 2024 175 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hsp:0;Parent=ScJhAqd_1%3BHRSCAF%3D2:hit:0;Name=Mlong585_29391-RA;Target=Mlong585_29391-RA 132 212;Gap=M81;
ScJhAqd_1%3BHRSCAF%3D2 blastx protein_match 1785 2578 175 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hit:1;Name=Mlong585_37101-RA;
ScJhAqd_1%3BHRSCAF%3D2 blastx match_part 2477 2578 112 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hsp:1;Parent=ScJhAqd_1%3BHRSCAF%3D2:hit:1;Name=Mlong585_37101-RA;Target=Mlong585_37101-RA 28 61;Gap=M34;
ScJhAqd_1%3BHRSCAF%3D2 blastx match_part 1785 2042 175 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hsp:2;Parent=ScJhAqd_1%3BHRSCAF%3D2:hit:1;Name=Mlong585_37101-RA;Target=Mlong585_37101-RA 154 239;Gap=M86;
ScJhAqd_1%3BHRSCAF%3D2 blastx protein_match 1806 2578 128 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hit:2;Name=Mlong585_11451-RA;
ScJhAqd_1%3BHRSCAF%3D2 blastx match_part 2471 2578 132 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hsp:3;Parent=ScJhAqd_1%3BHRSCAF%3D2:hit:2;Name=Mlong585_11451-RA;Target=Mlong585_11451-RA 117 152;Gap=M36;
ScJhAqd_1%3BHRSCAF%3D2 blastx match_part 2299 2379 89 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hsp:4;Parent=ScJhAqd_1%3BHRSCAF%3D2:hit:2;Name=Mlong585_11451-RA;Target=Mlong585_11451-RA 153 179;Gap=M27;

Regards,
Parul

From carson.holt at genetics.utah.edu Mon Oct 8 12:45:31 2018
From: carson.holt at genetics.utah.edu (Carson Hinton Holt)
Date: Mon, 8 Oct 2018 17:45:31 +0000
Subject: [maker-devel] maker problem
In-Reply-To: <90D768D2-F911-4BA3-A8C5-1DAE79566114@oregonstate.edu>
References: <189553EC-9D1F-4C2F-8672-3562C8A4A088@oregonstate.edu> <177AD833-BA97-40CD-B500-1AD4531DE41A@gmail.com> <258E6D0D-6A34-42E2-91F3-F7693ED42E7C@genetics..utah.edu> <90D768D2-F911-4BA3-A8C5-1DAE79566114@oregonstate.edu>

Look at the GFF3, particularly gene/mRNA/exon/CDS vs match/match_part features (GFF3 spec).
Does your GFF3 contain gene/mRNA/exon/CDS entries? If not, then your GFF3 has no models (it's empty even if it does contain match/match_part entries). This means either 1. no predictor was set during the run (i.e. est2genome=1 or protein2genome=1 not set), or 2. the evidence alignments or assembly are so poor that no models can be made.

Look at the results in a browser. Compare what you see on one of your contigs to what you get when running an example from the tutorials. Perhaps you provided unassembled mRNA-seq data (MAKER does not process raw mRNA-seq; it must be assembled first). Perhaps you did not provide a broad protein dataset (UniProt/Swiss-Prot is usually a good one to use, for example). Or perhaps your assembly is too fragmented and has too many runs of NNNNNN to generate matching ORFs against evidence alignments (look at results in a browser).

--Carson

On Oct 8, 2018, at 11:31 AM, Gupta, Parul wrote:

Alright, I had gone through all those tutorials. But my question is: why is maker generating only gff as output? There is neither transcripts.fasta nor proteins.fasta in the output directory. So I can only use gff3_merge but not fasta_merge, because there are no fasta files. This happened for all scaffolds.

Below is an example from my datastore_index.log file for one scaffold:

ScJhAqd_1;HRSCAF=2 Sh_masked_rd2_datastore/18/62/ScJhAqd_1%3BHRSCAF=2/ STARTED
ScJhAqd_1;HRSCAF=2 Sh_masked_rd2_datastore/18/62/ScJhAqd_1%3BHRSCAF=2/ FINISHED

The output directory of that scaffold looks like:

[Linux at waterman ScJhAqd_1%3BHRSCAF=2]$ ll
total 160
drwxr-xr-x 3 guptapa pi 3 Oct 5 15:51 ../
-rw-r--r-- 1 guptapa pi 27740 Oct 5 15:51 run.log
-rw-r--r-- 1 guptapa pi 34268 Oct 5 15:51 ScJhAqd_1%3BHRSCAF=2.gff
drwxr-xr-x 2 guptapa pi 75 Oct 5 15:51 theVoid.ScJhAqd_1%3BHRSCAF=2/
drwxr-xr-x 3 guptapa pi 5 Oct 5 15:51 ./

The gff looks like:

[Linux at waterman ScJhAqd_1%3BHRSCAF=2]$ head ScJhAqd_1%3BHRSCAF=2.gff
##gff-version 3
ScJhAqd_1%3BHRSCAF%3D2 . contig 1 2578 . . . ID=ScJhAqd_1%3BHRSCAF%3D2;Name=ScJhAqd_1%3BHRSCAF%3D2;
ScJhAqd_1%3BHRSCAF%3D2 blastx protein_match 1782 2024 175 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hit:0;Name=Mlong585_29391-RA;
ScJhAqd_1%3BHRSCAF%3D2 blastx match_part 1782 2024 175 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hsp:0;Parent=ScJhAqd_1%3BHRSCAF%3D2:hit:0;Name=Mlong585_29391-RA;Target=Mlong585_29391-RA 132 212;Gap=M81;
ScJhAqd_1%3BHRSCAF%3D2 blastx protein_match 1785 2578 175 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hit:1;Name=Mlong585_37101-RA;
ScJhAqd_1%3BHRSCAF%3D2 blastx match_part 2477 2578 112 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hsp:1;Parent=ScJhAqd_1%3BHRSCAF%3D2:hit:1;Name=Mlong585_37101-RA;Target=Mlong585_37101-RA 28 61;Gap=M34;
ScJhAqd_1%3BHRSCAF%3D2 blastx match_part 1785 2042 175 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hsp:2;Parent=ScJhAqd_1%3BHRSCAF%3D2:hit:1;Name=Mlong585_37101-RA;Target=Mlong585_37101-RA 154 239;Gap=M86;
ScJhAqd_1%3BHRSCAF%3D2 blastx protein_match 1806 2578 128 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hit:2;Name=Mlong585_11451-RA;
ScJhAqd_1%3BHRSCAF%3D2 blastx match_part 2471 2578 132 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hsp:3;Parent=ScJhAqd_1%3BHRSCAF%3D2:hit:2;Name=Mlong585_11451-RA;Target=Mlong585_11451-RA 117 152;Gap=M36;
ScJhAqd_1%3BHRSCAF%3D2 blastx match_part 2299 2379 89 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hsp:4;Parent=ScJhAqd_1%3BHRSCAF%3D2:hit:2;Name=Mlong585_11451-RA;Target=Mlong585_11451-RA 153 179;Gap=M27;

Regards,
Parul

On Oct 8, 2018, at 11:34 AM, Carson Hinton Holt wrote:

GFF3 should have the assembly fasta at the bottom. That is part of the format. Please familiarize yourself with GFF3 here -> https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md

Particularly look at the different kinds of expected features (for example gene/mRNA/exon/CDS gene models vs match/match_part evidence alignments).

Also you need to familiarize yourself with the MAKER documentation, and perhaps follow one of the step-by-step tutorials in the MAKER wiki (http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Main_Page). The 2014 tutorial has a video you can follow along with. Output files are described in the documentation and the wiki. Particularly look at the necessary gff3_merge and fasta_merge scripts described in the wiki with multiple examples.

From carson.holt at genetics.utah.edu Mon Oct 8 13:08:49 2018
From: carson.holt at genetics.utah.edu (Carson Hinton Holt)
Date: Mon, 8 Oct 2018 18:08:49 +0000
Subject: [maker-devel] maker problem
In-Reply-To:
References: <189553EC-9D1F-4C2F-8672-3562C8A4A088@oregonstate.edu> <177AD833-BA97-40CD-B500-1AD4531DE41A@gmail.com> <258E6D0D-6A34-42E2-91F3-F7693ED42E7C@genetics..utah.edu> <90D768D2-F911-4BA3-A8C5-1DAE79566114@oregonstate.edu>
Message-ID:

Also run BUSCO on your assembly.
It will give you an estimate of how complete/incomplete your genome assembly is. Also make sure you are running on a genome assembly and not a transcriptome assembly (MAKER does not annotate transcriptomes).

--Carson

From Parul.Gupta at oregonstate.edu Mon Oct 8 14:12:27 2018
From: Parul.Gupta at oregonstate.edu (Gupta, Parul)
Date: Mon, 8 Oct 2018 19:12:27 +0000
Subject: [maker-devel] maker problem
In-Reply-To:
References: <189553EC-9D1F-4C2F-8672-3562C8A4A088@oregonstate.edu> <177AD833-BA97-40CD-B500-1AD4531DE41A@gmail.com> <258E6D0D-6A34-42E2-91F3-F7693ED42E7C@genetics..utah.edu> <90D768D2-F911-4BA3-A8C5-1DAE79566114@oregonstate.edu>
Message-ID:

OK, let me explain my case.

Genome- eukaryote
We ran BUSCO and there is no problem with the genome assembly. I used RepeatMasker (separately from the maker pipeline) to mask repeats using a custom-generated library (de novo repeats plus a repeat library from other species). The masked genome was used as input in maker_opts.ctl.

Transcripts-
We have RNA-Seq data assembled using velvet/oases from the same species as the sequenced genome. I globally aligned the transcripts to the assembled genome using GMAP, which gave ~99% mapping. The gff3 generated by GMAP was also checked in a genome browser. Those transcripts were used as est input in maker_opts.ctl. These assembled transcripts may have redundancy.

Proteins-
I used protein sequences (fasta) downloaded from UniProt for 5 closely related species and one set from an in-house sequenced genome (already published). Protein sequences from all 6 organisms are concatenated in one file and used as protein evidence in maker_opts.ctl.

altest=transcripts.fasta (from in-house sequenced genome (already published))
est2genome=1
protein2genome=1

Sorry for not explaining my case initially.
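Settings like the ones listed in this message can be sanity-checked before a run, since a run with no prediction method enabled yields evidence alignments but no gene models. The following is only a minimal sketch under stated assumptions: it treats the control file as plain `key=value` lines with `#` comments, the option names `snaphmm` and `augustus_species` are recalled from the standard maker_opts.ctl rather than taken from this thread and should be checked against your own file, and the function names and example values are hypothetical.

```python
def parse_ctl(text):
    """Parse key=value lines from a MAKER-style control file, dropping '#' comments."""
    opts = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if "=" in line:
            key, value = line.split("=", 1)
            opts[key.strip()] = value.strip()
    return opts


def enabled_predictors(opts):
    """List the prediction sources that appear to be switched on."""
    found = []
    # Evidence-based model building, named in this thread.
    for flag in ("est2genome", "protein2genome"):
        if opts.get(flag) == "1":
            found.append(flag)
    # Ab initio predictors; these option names are assumptions (see lead-in).
    for key in ("snaphmm", "augustus_species"):
        if opts.get(key):
            found.append(key)
    return found


# Hypothetical first-round control file with nothing enabled.
example_ctl = """
genome=masked_genome.fasta
est=transcripts.fasta
protein=proteins.fasta
est2genome=0 #hypothetical comment text
protein2genome=0
snaphmm=
augustus_species=
"""

predictors = enabled_predictors(parse_ctl(example_ctl))
if predictors:
    print("prediction sources enabled:", ", ".join(predictors))
else:
    print("no prediction source enabled; expect alignments only, no gene models")
```

An empty result here corresponds to the situation described earlier in the thread, where the per-scaffold GFF3 contains only match/match_part features.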
What other files can I use as est evidence? Can I use Augustus-generated hints for gene prediction along with the above options? Your thoughts?

Parul

From carsonhh at gmail.com Mon Oct 8 15:11:26 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 8 Oct 2018 14:11:26 -0600
Subject: [maker-devel] maker problem
In-Reply-To:
References: <189553EC-9D1F-4C2F-8672-3562C8A4A088@oregonstate.edu> <177AD833-BA97-40CD-B500-1AD4531DE41A@gmail.com> <258E6D0D-6A34-42E2-91F3-F7693ED42E7C@genetics..utah.edu> <90D768D2-F911-4BA3-A8C5-1DAE79566114@oregonstate.edu>
Message-ID: <41ABA575-D58A-4FE0-83CD-9312617AA635@gmail.com>

> We had run BUSCO and there is no problem in genome assembly. I used RepeatMasker (separately from maker pipeline) for masking the repeats using custom generated library (denovo repeats and repeat library from other species as well). The masked genome was used as input in maker_opts.ctl.

Let MAKER run masking if possible. Also, BUSCO can be used to train Augustus, which can then become the gene predictor in MAKER.

> Transcripts-
> We have RNA-Seq data assembled using velvet/oases from the same species as for genome sequenced. I globally aligned transcripts over assembled genome using GMAP, which gave ~99% mapping. Gff3 generated from GMAP was also checked on genome browser. Those transcripts were used as est input in maker_opts.ctl. These assembled transcripts may have redundancy.

est2genome doesn't work with est_gff. You must provide a fasta of assembled transcripts. You can revert back to the GFF3 if you want after training.

> Proteins-
> I used protein (fasta seq) sequences downloaded from uniprot for 5 closely related species and one from in-house sequenced genome (already published).
> Protein sequences from all 6 organisms are concatenated in one file and used as protein evidence in maker_opts.ctl.

Look at the contigs in a browser. Find a contig with protein2genome results in the GFF3 (i.e. the column is marked protein2genome in the GFF3), and look at it specifically. If you don't find any, then the issue is either your pre-masking or the evidence proteins you gave. I'd recommend using UniProt/Swiss-Prot, which contains a broad set of curated and conserved proteins.

> altest=transcripts.fasta (from in-house sequenced genome (already published))

These will be ignored until you have a trained HMM (this type of alignment can only be used as hints to the trained predictor).

--Carson

From liorglic at mail.tau.ac.il Wed Oct 17 09:27:06 2018
From: liorglic at mail.tau.ac.il (Lior Glick)
Date: Wed, 17 Oct 2018 17:27:06 +0300
Subject: [maker-devel] Problem compiling MAKER with Intel MPI
Message-ID:

Hello,

I am trying to compile MAKER with Intel MPI. We are using a cluster based on the Intel x86_64 architecture and use lmod for environment variables. All required dependencies have already been installed and the initial 'perl Build.PL' passes without issues (see attached). When running './Build install' it always fails to find 'sys/types.h' and exits (see additional attachment). The Build command probably searches for the '/usr/include/sys/types.h' file, but no matter which variable (INCLUDE, PERL5LIB, etc.) I update with the required path (either '/usr/include' or '/usr/include/sys'), it keeps failing.

I would appreciate your input. Thanks a lot!

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Build.PL.out
Type: application/octet-stream
Size: 2032 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Build_install.out Type: application/octet-stream Size: 6312 bytes Desc: not available URL: From anthony.bretaudeau at inria.fr Thu Oct 18 08:52:03 2018 From: anthony.bretaudeau at inria.fr (Anthony Bretaudeau) Date: Thu, 18 Oct 2018 15:52:03 +0200 Subject: [maker-devel] Segfault with OpenMPI In-Reply-To: <8D151E3B-353F-4FD5-94DB-95C1125A8176@gmail.com> References: <8D151E3B-353F-4FD5-94DB-95C1125A8176@gmail.com> Message-ID: An HTML attachment was scrubbed... URL: From Parul.Gupta at oregonstate.edu Mon Oct 8 15:40:06 2018 From: Parul.Gupta at oregonstate.edu (Gupta, Parul) Date: Mon, 8 Oct 2018 20:40:06 +0000 Subject: [maker-devel] maker problem In-Reply-To: <41ABA575-D58A-4FE0-83CD-9312617AA635@gmail.com> References: <189553EC-9D1F-4C2F-8672-3562C8A4A088@oregonstate.edu> <177AD833-BA97-40CD-B500-1AD4531DE41A@gmail.com> <258E6D0D-6A34-42E2-91F3-F7693ED42E7C@genetics..utah.edu> <90D768D2-F911-4BA3-A8C5-1DAE79566114@oregonstate.edu> <41ABA575-D58A-4FE0-83CD-9312617AA635@gmail.com> Message-ID: <20878280-1B0C-4CC5-BD92-20FB57A44662@oregonstate.edu> I used Augustus to generate training set (separately from maker) based on transcripts (fasta) so how I can use that Augustus generated trained data (hints in gff3 format) in maker for gene prediction? I can see only Augustus species option there in maker_opts.ctl. Which option I need to turn on in opts.ctl to put Augustus generated hints file? I have augustus.gff as predicted hints. est2genome doesn't work with est_gff. You must provide fasta of assembled transcripts. You can revert back to the GFF3 if you want after training. I used est_fasta not the est_gff. Find a contig with protein2genome results in the GFF3 yes I can see protein2genome results in gff3: ScJhAqd_2184%3BHRSCAF%3D3164 protein2genome protein_match 31566 32621 1426 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446673;Name=Mlong585_07911-RA; ScJhAqd_2184%3BHRSCAF%3D3164 protein2genome match_part 31566 31775 1426 + . 
ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1532540;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446673;Name=Mlong585_07911-RA;Target=Mlong585_07911-RA 82 154;Gap=M14 I3 M56; ScJhAqd_2184%3BHRSCAF%3D3164 protein2genome match_part 31872 32621 1426 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1532541;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446673;Name=Mlong585_07911-RA;Target=Mlong585_07911-RA 155 409;Gap=M126 I5 M124; ScJhAqd_2184%3BHRSCAF%3D3164 protein2genome protein_match 33816 35829 1394 - . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446674;Name=Mlong585_12901-RA; ScJhAqd_2184%3BHRSCAF%3D3164 protein2genome match_part 34916 35829 1394 - . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1532542;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446674;Name=Mlong585_12901-RA;Target=Mlong585_12901-RA 41 343;Gap=M27 D1 M276 F2; ScJhAqd_2184%3BHRSCAF%3D3164 protein2genome match_part 33816 34182 1394 - . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1532543;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446674;Name=Mlong585_12901-RA;Target=Mlong585_12901-RA 344 466;Gap=R2 M123; ScJhAqd_2184%3BHRSCAF%3D3164 protein2genome protein_match 49636 51466 1091 - . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446675;Name=Mlong585_07901-RA; ScJhAqd_2184%3BHRSCAF%3D3164 protein2genome match_part 51354 51466 1091 - . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1532544;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446675;Name=Mlong585_07901-RA;Target=Mlong585_07901-RA 1 36;Gap=M20 D1 M16 F2; and est2genome in gff3 as well: ScJhAqd_2184%3BHRSCAF%3D3164 est2genome expressed_sequence_match 48887305 48890708 16239 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547163;Name=Sh_Salba_v2_61181; ScJhAqd_2184%3BHRSCAF%3D3164 est2genome match_part 48887305 48889881 16239 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1871792;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547163;Name=Sh_Salba_v2_61181;Target=Sh_Salba_v2_61181 1 2590 +;Gap=M285 D1 M288 I10 M5 I4 M1998; ScJhAqd_2184%3BHRSCAF%3D3164 est2genome match_part 48889982 48890708 16239 + . 
ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1871793;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547163;Name=Sh_Salba_v2_61181;Target=Sh_Salba_v2_61181 2591 3317 +;Gap=M727; ScJhAqd_2184%3BHRSCAF%3D3164 est2genome expressed_sequence_match 48887305 48890708 16412 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547164;Name=Sh_Salba_v2_61182; ScJhAqd_2184%3BHRSCAF%3D3164 est2genome match_part 48887305 48889881 16412 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1871794;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547164;Name=Sh_Salba_v2_61182;Target=Sh_Salba_v2_61182 1 2590 +;Gap=M285 D1 M288 I10 M5 I4 M1998; ScJhAqd_2184%3BHRSCAF%3D3164 est2genome match_part 48889949 48890708 16412 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1871795;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547164;Name=Sh_Salba_v2_61182;Target=Sh_Salba_v2_61182 2591 3350 +;Gap=M760; ScJhAqd_2184%3BHRSCAF%3D3164 est2genome expressed_sequence_match 48895479 48899036 9582 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547165;Name=Sh_Salba_v2_108280; Thanks, Parul On Oct 8, 2018, at 3:11 PM, Carson Holt > wrote: We had run BUSCO and there is no problem in genome assembly. I used RepeatMasker (separately from maker pipeline) for masking the repeats using custom generated library (denovo repeats and repeat library from other species as well). The masked genome was used as input in maker_opts.ctl. Let MAKER run masking if possible. Also BUSCO can be used to train Augustus which can then become the gene predictor in MAKER. Transcripts- We have RNA-Seq data assembled using velvet /oases from the same species as for genome sequenced. I globally aligned transcripts over assembled genome using GMAP with gave ~99% mapping. Gff3 generated from GMAP was also checked on genome browser. Those transcripts were used as est input in maker_opts.ctl. These assembled transcripts may have redundancy. est2genome doesn't work with est_gff. You must provide fasta of assembled transcripts. You can revert back to the GFF3 if you want after training. 
>> Proteins-
>> I used protein (fasta seq) sequences downloaded from uniprot for 5 closely related species and one from in-house sequenced genome (already published). Protein sequences from all 6 organisms are concatenated in one file and used as protein evidence in maker_opts.ctl.
>
> Look at the contigs in a browser. Find a contig with protein2genome results in the GFF3 (i.e. the column is marked protein2genome in the GFF3), and look at it specifically. If you don't find any, then the issue is either your pre-masking or the evidence proteins you gave. I'd recommend using UniProt/Swiss-Prot, which contains a broad set of curated and conserved proteins.
>
>> altest=transcripts.fasta (from in-house sequenced genome (already published))
>
> These will be ignored until you have a trained HMM (this type of alignment can only be used as hints to the trained predictor).
>
> --Carson

From peachandolives at gmail.com Fri Oct 12 03:23:07 2018
From: peachandolives at gmail.com (Linnie Linnie)
Date: Fri, 12 Oct 2018 10:23:07 +0200
Subject: [maker-devel] maker-level google group
Message-ID:

Dear maker team,

I hope this email finds you well. I am a member of the maker-devel google group, but, somehow, I cannot post questions. Is there anything I can do on my end to fix this?

Also, I was wondering where I can download maker3 (I cannot seem to find it online). I have been using maker2, but I wanted to use EVM, and I have read that maker3 implements it.

Thank you so much for your help,

Linnie

From yli at utexas.edu Tue Oct 16 23:49:13 2018
From: yli at utexas.edu (Yiyuan Li)
Date: Tue, 16 Oct 2018 23:49:13 -0500
Subject: [maker-devel] Speed up maker annotation on long scaffolds
Message-ID: <4361720F-0F1B-43DA-8931-218CCCD71AF4@utexas.edu>

Dear Maker support,
I have a quick question about annotating chromosome-level scaffolds.
I have a new genome assembly from Hi-C data. The top 4 scaffolds are chromosome-level, which are ~100-170M bp long. I tried to use Maker MPI but it runs slow. Each scaffold has been running for weeks. I was wondering if you may have any suggestions on how to make the annotation process faster?

Thank you!

YY

From peachandolives at gmail.com Thu Oct 18 03:29:57 2018
From: peachandolives at gmail.com (Linnie Linnie)
Date: Thu, 18 Oct 2018 10:29:57 +0200
Subject: [maker-devel] maker3
Message-ID:

Dear maker team,

I am trying to run maker and use its output as input for EVM. From the EVM website, I gather that I need to provide it with .gff files. Maker2 does output one .gff, but I was wondering how to produce separate .gff files for the protein and EST data.

Alternatively, I have read that maker3 implements EVM. I would be happy to try this option, but I don't know where I can download maker3 from.

I would appreciate any help. Thank you very much!

Linnie

From carsonhh at gmail.com Fri Oct 19 12:02:22 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 19 Oct 2018 11:02:22 -0600
Subject: [maker-devel] maker problem
In-Reply-To: <20878280-1B0C-4CC5-BD92-20FB57A44662@oregonstate.edu>
References: <189553EC-9D1F-4C2F-8672-3562C8A4A088@oregonstate.edu> <177AD833-BA97-40CD-B500-1AD4531DE41A@gmail.com> <258E6D0D-6A34-42E2-91F3-F7693ED42E7C@genetics..utah.edu> <90D768D2-F911-4BA3-A8C5-1DAE79566114@oregonstate.edu> <41ABA575-D58A-4FE0-83CD-9312617AA635@gmail.com> <20878280-1B0C-4CC5-BD92-20FB57A44662@oregonstate.edu>
Message-ID: <3F78E884-11AF-4291-A8FC-D81F6F55B47D@gmail.com>

Once Augustus is trained it will have a new species directory under .../augustus/config/species/ for the organism you just trained. Or if you trained Augustus elsewhere (website, BUSCO, etc.), you have to copy the species data there.
Then you just supply the species name and Augustus automatically finds it (see Augustus documentation on training).

For est2genome=1 and protein2genome=1, MAKER takes the alignments from exonerate protein2genome and est2genome and, if they are mostly open reading frame, just turns them directly into gene/mRNA/exon/CDS models. If there are none of those in the resulting GFF3 but there are est2genome and protein2genome alignments, then all of them have a broken ORF. That means there are serious issues with your assembly, or with the EST fasta or protein fasta file. For a protein fasta, I recommend using UniProt/Swiss-Prot because it is manually curated and contains a broad dataset. But if you cannot get gene models from UniProt/Swiss-Prot protein2genome alignments, then your assembly has issues (either too fragmented, lots of errors inducing random stop codons, or lots of N's interspersed in the sequence).

--Carson

From carsonhh at gmail.com Fri Oct 19 12:09:30 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 19 Oct 2018 11:09:30 -0600
Subject: [maker-devel] Speed up maker annotation on long scaffolds
In-Reply-To: <4361720F-0F1B-43DA-8931-218CCCD71AF4@utexas.edu>
References: <4361720F-0F1B-43DA-8931-218CCCD71AF4@utexas.edu>
Message-ID: <28BAD1D1-77BA-4F50-A54F-7E402589E76F@gmail.com>

You might not have MPI set up correctly. MPI spread across 10 machines (20 cores each) can annotate an entire maize chromosome in ~20 minutes.

A few tests.
# This command should print all the hosts you are running MPI on and how many cores are used on each host. If you don't see multiple hosts, you are not spreading across machines.
mpiexec hostname | sort | uniq -c

# This will let you know if maker is running MPI correctly (it should print the help message only once).
mpiexec maker -h

--Carson

> On Oct 16, 2018, at 10:49 PM, Yiyuan Li wrote:
>
> Dear Maker support,
> I have a quick question about annotating chromosome-level scaffolds. I have a new genome assembly from Hi-C data. The top 4 scaffolds are chromosome-level, which are ~100-170M bp long. I tried to use Maker MPI but it runs slow. Each scaffold has been running for weeks. I was wondering if you may have any suggestions on how to make the annotation process faster?
>
> Thank you!
>
> YY
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From carsonhh at gmail.com Fri Oct 19 12:22:12 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 19 Oct 2018 11:22:12 -0600
Subject: [maker-devel] Segfault with OpenMPI
In-Reply-To:
References: <8D151E3B-353F-4FD5-94DB-95C1125A8176@gmail.com>
Message-ID: <78C1AB95-8D23-4D71-939B-5B68666BE5B7@gmail.com>

RepeatMasker does some data prep during installation (it creates new files in the process), and that does not happen with the bioconda RepeatMasker recipe, so it's broken.

For fixing it, look at the homebrew recipe for RepeatMasker. It does a good job, and it also preconfigures itself for the free Dfam database rather than RepBase light -->
https://github.com/brewsci/homebrew-bio/blob/master/Formula/repeatmasker.rb

te_proteins is not a RepeatMasker file. It's a RepeatRunner file which has been integrated into MAKER. MAKER just needs to be able to find it. It will look in the .../maker/data/ directory by default and put the location in te_protein= by default.

--Carson

> On Oct 18, 2018, at 7:52 AM, Anthony Bretaudeau wrote:
>
> Hi,
>
> I think I finally found a solution for this segfault. In short: run "export THREADS_DAEMON_MODEL=1" before running maker.
>
> After looking at the debug log, I noticed that the segfault happened the first time the perl system() function was called (usually to launch a "mv" command).
>
> This + the backtrace shows that it has something to do with signal handling when running child processes from threads.
>
> After a lot of trial and error modifying the code, I found this page talking about this env var: https://metacpan.org/pod/forks#Co-existance-with-fork-aware-modules-and-environments
> It seems to be enough to avoid the segfault. I have no idea if it could have any downside, but maker seems to give the same results as in non-mpi mode.
>
> Concerning RepeatMasker not being installed correctly, it seems to be intended, as written in the RepeatMasker conda recipe: https://github.com/bioconda/bioconda-recipes/blob/master/recipes/repeatmasker/build.sh#L16
> I use the REPEATMASKER_LIB_DIR env var so it's not really a problem for me, and the galaxy tool is doing the same (https://github.com/galaxyproject/tools-iuc/blob/master/tools/maker/maker.xml#L11).
>
> I'm not a RepeatMasker expert, so I don't know if providing the old database would make more sense...
>
> I guess it's the same question for te_proteins.
>
> Cheers
>
> Anthony
>
> On 05/10/2018 at 22:37, Carson Holt wrote:
>> I tried setting this up but there are a number of issues I run into.
>>
>> First RepeatMasker is not being installed correctly. The configuration step should create these files (created by ./configure script during RepeatMasker setup) -->
>> RepeatMasker.lib
>> RepeatMasker.lib.nhr
>> RepeatMasker.lib.nin
>> RepeatMasker.lib.nsq
>> RepeatMaskerLib.embl
>>
>> But they do not exist in the share directory.
>>
>> Also MAKER needs access to the te_proteins file in .../maker/data, and because you have rearranged maker's structure it can't find it.
>>
>> Then for the segmentation fault, I have seen this a handful of times in the past where users install their own version of perl rather than using the system perl together with their own install of OpenMPI. The issue is some series of flags either in OpenMPI or perl (I'm not sure which). But one way around it is to disable the interpreter threads option when compiling and installing perl for yourself. Most system perl installs have interpreter threads enabled, so I'm not sure why some self-installs generate this segfault and never the system perl. Interestingly, interpreter threads are turned off by default when you install perl manually as they are "officially discouraged". You actually have to enable them during the self-install process, and conda is enabling them on the manual install to match most system perls.
>>
>> Another workaround is don't use OpenMPI. Try MPICH3.
>>
>> --Carson
>>
>>> On Sep 25, 2018, at 6:10 AM, Anthony Bretaudeau wrote:
>>>
>>> Hi,
>>>
>>> I've worked on the Bioconda recipe for Maker (https://github.com/bioconda/bioconda-recipes/tree/master/recipes/maker/). It works well, except when using it in MPI mode. I get this segfault error:
>>>
>>> STATUS: Processing and indexing input FASTA files...
>>> [cl1n022:06306] *** Process received signal *** >>> [cl1n022:06306] Signal: Segmentation fault (11) >>> [cl1n022:06306] Signal code: Address not mapped (1) >>> [cl1n022:06306] Failing at address: 0x514 >>> [cl1n022:06306] [ 0] /lib64/libpthread.so.0(+0xf6d0)[0x2b9ce51026d0] >>> [cl1n022:06306] [ 1] /local/miniconda3/envs/maker-2.31.10/bin/perl(Perl_csighandler+0x1e)[0x4aad4e] >>> [cl1n022:06306] [ 2] /lib64/libpthread.so.0(+0xf6d0)[0x2b9ce51026d0] >>> [cl1n022:06306] [ 3] /lib64/libc.so.6(__poll+0x2d)[0x2b9ce5f5cf0d] >>> [cl1n022:06306] [ 4] /local/miniconda3/envs/maker-2.31.10/perl/lib/auto/Parallel/Application/MPI/../../../../../../lib/./libopen-pal.so.40(+0x869e5)[0x2b9cf05859e5] >>> [cl1n022:06306] [ 5] /local/miniconda3/envs/maker-2.31.10/perl/lib/auto/Parallel/Application/MPI/../../../../../../lib/./libopen-pal.so.40(opal_libevent2022_event_base_loop+0x242)[0x2b9cf057a73a] >>> [cl1n022:06306] [ 6] /local/miniconda3/envs/maker-2.31.10/perl/lib/auto/Parallel/Application/MPI/../../../../../../lib/./libopen-pal.so.40(+0x384de)[0x2b9cf05374de] >>> [cl1n022:06306] [ 7] /lib64/libpthread.so.0(+0x7e25)[0x2b9ce50fae25] >>> [cl1n022:06306] [ 8] /lib64/libc.so.6(clone+0x6d)[0x2b9ce5f67bad] >>> [cl1n022:06306] *** End of error message *** >>> SIGTERM received >>> SIGTERM received >>> >>> >>> As mentioned in older posts, I've tried adding the LD_PRELOAD variable, or running mpirun with the "-mca btl ^openib" option, but it didn't help. >>> >>> As this happens with the Bioconda package, I guess it should be pretty reproducible on other setups. >>> >>> Bioconda's Maker package uses version 5.26.2 of Perl and version 3.1.2 of OpenMPI, and the OpenMPI recipe is on https://github.com/conda-forge/openmpi-feedstock/tree/master/recipe >>> Any help would be highly appreciated! 
>>>
>>> Anthony Bretaudeau

From carsonhh at gmail.com Fri Oct 19 12:25:40 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 19 Oct 2018 11:25:40 -0600
Subject: [maker-devel] maker3
In-Reply-To:
References:
Message-ID: <1D30ACCC-1DC4-451E-8553-8AB8ADA269A2@gmail.com>

The maker 3 beta is one of the links when you register to download maker. It will be the link directly under the stable release link -->
http://yandell.topaz.genetics.utah.edu/cgi-bin/maker_license.cgi

Also you can use grep to pull out specific lines of a GFF3 file. Example:

grep -P "\tprotein2genome\t" all.gff > protein2genome.gff

That command will grab all the protein2genome features out of a file.

--Carson
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From jacques.dainat at nbis.se Tue Oct 23 08:56:09 2018 From: jacques.dainat at nbis.se (Jacques Dainat) Date: Tue, 23 Oct 2018 15:56:09 +0200 Subject: [maker-devel] CIGAR string explanation Message-ID: <8783564C-A8FA-419A-A651-EE53C1563A7F@nbis.se> Hello, Here an example of the cigar string output from exonerate (exactly the same command as launched by MAKER) cigar: P46461.1 3 740 . genome 460484 439594 - 2580 M 84 I 1 D 56 M 154 I 3 M 54 D 1554 M 145 D 3346 M 137 D 120 M 160 D 197 M 182 D 145 M 165 D 415 M 170 D 5037 M 321 D 124 M 158 D 116 M 183 D 1819 M 157 D 5776 M 115 vulgar: P46461.1 3 740 . genome 460484 439594 - 2580 M 28 84 G 1 0 S 0 2 5 0 2 I 0 50 3 0 2 S 1 1 M 51 153 G 3 0 M 18 54 S 0 2 5 0 2 I 0 1548 3 0 2 S 1 1 M 48 144 S 0 1 5 0 2 I 0 3341 3 0 2 S 1 2 M 45 135 S 0 2 5 0 2 I 0 114 3 0 2 S 1 1 M 53 159 S 0 1 5 0 2 I 0 192 3 0 2 S 1 2 M 60 180 5 0 2$ -- completed exonerate analysis and here the result we get in the protein2genome.gff output from MAKER @000426F|arrow|arrow protein2genome protein_match 439595 460484 2580 - . ID=@000426F|arrow|arrow:hit:153696:3.10.0.4;Name=P46461.1;target_length=745;aligned_coverage=98.93;aligned_identity=72.6 @000426F|arrow|arrow protein2genome match_part 460399 460484 2580 - . ID=@000426F|arrow|arrow:hsp:233933:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 4 32;Gap=F2 I1 M28 @000426F|arrow|arrow protein2genome match_part 460135 460344 2580 - . ID=@000426F|arrow|arrow:hsp:233934:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 33 105;Gap=F2 M18 I3 M52 R2 @000426F|arrow|arrow protein2genome match_part 458437 458582 2580 - . ID=@000426F|arrow|arrow:hsp:233935:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 106 154;Gap=F1 M49 R2 @000426F|arrow|arrow protein2genome match_part 454953 455091 2580 - . 
ID=@000426F|arrow|arrow:hsp:233936:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 155 200;Gap=F2 M46 R1 @000426F|arrow|arrow protein2genome match_part 454674 454834 2580 - . ID=@000426F|arrow|arrow:hsp:233937:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 201 254;Gap=F1 M54 R2 @000426F|arrow|arrow protein2genome match_part 454296 454477 2580 - . ID=@000426F|arrow|arrow:hsp:233938:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 255 315;Gap=M61 R1 @000426F|arrow|arrow protein2genome match_part 453985 454150 2580 - . ID=@000426F|arrow|arrow:hsp:233939:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 316 370;Gap=F1 M55 @000426F|arrow|arrow protein2genome match_part 453401 453570 2580 - . ID=@000426F|arrow|arrow:hsp:233940:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 371 427;Gap=M57 R1 @000426F|arrow|arrow protein2genome match_part 448042 448363 2580 - . ID=@000426F|arrow|arrow:hsp:233941:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 428 534;Gap=F1 M107 @000426F|arrow|arrow protein2genome match_part 447761 447918 2580 - . ID=@000426F|arrow|arrow:hsp:233942:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 535 587;Gap=M53 R1 @000426F|arrow|arrow protein2genome match_part 447460 447644 2580 - . ID=@000426F|arrow|arrow:hsp:233943:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 588 648;Gap=F2 M61 @000426F|arrow|arrow protein2genome match_part 445484 445642 2580 - . ID=@000426F|arrow|arrow:hsp:233944:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 649 701;Gap=F2 M53 R2 @000426F|arrow|arrow protein2genome match_part 439595 439709 2580 - . ID=@000426F|arrow|arrow:hsp:233945:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 702 740;Gap=M39 R2 MAKER apparently process the CIGAR string and save it into the Gap attribute. 
The value looks like a CIGAR string but it is different. Here are the different letters we can find (M, D, I, R, F). I guess M=match, D=deletion and I=insertion, but I don't get the meaning of the R and F.
Could you explain their meanings?

Best regards,

/Jacques
-------------------------------------------------
Jacques Dainat, Ph.D.
NBIS (National Bioinformatics Infrastructure Sweden)
Genome Annotation Service
http://nbis.se/about/staff/jacques-dainat
http://nbis.se

-- Contact --
Address: Uppsala University, Biomedicinska Centrum
Department of Medical Biochemistry Microbiology, Genomics
Husargatan 3, box 582
S-75123 Uppsala Sweden
Phone: +46 18 471 46 25

From carsonhh at gmail.com Tue Oct 23 10:55:51 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 23 Oct 2018 09:55:51 -0600
Subject: [maker-devel] CIGAR string explanation
In-Reply-To: <8783564C-A8FA-419A-A651-EE53C1563A7F@nbis.se>
References: <8783564C-A8FA-419A-A651-EE53C1563A7F@nbis.se>
Message-ID: <9B7CB8C1-2272-4E2A-A435-73642920B623@gmail.com>

Once upon a time the link in the official GFF3 specification to the CIGAR string documentation actually worked, and it would bring you to a nice page that explained everything. It described how the F and R were to be used on protein space alignments (F is a forward frame shift and R is a reverse frame shift in the alignment). M1 in protein space is actually an amino acid match (matches 3 bp in nucleotide space); this was previously clear in the now broken link. Likewise, I1 is an amino acid insertion (3 bp in nucleotide space), and D1 is an amino acid deletion (3 bp in nucleotide space). F and R therefore allow for single bp movement either to the left or right within amino acid space.
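
As a rough illustration of the rules above, here is a minimal Python sketch (not MAKER code; the names parse_gap and gap_spans are made up for this example). It assumes M, I and D count amino acids while F and R count single nucleotides, that I consumes protein residues only, and that R steps back on the genome. That length arithmetic is an assumption on my part, but it agrees with the match_part coordinates posted in this thread.

```python
import re

# One Gap token: an operation letter followed by a length, e.g. "M28".
_GAP_TOKEN = re.compile(r"([MIDFR])(\d+)")


def parse_gap(gap):
    """Split a Gap value such as 'F2 I1 M28' into (op, length) pairs."""
    ops = []
    for token in gap.split():
        match = _GAP_TOKEN.fullmatch(token)
        if match is None:
            raise ValueError(f"unrecognised Gap token: {token!r}")
        ops.append((match.group(1), int(match.group(2))))
    return ops


def gap_spans(gap):
    """Return (genome_bp, protein_aa) covered by a protein-space Gap value.

    Assumed accounting (hypothetical, checked only against this thread):
      M: n aligned residues, 3*n bp on the genome
      I: n residues present only in the protein
      D: n codons (3*n bp) present only in the genome
      F: n extra bp on the genome (forward frameshift)
      R: n bp stepped back on the genome (reverse frameshift)
    """
    genome_bp = 0
    protein_aa = 0
    for op, n in parse_gap(gap):
        if op == "M":
            genome_bp += 3 * n
            protein_aa += n
        elif op == "I":
            protein_aa += n
        elif op == "D":
            genome_bp += 3 * n
        elif op == "F":
            genome_bp += n
        elif op == "R":
            genome_bp -= n
    return genome_bp, protein_aa


# match_part 460399..460484 is 86 bp; Target 4..32 is 29 residues.
print(gap_spans("F2 I1 M28"))  # (86, 29)
```

Comparing the two numbers with (end - start + 1) of the match_part and of its Target range is a quick sanity check when reading these records by eye.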
Sometimes this happens in Exonerate where it appears as a slightly shifted codon (codons look stacked), but it also happens when an amino acid is split across a splice site (1st part of a codon is on one exon and the second part on the next exon).

The raw exonerate cigar you show doesn't have this because it's only half the cigar and it's in nucleotide space; the value shown in Gap= has to be in the same space as the Target= feature, which in this case is a protein. So we build the protein cigar string from the vulgar string according to the now broken documentation on Gap attributes. You have 28 amino acid matches, 1 insertion, and then an amino acid split across the intron (1 bp of the codon on one side and 2 bp on the other side), and it's flipped because the alignment happens on the opposite strand.

--Carson
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From peachandolives at gmail.com Wed Oct 24 04:28:52 2018
From: peachandolives at gmail.com (Linnie Linnie)
Date: Wed, 24 Oct 2018 05:28:52 -0400
Subject: [maker-devel] EVM control file and est2genome
Message-ID:

Hi,

I am trying to run MAKER together with EVM. I want to annotate a genome for which there is no evidence data, which is why I am using ESTs and protein data from a closely related species. I have run into two unrelated issues.

The first one is the following:
I set up the control files passing alt_est with a fasta file of ESTs, protein with proteins from a closely related species as well as uniprot-sprot.fa, est2genome=1 and protein2genome=1. I am getting the following error:

>ERROR: You must provide some form of EST evidence to use est2genome as a predictor.

Does this mean I can only use est2genome with ESTs from the species of interest?

The second error relates to EVM:
In the file maker_opts.ctl I have set the option run_evm=1. I have used the default parameters in the file maker_evm.ctl. I am getting the following error:

>ERROR: You have failed to provide a value for 'evm' in the control files.

Does this error relate to the maker_opts.ctl file or the maker_evm.ctl one? How could I fix it?

And lastly, a more general but fundamental question. Is my approach sensible? My plan is to run this evidence-based annotation, then perhaps train SNAP, Augustus and GeneMark, and use those output files to re-run MAKER with ab-initio parameters.

I would appreciate any input on any of these issues.

Thank you!
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From anthony.bretaudeau at inria.fr Wed Oct 24 10:07:48 2018
From: anthony.bretaudeau at inria.fr (Anthony Bretaudeau)
Date: Wed, 24 Oct 2018 17:07:48 +0200
Subject: [maker-devel] Segfault with OpenMPI
In-Reply-To: <78C1AB95-8D23-4D71-939B-5B68666BE5B7@gmail.com>
References: <8D151E3B-353F-4FD5-94DB-95C1125A8176@gmail.com> <78C1AB95-8D23-4D71-939B-5B68666BE5B7@gmail.com>
Message-ID:

An HTML attachment was scrubbed...

From carsonhh at gmail.com Wed Oct 24 10:46:30 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 24 Oct 2018 09:46:30 -0600
Subject: [maker-devel] Segfault with OpenMPI
In-Reply-To:
References: <8D151E3B-353F-4FD5-94DB-95C1125A8176@gmail.com> <78C1AB95-8D23-4D71-939B-5B68666BE5B7@gmail.com>
Message-ID: <62EBFA6C-4194-4D65-8313-F67EFCAEF47A@gmail.com>

It divides up pieces of contigs as well as individual steps. BLAST, exonerate, snap, augustus can each run on separate machines.

--Carson

> On Oct 24, 2018, at 9:07 AM, Anthony Bretaudeau wrote:
>
> Hi,
>
> I'll see if I can improve the conda recipe.
>
> Just one simple question: how does Maker divide the work between worker nodes in MPI mode? Is it supposed to be 1 contig per node, or are the largest contigs split into smaller chunks, each one potentially processed on a different node? From my tests I have the feeling it is the first answer, but I'm not sure whether that is normal.
>
> Anthony
>
> On 19/10/2018 at 19:22, Carson Holt wrote:
>> RepeatMasker does some data prep during installation (creates new files in the process), and that does not happen for the bioconda RepeatMasker recipe. So it's broken.
>>
>> For fixing it, look at the homebrew recipe for RepeatMasker. It does a good job where they also have it preconfigure itself for the free Dfam database rather than RepBase light -->
>>
>> https://github.com/brewsci/homebrew-bio/blob/master/Formula/repeatmasker.rb
>>
>> te_proteins is not a RepeatMasker file. It's a RepeatRunner file which has been integrated into MAKER. MAKER just needs to be able to find it. It will look in the .../maker/data/ directory by default and put the location in te_protein= by default.
>>
>> --Carson
>>
>>> On Oct 18, 2018, at 7:52 AM, Anthony Bretaudeau wrote:
>>>
>>> Hi,
>>>
>>> I think I finally found a solution for this segfault. In short: run "export THREADS_DAEMON_MODEL=1" before running maker.
>>>
>>> After looking at the debug log, I noticed that the segfault happened the first time the perl system() function was called (usually to launch a "mv" command).
>>>
>>> This + the backtrace shows that it has something to do with signal handling when running child processes from threads.
>>>
>>> After a lot of trial and error modifying the code, I found this page talking about this env var: https://metacpan.org/pod/forks#Co-existance-with-fork-aware-modules-and-environments
>>> It seems to be enough to avoid the segfault. I have no idea if it could have any downside, but maker seems to give the same results as in non-MPI mode.
>>>
>>> Concerning RepeatMasker not being installed correctly, it seems to be intended, as written in the RepeatMasker conda recipe: https://github.com/bioconda/bioconda-recipes/blob/master/recipes/repeatmasker/build.sh#L16
>>> I use the REPEATMASKER_LIB_DIR env var so it's not really a problem for me, and the galaxy tool is doing the same (https://github.com/galaxyproject/tools-iuc/blob/master/tools/maker/maker.xml#L11 ).
>>>
>>> I'm not a RepeatMasker expert, so I don't know if providing the old database would make more sense...
>>>
>>> I guess it's the same question for te_proteins.
>>>
>>> Cheers
>>>
>>> Anthony
>>>
>>> On 05/10/2018 at 22:37, Carson Holt wrote:
>>>> I tried setting this up but there are a number of issues I run into.
>>>>
>>>> First RepeatMasker is not being installed correctly. The configuration step should create these files (created by ./configure script during RepeatMasker setup) -->
>>>> RepeatMasker.lib
>>>> RepeatMasker.lib.nhr
>>>> RepeatMasker.lib.nin
>>>> RepeatMasker.lib.nsq
>>>> RepeatMaskerLib.embl
>>>>
>>>> But they do not exist in the share directory.
>>>>
>>>> Also MAKER needs access to the te_proteins file in .../maker/data, and because you have rearranged maker's structure it can't find it.
>>>>
>>>> Then for the Segmentation fault, I have seen this a handful of times in the past where users install their own version of perl rather than using the system perl together with their own install of OpenMPI. The issue is some series of flags either in OpenMPI or perl (I'm not sure which). But one way around it is to disable the interpreter threads option when compiling and installing perl for yourself. Most system perl installs have interpreter threads enabled, so I'm not sure why some self-installs generate this segfault and never the system perl. Interestingly, interpreter threads are turned off by default when you install perl manually as they are "officially discouraged". You actually have to enable them during the self-install process, and conda is enabling them on the manual install to match most system perls.
>>>>
>>>> Another workaround is to not use OpenMPI. Try MPICH3.
>>>>
>>>> --Carson
>>>>
>>>>> On Sep 25, 2018, at 6:10 AM, Anthony Bretaudeau wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I've worked on the Bioconda recipe for Maker (https://github.com/bioconda/bioconda-recipes/tree/master/recipes/maker/ ). It works well, except when using it in MPI mode. I get this segfault error:
>>>>>
>>>>> STATUS: Processing and indexing input FASTA files...
>>>>> [cl1n022:06306] *** Process received signal ***
>>>>> [cl1n022:06306] Signal: Segmentation fault (11)
>>>>> [cl1n022:06306] Signal code: Address not mapped (1)
>>>>> [cl1n022:06306] Failing at address: 0x514
>>>>> [cl1n022:06306] [ 0] /lib64/libpthread.so.0(+0xf6d0)[0x2b9ce51026d0]
>>>>> [cl1n022:06306] [ 1] /local/miniconda3/envs/maker-2.31.10/bin/perl(Perl_csighandler+0x1e)[0x4aad4e]
>>>>> [cl1n022:06306] [ 2] /lib64/libpthread.so.0(+0xf6d0)[0x2b9ce51026d0]
>>>>> [cl1n022:06306] [ 3] /lib64/libc.so.6(__poll+0x2d)[0x2b9ce5f5cf0d]
>>>>> [cl1n022:06306] [ 4] /local/miniconda3/envs/maker-2.31.10/perl/lib/auto/Parallel/Application/MPI/../../../../../../lib/./libopen-pal.so.40(+0x869e5)[0x2b9cf05859e5]
>>>>> [cl1n022:06306] [ 5] /local/miniconda3/envs/maker-2.31.10/perl/lib/auto/Parallel/Application/MPI/../../../../../../lib/./libopen-pal.so.40(opal_libevent2022_event_base_loop+0x242)[0x2b9cf057a73a]
>>>>> [cl1n022:06306] [ 6] /local/miniconda3/envs/maker-2.31.10/perl/lib/auto/Parallel/Application/MPI/../../../../../../lib/./libopen-pal.so.40(+0x384de)[0x2b9cf05374de]
>>>>> [cl1n022:06306] [ 7] /lib64/libpthread.so.0(+0x7e25)[0x2b9ce50fae25]
>>>>> [cl1n022:06306] [ 8] /lib64/libc.so.6(clone+0x6d)[0x2b9ce5f67bad]
>>>>> [cl1n022:06306] *** End of error message ***
>>>>> SIGTERM received
>>>>> SIGTERM received
>>>>>
>>>>> As mentioned in older posts, I've tried adding the LD_PRELOAD variable, or running mpirun with the "-mca btl ^openib" option, but it didn't help.
>>>>>
>>>>> As this happens with the Bioconda package, I guess it should be pretty reproducible on other setups.
>>>>>
>>>>> Bioconda's Maker package uses version 5.26.2 of Perl and version 3.1.2 of OpenMPI, and the OpenMPI recipe is on https://github.com/conda-forge/openmpi-feedstock/tree/master/recipe
>>>>> Any help would be highly appreciated!
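For reference, the workaround quoted in this thread (setting THREADS_DAEMON_MODEL=1 before launching MAKER under MPI) might be scripted roughly as below. This is only a sketch under stated assumptions: the helper name build_maker_mpi_command and the process count are made up for illustration, and forwarding the variable to the ranks with Open MPI's -x option is an assumption about what a multi-node run needs, not something confirmed in the thread.

```python
import os


def build_maker_mpi_command(n_procs, extra_mpirun_args=()):
    """Return (env, argv) for launching MAKER under Open MPI.

    Hypothetical helper: it only assembles the environment and the
    command line; it does not start anything itself.
    """
    env = dict(os.environ)
    # Workaround reported in the thread for the segfault in MPI mode.
    env["THREADS_DAEMON_MODEL"] = "1"
    argv = [
        "mpirun",
        "-n", str(n_procs),
        # Assumption: also export the variable to every rank (Open MPI's -x).
        "-x", "THREADS_DAEMON_MODEL",
        *extra_mpirun_args,
        "maker",
    ]
    return env, argv
```

The result could then be passed to subprocess.run(argv, env=env, check=True) from the directory holding the MAKER control files; extra_mpirun_args leaves room for options such as the "-mca btl ^openib" setting mentioned above.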
>>>>>
>>>>> Anthony Bretaudeau

From carsonhh at gmail.com Wed Oct 24 10:50:43 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 24 Oct 2018 09:50:43 -0600
Subject: [maker-devel] EVM control file and est2genome
In-Reply-To:
References:
Message-ID: <3CB0FDB0-8B7D-4CF8-B957-5935166D5305@gmail.com>

est2genome only works with the data given to est=.

For the second error, you must provide the path to the evm executable in maker_exe.ctl. It apparently was not in your PATH, so it didn't get filled in automatically.

Here is an example from the wiki of using est2genome and protein2genome to train SNAP for the next MAKER run -->
http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_WGS_Assembly_and_Annotation_Winter_School_2018#Training_ab_initio_Gene_Predictors

--Carson

From jacques.dainat at nbis.se Wed Oct 24 03:41:05 2018
From: jacques.dainat at nbis.se (Jacques Dainat)
Date: Wed, 24 Oct 2018 10:41:05 +0200
Subject: [maker-devel] CIGAR string explanation
In-Reply-To: <9B7CB8C1-2272-4E2A-A435-73642920B623@gmail.com>
References: <8783564C-A8FA-419A-A651-EE53C1563A7F@nbis.se> <9B7CB8C1-2272-4E2A-A435-73642920B623@gmail.com>
Message-ID:

Thanks for your response.
It's surprising that the link on the Sequence Ontology web site doesn't work anymore. I will notify them.
I was surprised that I could not find any resource on the internet describing these values. Helped by your answer, I refined my keywords and googled again, and I finally found old resources describing them too:
from 2004, FlyBase: http://rice.bio.indiana.edu:7082/annot/gff3.html
from 2010, WormBase: http://wiki.wormbase.org/index.php/GFF3specProposal

I put a copy of the WormBase description here in case those resources also disappear. At that time it seems it was not yet officially accepted by the SO.
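The Gap values discussed in this thread can be checked against the match_part coordinates with a few lines of Python. The sketch below assumes the reading given in Carson's explanation: M, I and D are counted in amino acids (3 bp each), while F and R are single-bp forward and reverse frame shifts. That I consumes only protein positions and D only genome positions is my assumption from the usual GFF3 Gap convention, and the function names are made up for illustration.

```python
import re

_GAP_OP = re.compile(r"([MIDFR])(\d+)")


def parse_gap(gap):
    """Split a Gap attribute value such as 'F2 I1 M28' into (op, length) pairs."""
    ops = []
    for token in gap.split():
        match = _GAP_OP.fullmatch(token)
        if match is None:
            raise ValueError(f"unexpected Gap token: {token!r}")
        ops.append((match.group(1), int(match.group(2))))
    return ops


def gap_spans(gap):
    """Return (genome_bp, protein_aa) covered by a protein-space Gap value."""
    genome_bp = 0
    protein_aa = 0
    for op, length in parse_gap(gap):
        if op == "M":      # amino acid match: 3 bp on the genome
            genome_bp += 3 * length
            protein_aa += length
        elif op == "I":    # assumed: residues present only in the protein
            protein_aa += length
        elif op == "D":    # assumed: codons present only in the genome
            genome_bp += 3 * length
        elif op == "F":    # forward frame shift, counted in bp
            genome_bp += length
        elif op == "R":    # reverse frame shift, counted in bp
            genome_bp -= length
    return genome_bp, protein_aa
```

Under these assumptions, "F2 I1 M28" gives 86 bp and 29 residues, which agrees with the first match_part in the thread (460399 to 460484 on the genome, Target positions 4 to 32).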
/Jacques

> On 23 Oct 2018, at 17:55, Carson Holt wrote:
>
> Once upon a time the link in the official GFF3 specification to the CIGAR string documentation actually worked, and it would bring you to a nice page that explained everything. It described how F and R were to be used on protein-space alignments (F is a forward frame shift and R is a reverse frame shift in the alignment). M1 in protein space is actually an amino acid match (matches 3 bp in nucleotide space); this was previously clear in the now broken link. At the same time I1 is an amino acid insertion (3 bp in nucleotide space), and D1 is an amino acid deletion (3 bp in nucleotide space). F and R therefore allow for single-bp movement either to the left or right within amino acid space. Sometimes this happens in Exonerate where it appears as a slightly shifted codon (codons look stacked), but it also happens when an amino acid is split across a splice site (the first part of a codon is on one exon and the second part on the next exon). The raw exonerate cigar you show below doesn't have this because it's only half the cigar and it's in nucleotide space; the value shown in Gap= has to be in the same space as the Target= feature, which in this case is a protein. So we build the protein cigar string from the vulgar string according to the now broken documentation on Gap attributes. You have 28 amino acid matches, 1 insertion, and then an amino acid split across the intron (1 bp of the codon on one side and 2 bp on the other side), and it's flipped because the alignment happens on the opposite strand.
>
> --Carson
>
>> On Oct 23, 2018, at 7:56 AM, Jacques Dainat wrote:
>>
>> Hello,
>>
>> Here is an example of the cigar string output from exonerate (exactly the same command as launched by MAKER)
>>
>> cigar: P46461.1 3 740 . genome 460484 439594 - 2580 M 84 I 1 D 56 M 154 I 3 M 54 D 1554 M 145 D 3346 M 137 D 120 M 160 D 197 M 182 D 145 M 165 D 415 M 170 D 5037 M 321 D 124 M 158 D 116 M 183 D 1819 M 157 D 5776 M 115
>> vulgar: P46461.1 3 740 . genome 460484 439594 - 2580 M 28 84 G 1 0 S 0 2 5 0 2 I 0 50 3 0 2 S 1 1 M 51 153 G 3 0 M 18 54 S 0 2 5 0 2 I 0 1548 3 0 2 S 1 1 M 48 144 S 0 1 5 0 2 I 0 3341 3 0 2 S 1 2 M 45 135 S 0 2 5 0 2 I 0 114 3 0 2 S 1 1 M 53 159 S 0 1 5 0 2 I 0 192 3 0 2 S 1 2 M 60 180 5 0 2$
>> -- completed exonerate analysis
>>
>> and here is the result we get in the protein2genome.gff output from MAKER
>>
>> @000426F|arrow|arrow protein2genome protein_match 439595 460484 2580 - . ID=@000426F|arrow|arrow:hit:153696:3.10.0.4;Name=P46461.1;target_length=745;aligned_coverage=98.93;aligned_identity=72.6
>> @000426F|arrow|arrow protein2genome match_part 460399 460484 2580 - . ID=@000426F|arrow|arrow:hsp:233933:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 4 32;Gap=F2 I1 M28
>> @000426F|arrow|arrow protein2genome match_part 460135 460344 2580 - . ID=@000426F|arrow|arrow:hsp:233934:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 33 105;Gap=F2 M18 I3 M52 R2
>> @000426F|arrow|arrow protein2genome match_part 458437 458582 2580 - . ID=@000426F|arrow|arrow:hsp:233935:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 106 154;Gap=F1 M49 R2
>> @000426F|arrow|arrow protein2genome match_part 454953 455091 2580 - . ID=@000426F|arrow|arrow:hsp:233936:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 155 200;Gap=F2 M46 R1
>> @000426F|arrow|arrow protein2genome match_part 454674 454834 2580 - . ID=@000426F|arrow|arrow:hsp:233937:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 201 254;Gap=F1 M54 R2
>> @000426F|arrow|arrow protein2genome match_part 454296 454477 2580 - . ID=@000426F|arrow|arrow:hsp:233938:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 255 315;Gap=M61 R1
>> @000426F|arrow|arrow protein2genome match_part 453985 454150 2580 - . ID=@000426F|arrow|arrow:hsp:233939:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 316 370;Gap=F1 M55
>> @000426F|arrow|arrow protein2genome match_part 453401 453570 2580 - . ID=@000426F|arrow|arrow:hsp:233940:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 371 427;Gap=M57 R1
>> @000426F|arrow|arrow protein2genome match_part 448042 448363 2580 - . ID=@000426F|arrow|arrow:hsp:233941:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 428 534;Gap=F1 M107
>> @000426F|arrow|arrow protein2genome match_part 447761 447918 2580 - . ID=@000426F|arrow|arrow:hsp:233942:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 535 587;Gap=M53 R1
>> @000426F|arrow|arrow protein2genome match_part 447460 447644 2580 - . ID=@000426F|arrow|arrow:hsp:233943:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 588 648;Gap=F2 M61
>> @000426F|arrow|arrow protein2genome match_part 445484 445642 2580 - . ID=@000426F|arrow|arrow:hsp:233944:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 649 701;Gap=F2 M53 R2
>> @000426F|arrow|arrow protein2genome match_part 439595 439709 2580 - . ID=@000426F|arrow|arrow:hsp:233945:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 702 740;Gap=M39 R2
>>
>> MAKER apparently processes the CIGAR string and saves it in the Gap attribute. The value looks like a CIGAR string, but it is different. These are the letters we find: M, D, I, R, F. I guess M=match, D=deletion and I=insertion, but I don't get the meaning of R and F.
>> Could you explain their meanings?
>>
>> Best regards,
>>
>> /Jacques
>> -------------------------------------------------
>> Jacques Dainat, Ph.D.
>> NBIS (National Bioinformatics Infrastructure Sweden)
>> Genome Annotation Service
>> http://nbis.se/about/staff/jacques-dainat
>> http://nbis.se
>>
>> Contact
>> Address: Uppsala University, Biomedicinska Centrum
>> Department of Medical Biochemistry Microbiology, Genomics
>> Husargatan 3, box 582
>> S-75123 Uppsala Sweden
>> Phone: +46 18 471 46 25
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Screen Shot 2018-10-24 at 10.00.41.png
Type: image/png
Size: 281561 bytes
Desc: not available
URL:

From elyssa_garza at yahoo.com Wed Oct 24 16:27:50 2018
From: elyssa_garza at yahoo.com (Elyssa Garza)
Date: Wed, 24 Oct 2018 21:27:50 +0000 (UTC)
Subject: [maker-devel] Is gene retrieval from gff possible?
In-Reply-To: <1576161756.398305.1540414096080@mail.yahoo.com>
References: <8783564C-A8FA-419A-A651-EE53C1563A7F@nbis.se> <1576161756.398305.1540414096080@mail.yahoo.com>
Message-ID: <1888825059.421524.1540416470195@mail.yahoo.com>

Hello,

I recently annotated my plant genome and would like to retrieve a particular set of genes from the MAKER results. I have a list of TAIR IDs that I am particularly interested in and was thinking about using the GFF file to help pull out the associated transcripts. I was wondering if you could advise me on the best or easiest way of obtaining the associated TAIR accession or gene model from the GFF file. I did try looking at the genes (41,779 genes) using CLCbio, but the accessions were not easily identified or found. I also looked at the protein matches (819,805 protein matches) and was able to easily find gene model matches corresponding to my target accessions. Is it wise to do this?
Can you explain why I can't find these same protein matches in the gene file? I have some ideas about why this is happening, but I am looking for support for them.

Elyssa

From pallavi.gupta at slu.edu Thu Oct 25 16:22:31 2018
From: pallavi.gupta at slu.edu (Pallavi Gupta)
Date: Thu, 25 Oct 2018 21:22:31 +0000
Subject: [maker-devel] Issue with maker
Message-ID:

Hi Team MAKER,

I am using MAKER for genome annotation in my research, but when I run it I get a strange error. I looked for a workaround on various bioinformatics forums but was unsuccessful. I would really appreciate your help with this. I have attached my nohup.out log. Please let me know if you need anything else.

Thanks,
Pallavi Gupta
-------------- next part --------------
A non-text attachment was scrubbed...
Name: nohup.out
Type: application/octet-stream
Size: 26365432 bytes
Desc: nohup.out
URL:

From 17na34 at queensu.ca Wed Oct 31 09:27:44 2018
From: 17na34 at queensu.ca (Nikolay Alabi)
Date: Wed, 31 Oct 2018 14:27:44 +0000
Subject: [maker-devel] MAKER not running properly after installation, help needed
Message-ID:

Hello,

I am attempting to annotate a garlic mustard genome using MAKER on a cluster at Queen's University. I have been following the tutorial on the wiki and was attempting to use the practice data to see if the program is running properly and to learn how to train the gene predictors. MAKER is now installed and works to an extent; however, when in use it does not work properly and cannot read/annotate a genome.
I suspect two problems are causing this. First, any time a maker command is called, it reports that an argument in forks.pm in perl5 is not correct; after trying to fix the problem, I see that the code should be correct, but the error line still occurs. Second, every time a maker command is called, another error reports an error flow occurring somewhere in perl. For instance, when I run maker -h, maker -CTL, or anything else to do with maker, the error lines occur. Would you advise me to reinstall perl and bioperl? Other than that, I believe everything else is properly installed, and I do not understand why the program is not running properly. I have even tried using different genome data; however, the same problem occurs: the run never finishes, then retries, and ultimately fails. Please let me know if there is another possible source of error.

Best regards,
Nikolay

From xvazquezc at gmail.com Mon Oct 1 18:00:43 2018
From: xvazquezc at gmail.com (Xabier Vázquez-Campos)
Date: Tue, 2 Oct 2018 10:00:43 +1000
Subject: [maker-devel] Help debugging a MAKER result
In-Reply-To:
References:
Message-ID:

Hi Lior,

without going into a lot of detail, a good model covering the repeats in your genome is extremely important, especially in genomes with a lot of repeats. If the repeat library does not have appropriate coverage, anything based on the masked genome will be affected.

The evidence you pass into Augustus to generate the gene model can have a huge impact. Aside from the repeats, BUSCO-generated gene models can under-predict:
https://groups.google.com/forum/?hl=en-GB#!topic/maker-devel/ocnDG4nq1A8
And we have seen in our lab that the gene models generated by Augustus can be very different if you provide a haploid assembly vs haploid + alternate contigs vs diploid. In general, a purely haploid assembly generates a less biased model, as it has a lower number of duplicated conserved genes present, which would otherwise unbalance the gene model towards them (at least in BUSCO-based models, but it should be extensible to any Augustus model).

Note that in the end the generated annotation is just a model/hypothesis and may require more than a bit of curation... usually increasing with more complex genomes.

Cheers,
Xabi

On Tue, 2 Oct 2018 at 05:23, Lior Glick wrote:

> Hi MAKER users,
> I am new to Maker and have just finished running my first annotations. Although the results make sense in general, I have reasons to suspect some gene models are wrong and would like your help in understanding and optimizing the results.
> My research project involves the annotation of multiple tomato varieties (individuals) which are a bit different from the published reference genome. To this end, I created de-novo assemblies of these genomes and also generated an evidence set to be used as input for Maker. Evidence consists of a large set of transcripts from various tomato varieties and conditions, as well as full protein sets from 6 plant species, including the proteins derived from the annotation of the reference - called ITAG.
> For an initial QA, I tried annotating the reference genome using my evidence data and Augustus as gene predictor. This should allow me to compare my result to the ITAG annotation, which I assume to be the "correct" answer, and see how well I'm doing. I should mention that the ITAG annotation was also created using Maker, followed by manual curation.
> I started by comparing the protein sets from my result and the ITAG set. Specifically, I ran an all-vs-all blast and took the top hits. I discovered that only about 70% of the ITAG proteins are covered by a protein from my result with a high quality alignment (evalue < 10e-5, coverage > 90%). I further investigated by running BUSCO on both protein sets and looking at BUSCOs found in ITAG but missing in my result. Attached is a screenshot from a genome browser where you can see such a case. Top track is the ITAG gene model, below is my result. Third track is the protein evidence alignments (i.e. blastx and protein2genome features), and bottom track are masked repeats.
> As you can see, there seem to be two issues with my result:
> 1. The two genes in ITAG were fused into one. I guess this is a difficult case as the genes are really close together.
> 2. The last (3') CDS of the ITAG gene was predicted to be the 3' UTR in my result. This is in fact the reason I ended up with a truncated protein and a missing BUSCO.
> This is a bit surprising to me, since there seems to be quite a lot of protein evidence supporting this region as a CDS. Can you help me figure out why the result is so? Could it be due to the small repeats detected in this region?
> Any ideas on how my result can be improved without manual curation?
>
> Many thanks!

--
Xabier Vázquez-Campos, *PhD*
*Research Associate*
NSW Systems Biology Initiative
School of Biotechnology and Biomolecular Sciences
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

From liorglic at mail.tau.ac.il Tue Oct 2 00:50:32 2018
From: liorglic at mail.tau.ac.il (Lior Glick)
Date: Tue, 2 Oct 2018 09:50:32 +0300
Subject: [maker-devel] Help debugging a MAKER result
In-Reply-To:
References:
Message-ID:

Hi Xabier, and thanks for your reply.
I forgot to mention it, but I used the annotated repeats derived from the ITAG annotation as the repeat library, so I expect these to be quite appropriate.
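As a side note, the initial QA step described in the quoted message (top BLAST hit per reference protein, filtered by e-value and coverage) could be sketched as below. This is an illustration only: it assumes a tab-separated BLAST table with the columns qseqid, sseqid, evalue, bitscore and qcovs in that order, with the reference (ITAG) proteins as queries, and it treats the e-value cutoff as an upper bound. The function name and default thresholds are hypothetical choices, not the pipeline actually used in the thread.

```python
def covered_fraction(blast_lines, all_queries, max_evalue=10e-5, min_qcov=90.0):
    """Fraction of query proteins whose best hit passes both thresholds.

    blast_lines: iterable of tab-separated rows (assumed column order:
                 qseqid, sseqid, evalue, bitscore, qcovs).
    all_queries: every query ID, including those without any hit.
    """
    queries = set(all_queries)
    if not queries:
        raise ValueError("all_queries must not be empty")

    best = {}  # qseqid -> (bitscore, evalue, qcovs) of the top-scoring hit
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        qseqid, _sseqid, evalue, bitscore, qcovs = line.split("\t")[:5]
        hit = (float(bitscore), float(evalue), float(qcovs))
        if qseqid not in best or hit[0] > best[qseqid][0]:
            best[qseqid] = hit

    covered = {
        q for q, (_score, ev, cov) in best.items()
        if q in queries and ev <= max_evalue and cov >= min_qcov
    }
    return len(covered) / len(queries)
```

Queries with no hit at all count as not covered, which is why the full list of query IDs is passed in separately from the hit table.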
I guess my question is about the way Maker makes decisions: is the fact that some repeats (simple repeats in this case) were predicted enough to change a CDS into a UTR, despite sufficient protein evidence?
I did not train Augustus myself; rather, I used the species (tomato) profile that comes with the Augustus release. Does that make sense?
As for the haploid/diploid issue - fortunately I don't have to deal with that, since cultivated tomato varieties are repeatedly selfed, so they are (almost) completely homozygous.

From xvazquezc at gmail.com Tue Oct 2 22:39:40 2018
From: xvazquezc at gmail.com (Xabier Vázquez-Campos)
Date: Wed, 3 Oct 2018 14:39:40 +1000
Subject: [maker-devel] Help debugging a MAKER result
In-Reply-To:
References:
Message-ID:

Yeah, tomato should be rather well annotated. I would double check how good the tomato genome was at the time the gene model was created. Also, creating a new Augustus model based on the first prediction run might improve things.

You have tomato in Repbase. To be sure you are not missing anything, I would still run the advanced repeat library protocol, if it isn't computationally prohibitive.

I don't know how good SNAP is for plant genomes, so it could be worth trying it on top of the Augustus predictions.

On top of this, I'd take a look into reference-based annotation tools like RATT.
This would annotate all the common regions with the reference and then curate only on the regions that cannot be annotated from the reference using your Maker annotation On Tue, 2 Oct 2018 at 16:50, Lior Glick wrote: > Hi Xabier, and thanks for your reply. > I forgot to mention it, but I used the annotated repeats derived from the > ITAG annotation as repeats library, so I expect these to be quite > appropriate. I guess my question is regarding the way Maker makes > decisions: Is the fact that some repeats (simple repeats in this case) were > predicted is enough to change a CDS into a UTR, despite sufficient protein > evidence? > I did not train Augustus myself, rather I used the species (tomato) > profile that comes with the Augustus release. Does that make sense? > As for the haploid/diploid issue - fortunately I don't have to deal with > that since cultivated tomato varieties are repeatedly selfed, so they are > (almost) completely homozygous. > > ??????? ??? ??, 2 ????? 2018 ?-3:01 ??? ?Xabier V?zquez-Campos?? xvazquezc at gmail.com??>:? > >> Hi Lior, >> >> without getting in a lot of detail a good model covering the repeats in >> your genome is extremely important, specially in genomes with a lot of >> repeats. If the repeat library does not have an appropriate coverage, >> anything based on the masked genome will be affected >> >> The evidence you pass into Augustus to generate the gene model can have a >> huge impact. Aside of the repeats, BUSCO-generated gene models can >> under-predict >> https://groups.google.com/forum/?hl=en-GB#!topic/maker-devel/ocnDG4nq1A8 >> And we have seen in our lab that the gene models generated by Augustus >> can be very different if you provide an haploid assembly vs haploid + >> alternate contigs vs diploid. In general, a purely haploid assembly >> generates a less biased model as it has lower number of duplicated >> conserved genes present, that will unbalance the gene model towards them. 
>> (at least in BUSCO-based models, but it should be extensible to any >> Augustus model) >> >> Note that in the end the generated annotation is just a model/hypothesis >> and may require more than a bit of curation... usually increasing with more >> complex genomes. >> >> Cheers, >> Xabi >> >> On Tue, 2 Oct 2018 at 05:23, Lior Glick wrote: >> >>> Hi MAKER users, >>> I am new to Maker and had just finished running my first annotations. >>> Although the results make sense in general, I have reasons to suspect some >>> gene models are wrong and would like your help in understanding and >>> optimizing the results. >>> My research project involves the annotation of multiple tomato varieties >>> (individuals) which are a bit different from the published reference >>> genome. To this end, I created de-novo assemblies of these genomes and also >>> generated an evidence set to be used as input for Maker. Evidence consist >>> of a large set of transcripts from various tomato varieties and conditions, >>> as well as full protein sets from 6 plant species, including the proteins >>> derived from the annotation of the reference - called ITAG. >>> For an initial QA, I tried annotating the reference genome using my >>> evidence data and Augustus as gene predictor. This should allow me to >>> compare my result to the ITAG annotation, which I assume to be the >>> "correct" answer, and see how well I'm doing. I should mention that ITAG >>> annotation was also created using Maker, followed by manual curation. >>> I started by comparing the protein sets from my result and the ITAT set. >>> Specifically, I ran an all-vs-all blast and took the top hits. I discovered >>> that only about 70% of the ITAG proteins are covered by a protein from my >>> result with a high quality alignment (evalue > 10e-5, coverage > 90%). I >>> further investigated by running BUSCO on both protein sets and looking at >>> BUSCOs found in ITAG but missing in my result. 
Attached is a screenshot >>> from a genome browser where you can see such a case. Top track is the ITAG >>> gene model, below is my result. Third track is the protein evidence >>> alignments (i.e blastx and protein2genome features), and bottom track are >>> masked repeats. >>> As you can see, there seems to be two issues with my result: >>> 1. The two genes in ITAG were fused into one. I guess this is a >>> difficult case as the genes are really close together. >>> 2. The last (3') CDS of the ITAG gene was predicted to be the 3' UTR in >>> my result. This is in fact the reason I ended up with a truncated protein >>> and a missing BUSCO. >>> This is a bit surprising to me, since there seems to be quite a lot of >>> protein evidence supporting this region as a CDS. Can you help me figure >>> out why is the result so? Could it be due to the small repeats detected in >>> this region? >>> Any ideas on how my result can be improved without manual curation? >>> >>> Many thanks! >>> _______________________________________________ >>> maker-devel mailing list >>> maker-devel at box290.bluehost.com >>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org >>> >> >> >> -- >> Xabier V?zquez-Campos, *PhD* >> *Research Associate* >> NSW Systems Biology Initiative >> School of Biotechnology and Biomolecular Sciences >> The University of New South Wales >> Sydney NSW 2052 AUSTRALIA >> > -- Xabier V?zquez-Campos, *PhD* *Research Associate* NSW Systems Biology Initiative School of Biotechnology and Biomolecular Sciences The University of New South Wales Sydney NSW 2052 AUSTRALIA -------------- next part -------------- An HTML attachment was scrubbed... 
From carsonhh at gmail.com Thu Oct 4 17:52:47 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 4 Oct 2018 17:52:47 -0600
Subject: [maker-devel] Help debugging a MAKER result
Message-ID: <597F5B29-71BF-409D-B5E2-E1D4611953C3@gmail.com>

I'd just like to add info on how MAKER builds predictions. MAKER itself does not generate models. In your case, Augustus produces the models. Augustus will run twice: once on its own (on a repeat-masked version of the assembly), and once again where MAKER provides it with a hints file as part of the command line used to run Augustus. The hints file is generated from the evidence alignments you provided to MAKER. The hints usually get Augustus to perform a little better than it does with training alone on a masked assembly.

Under-masking or over-masking the assembly can both confound Augustus. MAKER hard-masks complex repeats in the assembly (turns them from ATCG into N's) and soft-masks simple repeats (turns ATCG into lower-case atcg). The lower-case "soft-masking" affects BLAST alignment but not Augustus predictions (Augustus ignores it). MAKER also removes the hard-masking when it runs Augustus with the hints file. This is done because we've constrained Augustus to a smaller padded evidence cluster at the locus, and Augustus can no longer see the whole assembly.

If you want to explore how masking affects the models, you can set unmask=0. Then Augustus will run 3 times (one extra run on the unmasked assembly). You can then look at contigs in a browser to see how the masked vs unmasked models compare to each other.

-Carson
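To make the masking behaviour Carson describes above a bit more concrete, here is a minimal Python sketch. It is not MAKER code: the function names, the (start, end) interval format and the toy sequence are all assumptions made for illustration, and whether soft-masking is preserved when hard-masking is removed is likewise assumed. The last function only restates the run counts from Carson's message; note that his follow-up in this thread corrects the option value to unmask=1.

```python
def mask_sequence(seq, complex_repeats, simple_repeats):
    """Return a masked copy of seq (hypothetical helper, not part of MAKER).

    Repeats are given as 0-based, half-open (start, end) intervals.
    Simple repeats are soft-masked (lower case); complex repeats are
    hard-masked (replaced by N) and win where the two overlap.
    """
    bases = list(seq.upper())
    for start, end in simple_repeats:
        for i in range(start, end):
            bases[i] = bases[i].lower()
    for start, end in complex_repeats:
        for i in range(start, end):
            bases[i] = "N"
    return "".join(bases)


def remove_hard_masking(masked, original):
    """Restore the original bases wherever the masked copy has N.

    Soft-masked (lower-case) positions are left as they are; this is an
    assumption, the thread only says that hard-masking is removed.
    """
    return "".join(o.upper() if m == "N" else m
                   for m, o in zip(masked, original))


def augustus_passes(unmask):
    """List the Augustus runs described in the message above."""
    passes = [
        "masked assembly, no hints",
        "padded evidence locus with hints, hard-masking removed",
    ]
    if unmask:
        # Extra ab initio run on the unmasked assembly.
        passes.append("unmasked assembly, no hints")
    return passes


seq = "ATCGATCGATATATATGGCC"
masked = mask_sequence(seq, complex_repeats=[(0, 4)], simple_repeats=[(8, 16)])
print(masked)                            # NNNNATCGatatatatGGCC
print(remove_hard_masking(masked, seq))  # ATCGATCGatatatatGGCC
print(len(augustus_passes(unmask=True)))  # 3
```

Comparing the first two printed strings shows why a hard-masked exon is invisible to a predictor working on the masked assembly, while the soft-masked stretch is still readable sequence.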
> _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > > -- > Xabier V?zquez-Campos, PhD > Research Associate > NSW Systems Biology Initiative > School of Biotechnology and Biomolecular Sciences > The University of New South Wales > Sydney NSW 2052 AUSTRALIA > > > -- > Xabier V?zquez-Campos, PhD > Research Associate > NSW Systems Biology Initiative > School of Biotechnology and Biomolecular Sciences > The University of New South Wales > Sydney NSW 2052 AUSTRALIA > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From myandell at genetics.utah.edu Thu Oct 4 18:05:04 2018 From: myandell at genetics.utah.edu (Mark Yandell) Date: Fri, 5 Oct 2018 00:05:04 +0000 Subject: [maker-devel] Help debugging a MAKER result In-Reply-To: <597F5B29-71BF-409D-B5E2-E1D4611953C3@gmail.com> References: <597F5B29-71BF-409D-B5E2-E1D4611953C3@gmail.com> Message-ID: Cheers! From: maker-devel on behalf of Carson Holt Date: Thursday, October 4, 2018 at 5:52 PM To: Lior Glick Cc: Maker Mailing List Subject: Re: [maker-devel] Help debugging a MAKER result I?d just like to add info on how MAKER builds predictions. MAKER itself does not generate models. In your case, Augustus produces the models. Augustus will run twice. Once on it?s own (this will be on a repeat masked version of the assembly), and once again where MAKER provides it with a hints file as part of the command line used to run Augustus. The hints file is generated from the evidence alignments you provided to MAKER. The hints usually get Augustus to perform a little better than it does with training alone on a masked assembly. 
Under-masking or overmasking the assembly can both confound Augustus. MAKER hard masks complex repeats in the assembly (turns them from ATCG into N?s), and soft-masks simple repeats (turns ATCG into lower case actg). The lower case ?soft-masking? affects BLAST alignment but not Augustus predictions (Augustus ignores it). MAKER also removes the hard-masking when it runs Augustus with the hints file. This is done because we?ve constrained Augustus to a smaller padded evidence cluster at the locus, and Augustus can no longer see the whole assembly. If you want to explore how masking affects the models, you can set unmask=0. Then Augustus will run 3 times (one extra run on the unmasked assembly). You can then look at contigs in a browser to see how the masked vs unmasked models compare to each other. ?Carson On Oct 2, 2018, at 10:39 PM, Xabier V?zquez-Campos > wrote: Yeah, tomato should be rather well annotated. I would double check how good was the tomato genome at the time of the creation of the gene model. Also, creating a new Augustus model based on the first prediction run might improve things You have tomato on repbase. To be sure you are not missing anything, I would still run the advanced repeat library protocol, if it isn't computationally prohibitive. I don't know how good is SNAP for plant genomes, so it could be worth to try on top of the Augustus predictions. On top of this, I'd take a look into reference-based annotation tools like RATT. This would annotate all the common regions with the reference and then curate only on the regions that cannot be annotated from the reference using your Maker annotation On Tue, 2 Oct 2018 at 16:50, Lior Glick > wrote: Hi Xabier, and thanks for your reply. I forgot to mention it, but I used the annotated repeats derived from the ITAG annotation as repeats library, so I expect these to be quite appropriate. 
I guess my question is regarding the way Maker makes decisions: Is the fact that some repeats (simple repeats in this case) were predicted is enough to change a CDS into a UTR, despite sufficient protein evidence? I did not train Augustus myself, rather I used the species (tomato) profile that comes with the Augustus release. Does that make sense? As for the haploid/diploid issue - fortunately I don't have to deal with that since cultivated tomato varieties are repeatedly selfed, so they are (almost) completely homozygous. ??????? ??? ??, 2 ????? 2018 ?-3:01 ??? ?Xabier V?zquez-Campos? ?>: Hi Lior, without getting in a lot of detail a good model covering the repeats in your genome is extremely important, specially in genomes with a lot of repeats. If the repeat library does not have an appropriate coverage, anything based on the masked genome will be affected The evidence you pass into Augustus to generate the gene model can have a huge impact. Aside of the repeats, BUSCO-generated gene models can under-predict https://groups.google.com/forum/?hl=en-GB#!topic/maker-devel/ocnDG4nq1A8 And we have seen in our lab that the gene models generated by Augustus can be very different if you provide an haploid assembly vs haploid + alternate contigs vs diploid. In general, a purely haploid assembly generates a less biased model as it has lower number of duplicated conserved genes present, that will unbalance the gene model towards them. (at least in BUSCO-based models, but it should be extensible to any Augustus model) Note that in the end the generated annotation is just a model/hypothesis and may require more than a bit of curation... usually increasing with more complex genomes. Cheers, Xabi On Tue, 2 Oct 2018 at 05:23, Lior Glick > wrote: Hi MAKER users, I am new to Maker and had just finished running my first annotations. 
Although the results make sense in general, I have reasons to suspect some gene models are wrong and would like your help in understanding and optimizing the results. My research project involves the annotation of multiple tomato varieties (individuals) which are a bit different from the published reference genome. To this end, I created de-novo assemblies of these genomes and also generated an evidence set to be used as input for Maker. Evidence consist of a large set of transcripts from various tomato varieties and conditions, as well as full protein sets from 6 plant species, including the proteins derived from the annotation of the reference - called ITAG. For an initial QA, I tried annotating the reference genome using my evidence data and Augustus as gene predictor. This should allow me to compare my result to the ITAG annotation, which I assume to be the "correct" answer, and see how well I'm doing. I should mention that ITAG annotation was also created using Maker, followed by manual curation. I started by comparing the protein sets from my result and the ITAT set. Specifically, I ran an all-vs-all blast and took the top hits. I discovered that only about 70% of the ITAG proteins are covered by a protein from my result with a high quality alignment (evalue > 10e-5, coverage > 90%). I further investigated by running BUSCO on both protein sets and looking at BUSCOs found in ITAG but missing in my result. Attached is a screenshot from a genome browser where you can see such a case. Top track is the ITAG gene model, below is my result. Third track is the protein evidence alignments (i.e blastx and protein2genome features), and bottom track are masked repeats. As you can see, there seems to be two issues with my result: 1. The two genes in ITAG were fused into one. I guess this is a difficult case as the genes are really close together. 2. The last (3') CDS of the ITAG gene was predicted to be the 3' UTR in my result. 
This is in fact the reason I ended up with a truncated protein and a missing BUSCO. This is a bit surprising to me, since there seems to be quite a lot of protein evidence supporting this region as a CDS. Can you help me figure out why is the result so? Could it be due to the small repeats detected in this region? Any ideas on how my result can be improved without manual curation? Many thanks! _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.com http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -- Xabier V?zquez-Campos, PhD Research Associate NSW Systems Biology Initiative School of Biotechnology and Biomolecular Sciences The University of New South Wales Sydney NSW 2052 AUSTRALIA -- Xabier V?zquez-Campos, PhD Research Associate NSW Systems Biology Initiative School of Biotechnology and Biomolecular Sciences The University of New South Wales Sydney NSW 2052 AUSTRALIA _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.com http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Thu Oct 4 18:09:58 2018 From: carsonhh at gmail.com (Carson Holt) Date: Thu, 4 Oct 2018 18:09:58 -0600 Subject: [maker-devel] Help debugging a MAKER result In-Reply-To: <597F5B29-71BF-409D-B5E2-E1D4611953C3@gmail.com> References: <597F5B29-71BF-409D-B5E2-E1D4611953C3@gmail.com> Message-ID: One correction. I meant to say set unmask=1. ?Carson > On Oct 4, 2018, at 5:52 PM, Carson Holt wrote: > > I?d just like to add info on how MAKER builds predictions. MAKER itself does not generate models. In your case, Augustus produces the models. Augustus will run twice. Once on it?s own (this will be on a repeat masked version of the assembly), and once again where MAKER provides it with a hints file as part of the command line used to run Augustus. 
The hints file is generated from the evidence alignments you provided to MAKER. The hints usually get Augustus to perform a little better than it does with training alone on a masked assembly. > > Under-masking or overmasking the assembly can both confound Augustus. MAKER hard masks complex repeats in the assembly (turns them from ATCG into N?s), and soft-masks simple repeats (turns ATCG into lower case actg). The lower case ?soft-masking? affects BLAST alignment but not Augustus predictions (Augustus ignores it). MAKER also removes the hard-masking when it runs Augustus with the hints file. This is done because we?ve constrained Augustus to a smaller padded evidence cluster at the locus, and Augustus can no longer see the whole assembly. > > If you want to explore how masking affects the models, you can set unmask=0. Then Augustus will run 3 times (one extra run on the unmasked assembly). You can then look at contigs in a browser to see how the masked vs unmasked models compare to each other. > > ?Carson > > >> On Oct 2, 2018, at 10:39 PM, Xabier V?zquez-Campos > wrote: >> >> Yeah, tomato should be rather well annotated. >> >> I would double check how good was the tomato genome at the time of the creation of the gene model. Also, creating a new Augustus model based on the first prediction run might improve things >> >> You have tomato on repbase. To be sure you are not missing anything, I would still run the advanced repeat library protocol, if it isn't computationally prohibitive. >> >> I don't know how good is SNAP for plant genomes, so it could be worth to try on top of the Augustus predictions. >> >> On top of this, I'd take a look into reference-based annotation tools like RATT. This would annotate all the common regions with the reference and then curate only on the regions that cannot be annotated from the reference using your Maker annotation >> >> >> On Tue, 2 Oct 2018 at 16:50, Lior Glick > wrote: >> Hi Xabier, and thanks for your reply. 
>> I forgot to mention it, but I used the annotated repeats derived from the ITAG annotation as repeats library, so I expect these to be quite appropriate. I guess my question is regarding the way Maker makes decisions: Is the fact that some repeats (simple repeats in this case) were predicted is enough to change a CDS into a UTR, despite sufficient protein evidence? >> I did not train Augustus myself, rather I used the species (tomato) profile that comes with the Augustus release. Does that make sense? >> As for the haploid/diploid issue - fortunately I don't have to deal with that since cultivated tomato varieties are repeatedly selfed, so they are (almost) completely homozygous. >> >> ??????? ??? ??, 2 ????? 2018 ?-3:01 ??? ?Xabier V?zquez-Campos?? ??>:? >> Hi Lior, >> >> without getting in a lot of detail a good model covering the repeats in your genome is extremely important, specially in genomes with a lot of repeats. If the repeat library does not have an appropriate coverage, anything based on the masked genome will be affected >> >> The evidence you pass into Augustus to generate the gene model can have a huge impact. Aside of the repeats, BUSCO-generated gene models can under-predict >> https://groups.google.com/forum/?hl=en-GB#!topic/maker-devel/ocnDG4nq1A8 >> And we have seen in our lab that the gene models generated by Augustus can be very different if you provide an haploid assembly vs haploid + alternate contigs vs diploid. In general, a purely haploid assembly generates a less biased model as it has lower number of duplicated conserved genes present, that will unbalance the gene model towards them. (at least in BUSCO-based models, but it should be extensible to any Augustus model) >> >> Note that in the end the generated annotation is just a model/hypothesis and may require more than a bit of curation... usually increasing with more complex genomes. 
>> >> Cheers, >> Xabi >> >> On Tue, 2 Oct 2018 at 05:23, Lior Glick > wrote: >> Hi MAKER users, >> I am new to Maker and had just finished running my first annotations. Although the results make sense in general, I have reasons to suspect some gene models are wrong and would like your help in understanding and optimizing the results. >> My research project involves the annotation of multiple tomato varieties (individuals) which are a bit different from the published reference genome. To this end, I created de-novo assemblies of these genomes and also generated an evidence set to be used as input for Maker. Evidence consist of a large set of transcripts from various tomato varieties and conditions, as well as full protein sets from 6 plant species, including the proteins derived from the annotation of the reference - called ITAG. >> For an initial QA, I tried annotating the reference genome using my evidence data and Augustus as gene predictor. This should allow me to compare my result to the ITAG annotation, which I assume to be the "correct" answer, and see how well I'm doing. I should mention that ITAG annotation was also created using Maker, followed by manual curation. >> I started by comparing the protein sets from my result and the ITAT set. Specifically, I ran an all-vs-all blast and took the top hits. I discovered that only about 70% of the ITAG proteins are covered by a protein from my result with a high quality alignment (evalue > 10e-5, coverage > 90%). I further investigated by running BUSCO on both protein sets and looking at BUSCOs found in ITAG but missing in my result. Attached is a screenshot from a genome browser where you can see such a case. Top track is the ITAG gene model, below is my result. Third track is the protein evidence alignments (i.e blastx and protein2genome features), and bottom track are masked repeats. >> As you can see, there seems to be two issues with my result: >> 1. The two genes in ITAG were fused into one. 
I guess this is a difficult case as the genes are really close together. >> 2. The last (3') CDS of the ITAG gene was predicted to be the 3' UTR in my result. This is in fact the reason I ended up with a truncated protein and a missing BUSCO. >> This is a bit surprising to me, since there seems to be quite a lot of protein evidence supporting this region as a CDS. Can you help me figure out why is the result so? Could it be due to the small repeats detected in this region? >> Any ideas on how my result can be improved without manual curation? >> >> Many thanks! >> _______________________________________________ >> maker-devel mailing list >> maker-devel at box290.bluehost.com >> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org >> >> >> -- >> Xabier V?zquez-Campos, PhD >> Research Associate >> NSW Systems Biology Initiative >> School of Biotechnology and Biomolecular Sciences >> The University of New South Wales >> Sydney NSW 2052 AUSTRALIA >> >> >> -- >> Xabier V?zquez-Campos, PhD >> Research Associate >> NSW Systems Biology Initiative >> School of Biotechnology and Biomolecular Sciences >> The University of New South Wales >> Sydney NSW 2052 AUSTRALIA >> _______________________________________________ >> maker-devel mailing list >> maker-devel at box290.bluehost.com >> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > -------------- next part -------------- An HTML attachment was scrubbed... URL: From liorglic at mail.tau.ac.il Fri Oct 5 00:51:41 2018 From: liorglic at mail.tau.ac.il (Lior Glick) Date: Fri, 5 Oct 2018 09:51:41 +0300 Subject: [maker-devel] Help debugging a MAKER result In-Reply-To: References: <597F5B29-71BF-409D-B5E2-E1D4611953C3@gmail.com> Message-ID: Thank you both for your helpful ideas. I'm going to give them a try and see how this effects my results. Will update when I have them. Cheers indeed. ??????? ??? ??, 5 ????? 2018 ?-3:10 ??? ?Carson Holt?? :? > One correction. 
I meant to say set unmask=1. > > ?Carson > > > On Oct 4, 2018, at 5:52 PM, Carson Holt wrote: > > I?d just like to add info on how MAKER builds predictions. MAKER itself > does not generate models. In your case, Augustus produces the models. > Augustus will run twice. Once on it?s own (this will be on a repeat masked > version of the assembly), and once again where MAKER provides it with a > hints file as part of the command line used to run Augustus. The hints file > is generated from the evidence alignments you provided to MAKER. The hints > usually get Augustus to perform a little better than it does with training > alone on a masked assembly. > > Under-masking or overmasking the assembly can both confound Augustus. > MAKER hard masks complex repeats in the assembly (turns them from ATCG into > N?s), and soft-masks simple repeats (turns ATCG into lower case actg). The > lower case ?soft-masking? affects BLAST alignment but not Augustus > predictions (Augustus ignores it). MAKER also removes the hard-masking when > it runs Augustus with the hints file. This is done because we?ve > constrained Augustus to a smaller padded evidence cluster at the locus, and > Augustus can no longer see the whole assembly. > > If you want to explore how masking affects the models, you can > set unmask=0. Then Augustus will run 3 times (one extra run on the unmasked > assembly). You can then look at contigs in a browser to see how the masked > vs unmasked models compare to each other. > > ?Carson > > > On Oct 2, 2018, at 10:39 PM, Xabier V?zquez-Campos > wrote: > > Yeah, tomato should be rather well annotated. > > I would double check how good was the tomato genome at the time of the > creation of the gene model. Also, creating a new Augustus model based on > the first prediction run might improve things > > You have tomato on repbase. To be sure you are not missing anything, I > would still run the advanced repeat library protocol, if it isn't > computationally prohibitive. 
>
> I don't know how good SNAP is for plant genomes, so it could be worth
> trying on top of the Augustus predictions.
>
> On top of this, I'd take a look at reference-based annotation tools like
> RATT. This would annotate all the regions shared with the reference, and
> you would then curate only the regions that cannot be annotated from the
> reference, using your MAKER annotation.
>
>
> On Tue, 2 Oct 2018 at 16:50, Lior Glick wrote:
>
>> Hi Xabier, and thanks for your reply.
>> I forgot to mention it, but I used the annotated repeats derived from the
>> ITAG annotation as the repeat library, so I expect these to be quite
>> appropriate. I guess my question is about the way MAKER makes
>> decisions: is the fact that some repeats (simple repeats in this case) were
>> predicted enough to change a CDS into a UTR, despite sufficient protein
>> evidence?
>> I did not train Augustus myself; rather, I used the species (tomato)
>> profile that comes with the Augustus release. Does that make sense?
>> As for the haploid/diploid issue: fortunately I don't have to deal with
>> that, since cultivated tomato varieties are repeatedly selfed, so they are
>> (almost) completely homozygous.

From carsonhh at gmail.com Fri Oct 5 14:37:34 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 5 Oct 2018 14:37:34 -0600
Subject: [maker-devel] Segfault with OpenMPI
In-Reply-To:
References:
Message-ID: <8D151E3B-353F-4FD5-94DB-95C1125A8176@gmail.com>

I tried setting this up, but there are a number of issues I ran into.

First, RepeatMasker is not being installed correctly. The configuration step
(the ./configure script run during RepeatMasker setup) should create these
files -->

RepeatMasker.lib
RepeatMasker.lib.nhr
RepeatMasker.lib.nin
RepeatMasker.lib.nsq
RepeatMaskerLib.embl

But they do not exist in the share directory.

Also, MAKER needs access to the te_proteins file in .../maker/data, and
because you have rearranged MAKER's structure it can't find it.

Then for the segmentation fault: I have seen this a handful of times in the
past where users install their own version of perl, rather than using the
system perl, together with their own install of OpenMPI. The issue is some
series of flags either in OpenMPI or perl (I'm not sure which).
But one way around it is to disable the interpreter threads option when
compiling and installing perl for yourself. Most system perl installs have
interpreter threads enabled, so I'm not sure why some self-installs generate
this segfault while the system perl never does. Interestingly, interpreter
threads are turned off by default when you install perl manually, as they are
"officially discouraged". You actually have to enable them during the
self-install process, and conda is enabling them on the manual install to
match most system perls.

Another workaround is to not use OpenMPI. Try MPICH3.

--Carson

> On Sep 25, 2018, at 6:10 AM, Anthony Bretaudeau wrote:
>
> Hi,
>
> I've worked on the Bioconda recipe for Maker (https://github.com/bioconda/bioconda-recipes/tree/master/recipes/maker/). It works well, except when using it in MPI mode. I get this segfault error:
>
> STATUS: Processing and indexing input FASTA files...
> [cl1n022:06306] *** Process received signal ***
> [cl1n022:06306] Signal: Segmentation fault (11)
> [cl1n022:06306] Signal code: Address not mapped (1)
> [cl1n022:06306] Failing at address: 0x514
> [cl1n022:06306] [ 0] /lib64/libpthread.so.0(+0xf6d0)[0x2b9ce51026d0]
> [cl1n022:06306] [ 1] /local/miniconda3/envs/maker-2.31.10/bin/perl(Perl_csighandler+0x1e)[0x4aad4e]
> [cl1n022:06306] [ 2] /lib64/libpthread.so.0(+0xf6d0)[0x2b9ce51026d0]
> [cl1n022:06306] [ 3] /lib64/libc.so.6(__poll+0x2d)[0x2b9ce5f5cf0d]
> [cl1n022:06306] [ 4] /local/miniconda3/envs/maker-2.31.10/perl/lib/auto/Parallel/Application/MPI/../../../../../../lib/./libopen-pal.so.40(+0x869e5)[0x2b9cf05859e5]
> [cl1n022:06306] [ 5] /local/miniconda3/envs/maker-2.31.10/perl/lib/auto/Parallel/Application/MPI/../../../../../../lib/./libopen-pal.so.40(opal_libevent2022_event_base_loop+0x242)[0x2b9cf057a73a]
> [cl1n022:06306] [ 6] /local/miniconda3/envs/maker-2.31.10/perl/lib/auto/Parallel/Application/MPI/../../../../../../lib/./libopen-pal.so.40(+0x384de)[0x2b9cf05374de]
> [cl1n022:06306] [ 7] /lib64/libpthread.so.0(+0x7e25)[0x2b9ce50fae25]
> [cl1n022:06306] [ 8] /lib64/libc.so.6(clone+0x6d)[0x2b9ce5f67bad]
> [cl1n022:06306] *** End of error message ***
> SIGTERM received
> SIGTERM received
>
> As mentioned in older posts, I've tried adding the LD_PRELOAD variable, or running mpirun with the "-mca btl ^openib" option, but it didn't help.
>
> As this happens with the Bioconda package, I guess it should be pretty reproducible on other setups.
>
> Bioconda's Maker package uses version 5.26.2 of Perl and version 3.1.2 of OpenMPI, and the OpenMPI recipe is on https://github.com/conda-forge/openmpi-feedstock/tree/master/recipe
> Any help would be highly appreciated!
>
> Anthony Bretaudeau

From carson.holt at genetics.utah.edu Mon Oct 8 10:34:22 2018
From: carson.holt at genetics.utah.edu (Carson Hinton Holt)
Date: Mon, 8 Oct 2018 16:34:22 +0000
Subject: [maker-devel] maker problem
In-Reply-To:
References: <189553EC-9D1F-4C2F-8672-3562C8A4A088@oregonstate.edu> <177AD833-BA97-40CD-B500-1AD4531DE41A@gmail.com>
Message-ID: <258E6D0D-6A34-42E2-91F3-F7693ED42E7C@genetics..utah.edu>

GFF3 should have the assembly FASTA at the bottom. That is part of the format. Please familiarize yourself with GFF3 here --> https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md

Particularly look at the different kinds of expected features (for example, gene/mRNA/exon/CDS gene models vs match/match_part evidence alignments).

Also, you need to familiarize yourself with the MAKER documentation, and perhaps follow one of the step-by-step tutorials in the MAKER wiki (http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Main_Page). The 2014 tutorial has a video you can follow along with.
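The difference between gene-model features and evidence-alignment features mentioned above can be checked by tallying the feature type column of a GFF3 file. The following is a minimal sketch, assuming a tab-delimited GFF3 file with nine columns and an optional trailing `##FASTA` section; the function names `tally_feature_types` and `has_gene_models` are illustrative and are not MAKER scripts.

```python
from collections import Counter

# Feature types that make up a gene model, as listed in the message above.
MODEL_TYPES = ("gene", "mRNA", "exon", "CDS")


def tally_feature_types(gff_lines):
    """Count GFF3 feature types (column 3), stopping at the ##FASTA section."""
    counts = Counter()
    for line in gff_lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # everything after this directive is sequence, not features
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        counts[cols[2]] += 1
    return counts


def has_gene_models(counts):
    """True if at least one gene-model feature type was seen."""
    return any(counts[t] > 0 for t in MODEL_TYPES)


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        with open(sys.argv[1]) as handle:
            counts = tally_feature_types(handle)
        for feature_type, n in sorted(counts.items()):
            print(feature_type, n)
        print("gene models present:", has_gene_models(counts))
```

A file that reports only match, match_part, or protein_match features holds evidence alignments but no gene models.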
Output files are described in the documentation and the wiki. Particularly look at the necessary gff3_merge and fasta_merge scripts, which are described in the wiki with multiple examples.

Individual contigs will have results like so -->

contig-dpp-500-500.gff
contig-dpp-500-500.maker.proteins.fasta
contig-dpp-500-500.maker.transcripts.fasta

The merge scripts will collect all the individual contig results into merged files.

Example datasets for all of the wiki tutorials are included in the .../maker/data directory as well as the .../maker/MWAS/data/ directory (you can use them to follow along with the wiki pages).

If you follow the tutorial steps for training SNAP on a new genome and you get empty training files, then the issue is the evidence training sets you gave (example from the e-mail list archive) --> https://groups.google.com/forum/#!searchin/maker-devel/maker2zff%7Csort:date/maker-devel/TculOM5oxl4/UWENIGN7EQAJ

You can also browse through the archive for more info on training SNAP and Augustus.

--Carson

On Oct 8, 2018, at 10:12 AM, Gupta, Parul wrote:

Hi Carson,

As per your suggestion, I turned on est2genome=1 and protein2genome=1, but similar results are generated. The GFF of each scaffold has FASTA (transcript) sequence at the end, instead of transcripts.fasta and protein.fasta being generated separately. I don't know how to use such GFFs for further processing, such as training SNAP (for gene prediction). I need your suggestion.

Is there an option to provide trained data from Augustus (generated with standalone Augustus rather than from MAKER) instead of an Augustus species in maker_opts.ctl?

Thanks,
Parul

On Oct 4, 2018, at 6:43 PM, Gupta, Parul wrote:

Thank you Carson.

Sent from my iPad

On Oct 4, 2018, at 3:11 PM, Carson Holt wrote:

You must turn on at least 1 prediction method. It can be est2genome=1, protein2genome=1, or a species file to run SNAP/Augustus. The first two options are for building models to train with. If you don't provide a prediction method, MAKER will align evidence, but you won't get any gene models.

Example: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_WGS_Assembly_and_Annotation_Winter_School_2018#Training_ab_initio_Gene_Predictors

--Carson

On Oct 1, 2018, at 1:05 PM, Gupta, Parul wrote:

Hi Carson,

I am a new user of the MAKER pipeline and want to get gene predictions for a new plant genome. I used the following options in the maker_opts.ctl file for the first round:

genome=masked_genome.fasta
est=transcripts.fasta (from same species for which genome fasta is provided)
altest=transcripts.fasta (from alternative organism)
protein=proteins.fasta

The output files are only GFF (no FASTA); however, the GFF for each scaffold has FASTA sequences at the bottom. I wonder, is that the correct output?

In order to train SNAP, I used gff3_merge to concatenate all GFFs from datastore_index.log to get all.gff (which also has FASTA sequences). Then all.gff was used for maker2zff, and it generated zero-size files (genome.ann and genome.dna). I am wondering whether I made a mistake or did not provide all the input files. For repeat masking I used RepeatMasker separately from the MAKER pipeline.

My datastore_index.log file shows many "RETRY" and "FAILED" scaffolds.

FYI, I subscribed to the "maker-devel" Google group but the "new topic" button is greyed out.

Your suggestions?

Thanks in advance.
Parul
-------------- next part --------------
An HTML attachment was scrubbed...
From Parul.Gupta at oregonstate.edu Mon Oct 8 11:31:04 2018
From: Parul.Gupta at oregonstate.edu (Gupta, Parul)
Date: Mon, 8 Oct 2018 17:31:04 +0000
Subject: [maker-devel] maker problem
In-Reply-To: <258E6D0D-6A34-42E2-91F3-F7693ED42E7C@genetics..utah.edu>
References: <189553EC-9D1F-4C2F-8672-3562C8A4A088@oregonstate.edu> <177AD833-BA97-40CD-B500-1AD4531DE41A@gmail.com> <258E6D0D-6A34-42E2-91F3-F7693ED42E7C@genetics..utah.edu>
Message-ID: <90D768D2-F911-4BA3-A8C5-1DAE79566114@oregonstate.edu>

Alright, I had gone through all those tutorials. But my question is: why is MAKER generating only GFF as output? There is neither transcripts.fasta nor proteins.fasta in the output directory, so I can only use gff3_merge but not fasta_merge, because there are no FASTA files. This happened for all scaffolds.

Below is an example from my datastore_index.log file for one scaffold:

ScJhAqd_1;HRSCAF=2 Sh_masked_rd2_datastore/18/62/ScJhAqd_1%3BHRSCAF=2/ STARTED
ScJhAqd_1;HRSCAF=2 Sh_masked_rd2_datastore/18/62/ScJhAqd_1%3BHRSCAF=2/ FINISHED

The output directory of that scaffold looks like:

[Linux at waterman ScJhAqd_1%3BHRSCAF=2]$ ll
total 160
drwxr-xr-x 3 guptapa pi 3 Oct 5 15:51 ../
-rw-r--r-- 1 guptapa pi 27740 Oct 5 15:51 run.log
-rw-r--r-- 1 guptapa pi 34268 Oct 5 15:51 ScJhAqd_1%3BHRSCAF=2.gff
drwxr-xr-x 2 guptapa pi 75 Oct 5 15:51 theVoid.ScJhAqd_1%3BHRSCAF=2/
drwxr-xr-x 3 guptapa pi 5 Oct 5 15:51 ./

The GFF looks like:

[Linux at waterman ScJhAqd_1%3BHRSCAF=2]$ head ScJhAqd_1%3BHRSCAF=2.gff
##gff-version 3
ScJhAqd_1%3BHRSCAF%3D2 . contig 1 2578 . . . ID=ScJhAqd_1%3BHRSCAF%3D2;Name=ScJhAqd_1%3BHRSCAF%3D2;
ScJhAqd_1%3BHRSCAF%3D2 blastx protein_match 1782 2024 175 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hit:0;Name=Mlong585_29391-RA;
ScJhAqd_1%3BHRSCAF%3D2 blastx match_part 1782 2024 175 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hsp:0;Parent=ScJhAqd_1%3BHRSCAF%3D2:hit:0;Name=Mlong585_29391-RA;Target=Mlong585_29391-RA 132 212;Gap=M81;
ScJhAqd_1%3BHRSCAF%3D2 blastx protein_match 1785 2578 175 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hit:1;Name=Mlong585_37101-RA;
ScJhAqd_1%3BHRSCAF%3D2 blastx match_part 2477 2578 112 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hsp:1;Parent=ScJhAqd_1%3BHRSCAF%3D2:hit:1;Name=Mlong585_37101-RA;Target=Mlong585_37101-RA 28 61;Gap=M34;
ScJhAqd_1%3BHRSCAF%3D2 blastx match_part 1785 2042 175 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hsp:2;Parent=ScJhAqd_1%3BHRSCAF%3D2:hit:1;Name=Mlong585_37101-RA;Target=Mlong585_37101-RA 154 239;Gap=M86;
ScJhAqd_1%3BHRSCAF%3D2 blastx protein_match 1806 2578 128 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hit:2;Name=Mlong585_11451-RA;
ScJhAqd_1%3BHRSCAF%3D2 blastx match_part 2471 2578 132 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hsp:3;Parent=ScJhAqd_1%3BHRSCAF%3D2:hit:2;Name=Mlong585_11451-RA;Target=Mlong585_11451-RA 117 152;Gap=M36;
ScJhAqd_1%3BHRSCAF%3D2 blastx match_part 2299 2379 89 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hsp:4;Parent=ScJhAqd_1%3BHRSCAF%3D2:hit:2;Name=Mlong585_11451-RA;Target=Mlong585_11451-RA 153 179;Gap=M27;

Regards,
Parul

From carson.holt at genetics.utah.edu Mon Oct 8 11:45:31 2018
From: carson.holt at genetics.utah.edu (Carson Hinton Holt)
Date: Mon, 8 Oct 2018 17:45:31 +0000
Subject: [maker-devel] maker problem
In-Reply-To: <90D768D2-F911-4BA3-A8C5-1DAE79566114@oregonstate.edu>
References: <189553EC-9D1F-4C2F-8672-3562C8A4A088@oregonstate.edu> <177AD833-BA97-40CD-B500-1AD4531DE41A@gmail.com> <258E6D0D-6A34-42E2-91F3-F7693ED42E7C@genetics..utah.edu> <90D768D2-F911-4BA3-A8C5-1DAE79566114@oregonstate.edu>
Message-ID:

Look at the GFF3, particularly the gene/mRNA/exon/CDS vs match/match_part features (GFF3 spec).
Does your GFF3 contain gene/mRNA/exon/CDS entries? If not, then your GFF3 has no models (it's empty even if it does contain match/match_part entries). This means either 1. no predictor was set during the run (i.e. est2genome=1 or protein2genome=1 was not set), or 2. the evidence alignments or the assembly are so poor that no models can be made.

Look at the results in a browser. Compare what you see on one of your contigs to what you get when running an example from the tutorials. Perhaps you provided unassembled mRNA-seq data (MAKER does not process raw mRNA-seq; it must be assembled first). Perhaps you did not provide a broad protein dataset (UniProt/Swiss-Prot is usually a good one to use, for example). Or perhaps your assembly is too fragmented and has too many runs of NNNNNN to generate matching ORFs against evidence alignments (look at results in a browser).

--Carson

From carson.holt at genetics.utah.edu Mon Oct 8 12:08:49 2018
From: carson.holt at genetics.utah.edu (Carson Hinton Holt)
Date: Mon, 8 Oct 2018 18:08:49 +0000
Subject: [maker-devel] maker problem
In-Reply-To:
References: <189553EC-9D1F-4C2F-8672-3562C8A4A088@oregonstate.edu> <177AD833-BA97-40CD-B500-1AD4531DE41A@gmail.com> <258E6D0D-6A34-42E2-91F3-F7693ED42E7C@genetics..utah.edu> <90D768D2-F911-4BA3-A8C5-1DAE79566114@oregonstate.edu>
Message-ID:

Also run BUSCO on your assembly.
It will give you an estimate of how complete/incomplete your genome assembly is. Also make sure you are running on a genome assembly and not a transcriptome assembly (MAKER does not annotate transcriptomes).

--Carson

On Oct 8, 2018, at 11:45 AM, Carson Holt wrote:

Look at the GFF3, particularly the gene/mRNA/exon/CDS vs. match/match_part features (GFF3 spec). Does your GFF3 contain gene/mRNA/exon/CDS entries? If not, then your GFF3 has no models (it's empty even if it does contain match/match_part entries). This means either

1. no predictor was set during the run (i.e. est2genome=1 or protein2genome=1 not set), or
2. the evidence alignments or assembly are so poor that no models can be made.

Look at the results in a browser. Compare what you see on one of your contigs to what you get when running an example from the tutorials. Perhaps you provided unassembled mRNA-seq data (MAKER does not process raw mRNA-seq; it must be assembled first). Perhaps you did not provide a broad protein dataset (UniProt/Swiss-Prot is usually a good one to use, for example). Or perhaps your assembly is too fragmented and has too many runs of NNNNNN to generate matching ORFs against evidence alignments (look at results in a browser).

--Carson

On Oct 8, 2018, at 11:31 AM, Gupta, Parul wrote:

Alright, I had gone through all those tutorials. But my question is: why is MAKER generating only GFF as output? There is neither transcripts.fasta nor proteins.fasta in the output directory. So I can only use gff3_merge but not fasta_merge, because there are no fasta files. This happened for all scaffolds.

Below is an example from my datastore_index.log file for one scaffold:

ScJhAqd_1;HRSCAF=2 Sh_masked_rd2_datastore/18/62/ScJhAqd_1%3BHRSCAF=2/ STARTED
ScJhAqd_1;HRSCAF=2 Sh_masked_rd2_datastore/18/62/ScJhAqd_1%3BHRSCAF=2/ FINISHED

The output directory of that scaffold looks like:

[Linux@waterman ScJhAqd_1%3BHRSCAF=2]$ ll
total 160
drwxr-xr-x 3 guptapa pi 3 Oct 5 15:51 ../
-rw-r--r-- 1 guptapa pi 27740 Oct 5 15:51 run.log
-rw-r--r-- 1 guptapa pi 34268 Oct 5 15:51 ScJhAqd_1%3BHRSCAF=2.gff
drwxr-xr-x 2 guptapa pi 75 Oct 5 15:51 theVoid.ScJhAqd_1%3BHRSCAF=2/
drwxr-xr-x 3 guptapa pi 5 Oct 5 15:51 ./

The GFF looks like:

[Linux@waterman ScJhAqd_1%3BHRSCAF=2]$ head ScJhAqd_1%3BHRSCAF=2.gff
##gff-version 3
ScJhAqd_1%3BHRSCAF%3D2 . contig 1 2578 . . . ID=ScJhAqd_1%3BHRSCAF%3D2;Name=ScJhAqd_1%3BHRSCAF%3D2;
ScJhAqd_1%3BHRSCAF%3D2 blastx protein_match 1782 2024 175 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hit:0;Name=Mlong585_29391-RA;
ScJhAqd_1%3BHRSCAF%3D2 blastx match_part 1782 2024 175 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hsp:0;Parent=ScJhAqd_1%3BHRSCAF%3D2:hit:0;Name=Mlong585_29391-RA;Target=Mlong585_29391-RA 132 212;Gap=M81;
ScJhAqd_1%3BHRSCAF%3D2 blastx protein_match 1785 2578 175 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hit:1;Name=Mlong585_37101-RA;
ScJhAqd_1%3BHRSCAF%3D2 blastx match_part 2477 2578 112 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hsp:1;Parent=ScJhAqd_1%3BHRSCAF%3D2:hit:1;Name=Mlong585_37101-RA;Target=Mlong585_37101-RA 28 61;Gap=M34;
ScJhAqd_1%3BHRSCAF%3D2 blastx match_part 1785 2042 175 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hsp:2;Parent=ScJhAqd_1%3BHRSCAF%3D2:hit:1;Name=Mlong585_37101-RA;Target=Mlong585_37101-RA 154 239;Gap=M86;
ScJhAqd_1%3BHRSCAF%3D2 blastx protein_match 1806 2578 128 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hit:2;Name=Mlong585_11451-RA;
ScJhAqd_1%3BHRSCAF%3D2 blastx match_part 2471 2578 132 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hsp:3;Parent=ScJhAqd_1%3BHRSCAF%3D2:hit:2;Name=Mlong585_11451-RA;Target=Mlong585_11451-RA 117 152;Gap=M36;
ScJhAqd_1%3BHRSCAF%3D2 blastx match_part 2299 2379 89 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hsp:4;Parent=ScJhAqd_1%3BHRSCAF%3D2:hit:2;Name=Mlong585_11451-RA;Target=Mlong585_11451-RA 153 179;Gap=M27;

Regards,
Parul

On Oct 8, 2018, at 11:34 AM, Carson Hinton Holt wrote:

GFF3 should have the assembly fasta at the bottom. That is part of the format. Please familiarize yourself with GFF3 here -> https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md

Particularly look at the different kinds of expected features (for example, gene/mRNA/exon/CDS gene models vs. match/match_part evidence alignments).

Also you need to familiarize yourself with the MAKER documentation, and perhaps follow one of the step-by-step tutorials in the MAKER wiki (http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Main_Page). The 2014 tutorial has a video you can follow along with. Output files are described in the documentation and the wiki. Particularly look at the necessary gff3_merge and fasta_merge scripts described in the wiki with multiple examples.

Individual contigs will have results like so ->

contig-dpp-500-500.gff
contig-dpp-500-500.maker.proteins.fasta
contig-dpp-500-500.maker.transcripts.fasta

The merge scripts will collect all the individual contig results into merged files. Example datasets for all of the wiki tutorials are included in the .../maker/data directory as well as the .../maker/MWAS/data/ directory (you can use them to follow along with the wiki pages).

If you follow the tutorial steps for training SNAP on a new genome and you get empty training files, then the issue is the evidence training sets you gave (example from the e-mail list archive) -> https://groups.google.com/forum/#!searchin/maker-devel/maker2zff%7Csort:date/maker-devel/TculOM5oxl4/UWENIGN7EQAJ

You can also browse through the archive for more info on training SNAP and Augustus.
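The distinction drawn in this message, gene-model features (gene/mRNA/exon/CDS) versus evidence alignments (match/match_part, and the protein_match rows in the head output above), can be checked by tallying column 3 of the GFF3. The following is a minimal Python sketch, not part of MAKER: the function names count_feature_types, count_evidence and has_gene_models are hypothetical, the sample rows are abbreviated from the listing above, and it assumes tab-separated columns and a ##FASTA directive before the embedded sequence, as the GFF3 specification describes.

```python
from collections import Counter

# Feature types named in the thread: gene models vs. evidence alignments.
MODEL_TYPES = {"gene", "mRNA", "exon", "CDS"}
EVIDENCE_TYPES = {"match", "match_part", "protein_match", "expressed_sequence_match"}


def count_feature_types(gff_lines):
    """Count column-3 feature types in GFF3 lines, stopping at ##FASTA."""
    counts = Counter()
    for line in gff_lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # embedded assembly sequence follows; no more features
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        counts[cols[2]] += 1
    return counts


def count_evidence(counts):
    """Total number of evidence-alignment features."""
    return sum(counts[t] for t in EVIDENCE_TYPES)


def has_gene_models(counts):
    """True if at least one gene/mRNA/exon/CDS feature was seen."""
    return any(counts[t] > 0 for t in MODEL_TYPES)


# Abbreviated rows modelled on the head output above (attributes shortened).
SEQ = "ScJhAqd_1%3BHRSCAF%3D2"
SAMPLE = [
    "##gff-version 3",
    "\t".join([SEQ, ".", "contig", "1", "2578", ".", ".", ".", "ID=" + SEQ]),
    "\t".join([SEQ, "blastx", "protein_match", "1782", "2024", "175", "-", ".",
               "ID=" + SEQ + ":hit:0"]),
    "\t".join([SEQ, "blastx", "match_part", "1782", "2024", "175", "-", ".",
               "ID=" + SEQ + ":hsp:0;Parent=" + SEQ + ":hit:0"]),
    "##FASTA",
    ">" + SEQ,
    "ACGT",
]

counts = count_feature_types(SAMPLE)
print(dict(counts))
print("evidence alignments:", count_evidence(counts))
print("gene models present:", has_gene_models(counts))
```

On a real run the same function could be pointed at a per-contig .gff or at the merged all.gff, for example count_feature_types(open("all.gff")). A file that has evidence rows but no gene/mRNA/exon/CDS rows would be consistent with the empty genome.ann and genome.dna files reported in this thread, since there are no models to convert.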
--Carson

On Oct 8, 2018, at 10:12 AM, Gupta, Parul wrote:

Hi Carson,

As per your suggestion, I turned on est2genome=1 and protein2genome=1, but similar results are generated. The GFF of each scaffold has fasta (transcripts) sequence at the end, instead of transcripts.fasta and protein.fasta being generated separately. I don't know how to use such GFFs for further processing, such as training SNAP (for gene prediction). I need your suggestion. Is there an option to provide trained data from Augustus (generated with standalone Augustus rather than from MAKER) instead of the Augustus species in maker_opts.ctl?

Thanks,
Parul

On Oct 4, 2018, at 6:43 PM, Gupta, Parul wrote:

Thank you Carson.

On Oct 4, 2018, at 3:11 PM, Carson Holt wrote:

You must turn on at least 1 prediction method. It can be est2genome=1, protein2genome=1, or a species file to run SNAP/Augustus. The first two options are for building models to train with. If you don't provide a prediction method, MAKER will align evidence, but you won't get any gene models.

Example: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_WGS_Assembly_and_Annotation_Winter_School_2018#Training_ab_initio_Gene_Predictors

--Carson

On Oct 1, 2018, at 1:05 PM, Gupta, Parul wrote:

Hi Carson,

I am a new user of the MAKER pipeline and want to get gene predictions for a new plant genome. I used the following options in the maker_opts.ctl file for the first round:

genome=masked_genome.fasta
est=transcripts.fasta (from same species for which genome fasta is provided)
altest=transcripts.fasta (from alternative organism)
protein=proteins.fasta

The output files are only GFF (no fasta); however, the GFF for each scaffold has fasta sequences at the bottom. I wonder, is that the correct output? In order to train SNAP, I used gff3_merge to concatenate all GFFs from datastore_index.log to get all.gff (which also has fasta sequences). Then all.gff was used for maker2zff, and it generated zero-size files (genome.ann and genome.dna). I am wondering whether I made a mistake or did not provide all input files. For repeat masking I used RepeatMasker separately from the MAKER pipeline. My datastore_index.log file shows many "RETRY" and "FAILED" scaffolds.

FYI, I subscribed to the "maker-devel" Google group, but the "new topic" button is greyed out.

Your suggestions?

Thanks in advance.
Parul

From Parul.Gupta at oregonstate.edu Mon Oct 8 13:12:27 2018
From: Parul.Gupta at oregonstate.edu (Gupta, Parul)
Date: Mon, 8 Oct 2018 19:12:27 +0000
Subject: [maker-devel] maker problem
In-Reply-To:
References: <189553EC-9D1F-4C2F-8672-3562C8A4A088@oregonstate.edu> <177AD833-BA97-40CD-B500-1AD4531DE41A@gmail.com> <258E6D0D-6A34-42E2-91F3-F7693ED42E7C@genetics..utah.edu> <90D768D2-F911-4BA3-A8C5-1DAE79566114@oregonstate.edu>
Message-ID:

OK, let me explain my case.

Genome- eukaryote
We had run BUSCO and there is no problem in the genome assembly. I used RepeatMasker (separately from the MAKER pipeline) for masking the repeats using a custom-generated library (de novo repeats and a repeat library from other species as well). The masked genome was used as input in maker_opts.ctl.

Transcripts-
We have RNA-Seq data assembled using Velvet/Oases from the same species as the sequenced genome. I globally aligned the transcripts to the assembled genome using GMAP, which gave ~99% mapping. The GFF3 generated by GMAP was also checked in a genome browser. Those transcripts were used as est input in maker_opts.ctl. These assembled transcripts may have redundancy.

Proteins-
I used protein sequences (fasta) downloaded from UniProt for 5 closely related species and one set from an in-house sequenced genome (already published). Protein sequences from all 6 organisms are concatenated in one file and used as protein evidence in maker_opts.ctl.

altest=transcripts.fasta (from in-house sequenced genome (already published))
est2genome=1
protein2genome=1

Sorry for not explaining my case initially.
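The advice earlier in this thread, that at least one prediction method must be switched on, can be turned into a quick sanity check on the control file before launching a run. The following is a minimal Python sketch under stated assumptions: parse_ctl and enabled_predictors are hypothetical helper names, not MAKER tools; est2genome and protein2genome come from the thread, while snaphmm and augustus_species are assumed here to be the option names for the SNAP HMM file and the Augustus species; the two control-file snippets are illustrative and are not the actual files used in this thread.

```python
def parse_ctl(text):
    """Parse 'key=value #comment' lines into a dict of strings."""
    opts = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        opts[key.strip()] = value.strip()
    return opts


def enabled_predictors(opts):
    """List the prediction methods that appear to be switched on."""
    enabled = []
    # Evidence-based model building: a 0/1 switch.
    for key in ("est2genome", "protein2genome"):
        if opts.get(key, "0") == "1":
            enabled.append(key)
    # Ab initio predictors: enabled when a file or species name is given.
    # (Option names assumed; check them against your own maker_opts.ctl.)
    for key in ("snaphmm", "augustus_species"):
        if opts.get(key, ""):
            enabled.append(key)
    return enabled


# Illustrative first-round settings with no predictor enabled.
FIRST_ROUND = """\
genome=masked_genome.fasta
est=transcripts.fasta
protein=proteins.fasta
est2genome=0 #assumed value, not stated in the thread
protein2genome=0 #assumed value, not stated in the thread
"""

# Illustrative settings after switching on both evidence-based methods.
SECOND_ROUND = """\
genome=masked_genome.fasta
est=transcripts.fasta
protein=proteins.fasta
est2genome=1
protein2genome=1
"""

for name, text in (("first round", FIRST_ROUND), ("second round", SECOND_ROUND)):
    found = enabled_predictors(parse_ctl(text))
    if found:
        print(name + ": predictors enabled:", ", ".join(found))
    else:
        print(name + ": no predictor enabled; expect alignments but no gene models")
```

A check like this only confirms that a method is requested. As the later replies point out, models can still be missing when the alignments do not yield open reading frames.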
What can be other files I can use as est evidence? Can I use Augustus generated hints for gene prediction along with above options?

Your thoughts??

Parul

From carsonhh at gmail.com Mon Oct 8 14:11:26 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 8 Oct 2018 14:11:26 -0600
Subject: [maker-devel] maker problem
In-Reply-To:
References: <189553EC-9D1F-4C2F-8672-3562C8A4A088@oregonstate.edu> <177AD833-BA97-40CD-B500-1AD4531DE41A@gmail.com> <258E6D0D-6A34-42E2-91F3-F7693ED42E7C@genetics..utah.edu> <90D768D2-F911-4BA3-A8C5-1DAE79566114@oregonstate.edu>
Message-ID: <41ABA575-D58A-4FE0-83CD-9312617AA635@gmail.com>

> We had run BUSCO and there is no problem in the genome assembly. I used RepeatMasker (separately from the MAKER pipeline) for masking the repeats using a custom-generated library (de novo repeats and a repeat library from other species as well). The masked genome was used as input in maker_opts.ctl.

Let MAKER run masking if possible. Also BUSCO can be used to train Augustus which can then become the gene predictor in MAKER.

> Transcripts-
> We have RNA-Seq data assembled using Velvet/Oases from the same species as the sequenced genome. I globally aligned the transcripts to the assembled genome using GMAP, which gave ~99% mapping. The GFF3 generated by GMAP was also checked in a genome browser. Those transcripts were used as est input in maker_opts.ctl. These assembled transcripts may have redundancy.

est2genome doesn't work with est_gff. You must provide fasta of assembled transcripts. You can revert back to the GFF3 if you want after training.

> Proteins-
> I used protein sequences (fasta) downloaded from UniProt for 5 closely related species and one set from an in-house sequenced genome (already published).
> Protein sequences from all 6 organisms are concatenated in one file and used as protein evidence in maker_opts.ctl.

Look at the contigs in a browser. Find a contig with protein2genome results in the GFF3 (i.e. the source column is marked protein2genome in the GFF3), and look at it specifically. If you don't find any, then the issue is either your pre-masking or the evidence proteins you gave. I'd recommend using UniProt/Swiss-Prot, which contains a broad set of curated and conserved proteins.

> altest=transcripts.fasta (from in-house sequenced genome (already published))

These will be ignored until you have a trained HMM (this type of alignment can only be used as hints to the trained predictor).

--Carson

From liorglic at mail.tau.ac.il Wed Oct 17 08:27:06 2018
From: liorglic at mail.tau.ac.il (Lior Glick)
Date: Wed, 17 Oct 2018 17:27:06 +0300
Subject: [maker-devel] Problem compiling MAKER with Intel MPI
Message-ID:

Hello,

I am trying to compile MAKER with Intel MPI. We are using a cluster based on the Intel x86_64 architecture, with lmod for environment variables. All required dependencies have already been installed and the initial 'perl Build.PL' passes without issues (see attached). Running './Build install' always fails to find 'sys/types.h' and exits (see additional attachment). The Build command probably searches for the '/usr/include/sys/types.h' file, but no matter which variable (INCLUDE, PERL5LIB, etc.) I update with the required path (either '/usr/include' or '/usr/include/sys'), it keeps failing.

I would appreciate your input.
Thanks a lot!

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Build.PL.out
Type: application/octet-stream
Size: 2032 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Build_install.out
Type: application/octet-stream
Size: 6312 bytes
Desc: not available
URL:

From anthony.bretaudeau at inria.fr Thu Oct 18 07:52:03 2018
From: anthony.bretaudeau at inria.fr (Anthony Bretaudeau)
Date: Thu, 18 Oct 2018 15:52:03 +0200
Subject: [maker-devel] Segfault with OpenMPI
In-Reply-To: <8D151E3B-353F-4FD5-94DB-95C1125A8176@gmail.com>
References: <8D151E3B-353F-4FD5-94DB-95C1125A8176@gmail.com>
Message-ID:

An HTML attachment was scrubbed...

From Parul.Gupta at oregonstate.edu Mon Oct 8 14:40:06 2018
From: Parul.Gupta at oregonstate.edu (Gupta, Parul)
Date: Mon, 8 Oct 2018 20:40:06 +0000
Subject: [maker-devel] maker problem
In-Reply-To: <41ABA575-D58A-4FE0-83CD-9312617AA635@gmail.com>
References: <189553EC-9D1F-4C2F-8672-3562C8A4A088@oregonstate.edu> <177AD833-BA97-40CD-B500-1AD4531DE41A@gmail.com> <258E6D0D-6A34-42E2-91F3-F7693ED42E7C@genetics..utah.edu> <90D768D2-F911-4BA3-A8C5-1DAE79566114@oregonstate.edu> <41ABA575-D58A-4FE0-83CD-9312617AA635@gmail.com>
Message-ID: <20878280-1B0C-4CC5-BD92-20FB57A44662@oregonstate.edu>

I used Augustus to generate a training set (separately from MAKER) based on transcripts (fasta), so how can I use that Augustus-generated trained data (hints in GFF3 format) in MAKER for gene prediction? I can see only the Augustus species option in maker_opts.ctl. Which option do I need to turn on in maker_opts.ctl to supply the Augustus-generated hints file? I have augustus.gff as predicted hints.

> est2genome doesn't work with est_gff. You must provide fasta of assembled transcripts. You can revert back to the GFF3 if you want after training.

I used est_fasta, not est_gff.

> Find a contig with protein2genome results in the GFF3

Yes, I can see protein2genome results in the GFF3:

ScJhAqd_2184%3BHRSCAF%3D3164 protein2genome protein_match 31566 32621 1426 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446673;Name=Mlong585_07911-RA;
ScJhAqd_2184%3BHRSCAF%3D3164 protein2genome match_part 31566 31775 1426 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1532540;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446673;Name=Mlong585_07911-RA;Target=Mlong585_07911-RA 82 154;Gap=M14 I3 M56;
ScJhAqd_2184%3BHRSCAF%3D3164 protein2genome match_part 31872 32621 1426 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1532541;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446673;Name=Mlong585_07911-RA;Target=Mlong585_07911-RA 155 409;Gap=M126 I5 M124;
ScJhAqd_2184%3BHRSCAF%3D3164 protein2genome protein_match 33816 35829 1394 - . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446674;Name=Mlong585_12901-RA;
ScJhAqd_2184%3BHRSCAF%3D3164 protein2genome match_part 34916 35829 1394 - . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1532542;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446674;Name=Mlong585_12901-RA;Target=Mlong585_12901-RA 41 343;Gap=M27 D1 M276 F2;
ScJhAqd_2184%3BHRSCAF%3D3164 protein2genome match_part 33816 34182 1394 - . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1532543;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446674;Name=Mlong585_12901-RA;Target=Mlong585_12901-RA 344 466;Gap=R2 M123;
ScJhAqd_2184%3BHRSCAF%3D3164 protein2genome protein_match 49636 51466 1091 - . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446675;Name=Mlong585_07901-RA;
ScJhAqd_2184%3BHRSCAF%3D3164 protein2genome match_part 51354 51466 1091 - . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1532544;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446675;Name=Mlong585_07901-RA;Target=Mlong585_07901-RA 1 36;Gap=M20 D1 M16 F2;

and est2genome results in the GFF3 as well:

ScJhAqd_2184%3BHRSCAF%3D3164 est2genome expressed_sequence_match 48887305 48890708 16239 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547163;Name=Sh_Salba_v2_61181;
ScJhAqd_2184%3BHRSCAF%3D3164 est2genome match_part 48887305 48889881 16239 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1871792;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547163;Name=Sh_Salba_v2_61181;Target=Sh_Salba_v2_61181 1 2590 +;Gap=M285 D1 M288 I10 M5 I4 M1998;
ScJhAqd_2184%3BHRSCAF%3D3164 est2genome match_part 48889982 48890708 16239 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1871793;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547163;Name=Sh_Salba_v2_61181;Target=Sh_Salba_v2_61181 2591 3317 +;Gap=M727;
ScJhAqd_2184%3BHRSCAF%3D3164 est2genome expressed_sequence_match 48887305 48890708 16412 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547164;Name=Sh_Salba_v2_61182;
ScJhAqd_2184%3BHRSCAF%3D3164 est2genome match_part 48887305 48889881 16412 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1871794;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547164;Name=Sh_Salba_v2_61182;Target=Sh_Salba_v2_61182 1 2590 +;Gap=M285 D1 M288 I10 M5 I4 M1998;
ScJhAqd_2184%3BHRSCAF%3D3164 est2genome match_part 48889949 48890708 16412 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1871795;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547164;Name=Sh_Salba_v2_61182;Target=Sh_Salba_v2_61182 2591 3350 +;Gap=M760;
ScJhAqd_2184%3BHRSCAF%3D3164 est2genome expressed_sequence_match 48895479 48899036 9582 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547165;Name=Sh_Salba_v2_108280;

Thanks,
Parul

From peachandolives at gmail.com Fri Oct 12 02:23:07 2018
From: peachandolives at gmail.com (Linnie Linnie)
Date: Fri, 12 Oct 2018 10:23:07 +0200
Subject: [maker-devel] maker-level google group
Message-ID:

Dear maker team,

I hope this email finds you well. I am a member of the maker-devel Google group but, somehow, I cannot post questions. Is there anything I can do on my end to fix this?

Also, I was wondering where I can download maker3 (I cannot seem to find it online). I have been using maker2, but I wanted to use EVM, and I have read that maker3 implements it.

Thank you so much for your help,
Linnie

From yli at utexas.edu Tue Oct 16 22:49:13 2018
From: yli at utexas.edu (Yiyuan Li)
Date: Tue, 16 Oct 2018 23:49:13 -0500
Subject: [maker-devel] Speed up maker annotation on long scaffolds
Message-ID: <4361720F-0F1B-43DA-8931-218CCCD71AF4@utexas.edu>

Dear Maker support,

I have a quick question about annotating chromosome-level scaffolds.
I have a new genome assembly from Hi-C data. The top 4 scaffolds are chromosome-level, ~100-170 Mbp long. I tried to use MAKER with MPI, but it runs slowly. Each scaffold has been running for weeks. I was wondering if you have any suggestions on how to make the annotation process faster?

Thank you!

YY

From peachandolives at gmail.com Thu Oct 18 02:29:57 2018
From: peachandolives at gmail.com (Linnie Linnie)
Date: Thu, 18 Oct 2018 10:29:57 +0200
Subject: [maker-devel] maker3
Message-ID:

Dear maker team,

I am trying to run maker and use its input for EVM. From the EVM website, I gather that I need to provide it with .gff files. Maker2 does output one .gff, but I was wondering how to produce .gff files for the proteins and EST data.

Alternatively, I have read that maker3 implements EVM. I would be happy to try this option, but I don't know where I can download maker3 from.

I would appreciate any help. Thank you very much!
Linnie

From carsonhh at gmail.com Fri Oct 19 11:02:22 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 19 Oct 2018 11:02:22 -0600
Subject: [maker-devel] maker problem
In-Reply-To: <20878280-1B0C-4CC5-BD92-20FB57A44662@oregonstate.edu>
References: <189553EC-9D1F-4C2F-8672-3562C8A4A088@oregonstate.edu> <177AD833-BA97-40CD-B500-1AD4531DE41A@gmail.com> <258E6D0D-6A34-42E2-91F3-F7693ED42E7C@genetics..utah.edu> <90D768D2-F911-4BA3-A8C5-1DAE79566114@oregonstate.edu> <41ABA575-D58A-4FE0-83CD-9312617AA635@gmail.com> <20878280-1B0C-4CC5-BD92-20FB57A44662@oregonstate.edu>
Message-ID: <3F78E884-11AF-4291-A8FC-D81F6F55B47D@gmail.com>

Once Augustus is trained, it will have a new species directory under .../augustus/config/species/ for the organism you just trained. Or, if you trained Augustus elsewhere (website, BUSCO, etc.), you have to copy the species data there.
Then you just supply the species name and Augustus automatically finds it (see Augustus documentation on training). For est2genome=1 and protein2genome=1, MAKER takes the alignments from exonerate protein2genome and est2genome and if they are mostly open reading frame, just turns them directly into gene/mRNA/exon/CDS models. If there are none of those in the resulting GFF3 but there are est2genome and protein2genome alignments then all of them have broken ORF. That means there are serious issues with your assembly, or with the est fasta or protein fasta file. For a protein fasta, I recomend using uniprot/swissprot because it is manually curated and contains a broad dataset. But if you cannot get gene models from uniprot/swissprot protein2genome alignments, then your assembly has issues (either too fragmented, lots of errors inducing random stop codons, or lots of N?s interspersed in the sequence). ?Carson > On Oct 8, 2018, at 2:40 PM, Gupta, Parul wrote: > > I used Augustus to generate training set (separately from maker) based on transcripts (fasta) so how I can use that Augustus generated trained data (hints in gff3 format) in maker for gene prediction? I can see only Augustus species option there in maker_opts.ctl. Which option I need to turn on in opts.ctl to put Augustus generated hints file? I have augustus.gff as predicted hints. > >> est2genome doesn't work with est_gff. You must provide fasta of assembled transcripts. You can revert back to the GFF3 if you want after training. > > I used est_fasta not the est_gff. > >> Find a contig with protein2genome results in the GFF3 > > yes I can see protein2genome results in gff3: > > ScJhAqd_2184%3BHRSCAF%3D3164 > protein2genome > protein_match 31566 32621 > 1426 + > . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446673;Name=Mlong585_07911-RA; > ScJhAqd_2184%3BHRSCAF%3D3164 > protein2genome > match_part 31566 > 31775 1426 > + . 
> ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1532540;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446673;Name=Mlong585_07911-RA;Target=Mlong585_07911-RA 82 154;Gap=M14 I3 M56; > ScJhAqd_2184%3BHRSCAF%3D3164 > protein2genome > match_part 31872 > 32621 1426 > + . > ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1532541;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446673;Name=Mlong585_07911-RA;Target=Mlong585_07911-RA 155 409;Gap=M126 I5 M124; > ScJhAqd_2184%3BHRSCAF%3D3164 > protein2genome > protein_match 33816 35829 > 1394 - > . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446674;Name=Mlong585_12901-RA; > ScJhAqd_2184%3BHRSCAF%3D3164 > protein2genome > match_part 34916 > 35829 1394 > - . > ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1532542;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446674;Name=Mlong585_12901-RA;Target=Mlong585_12901-RA 41 343;Gap=M27 D1 M276 F2; > ScJhAqd_2184%3BHRSCAF%3D3164 > protein2genome > match_part 33816 > 34182 1394 > - . > ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1532543;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446674;Name=Mlong585_12901-RA;Target=Mlong585_12901-RA 344 466;Gap=R2 M123; > ScJhAqd_2184%3BHRSCAF%3D3164 > protein2genome > protein_match 49636 51466 > 1091 - > . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446675;Name=Mlong585_07901-RA; > ScJhAqd_2184%3BHRSCAF%3D3164 > protein2genome > match_part 51354 > 51466 1091 > - . > ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1532544;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446675;Name=Mlong585_07901-RA;Target=Mlong585_07901-RA 1 36;Gap=M20 D1 M16 F2; > > and est2genome in gff3 as well: > > ScJhAqd_2184%3BHRSCAF%3D3164 > est2genome > expressed_sequence_match > 48887305 48890708 > 16239 + > . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547163;Name=Sh_Salba_v2_61181; > ScJhAqd_2184%3BHRSCAF%3D3164 > est2genome > match_part 48887305 > 48889881 16239 > + . 
> ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1871792;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547163;Name=Sh_Salba_v2_61181;Target=Sh_Salba_v2_61181 1 2590 +;Gap=M285 D1 M288 I10 M5 I4 M1998; > ScJhAqd_2184%3BHRSCAF%3D3164 > est2genome > match_part 48889982 > 48890708 16239 > + . > ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1871793;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547163;Name=Sh_Salba_v2_61181;Target=Sh_Salba_v2_61181 2591 3317 +;Gap=M727; > ScJhAqd_2184%3BHRSCAF%3D3164 > est2genome > expressed_sequence_match > 48887305 48890708 > 16412 + > . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547164;Name=Sh_Salba_v2_61182; > ScJhAqd_2184%3BHRSCAF%3D3164 > est2genome > match_part 48887305 > 48889881 16412 > + . > ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1871794;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547164;Name=Sh_Salba_v2_61182;Target=Sh_Salba_v2_61182 1 2590 +;Gap=M285 D1 M288 I10 M5 I4 M1998; > ScJhAqd_2184%3BHRSCAF%3D3164 > est2genome > match_part 48889949 > 48890708 16412 > + . > ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1871795;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547164;Name=Sh_Salba_v2_61182;Target=Sh_Salba_v2_61182 2591 3350 +;Gap=M760; > ScJhAqd_2184%3BHRSCAF%3D3164 > est2genome > expressed_sequence_match > 48895479 48899036 > 9582 + > . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547165;Name=Sh_Salba_v2_108280; > > Thanks, > Parul > >> On Oct 8, 2018, at 3:11 PM, Carson Holt > wrote: >> >> >>> We had run BUSCO and there is no problem in genome assembly. I used RepeatMasker (separately from maker pipeline) for masking the repeats using custom generated library (denovo repeats and repeat library from other species as well). The masked genome was used as input in maker_opts.ctl. >> >> Let MAKER run masking if possible. Also BUSCO can be used to train Augustus which can then become the gene predictor in MAKER. >> >> >>> Transcripts- >>> We have RNA-Seq data assembled using velvet /oases from the same species as for genome sequenced. 
I globally aligned transcripts over assembled genome using GMAP, which gave ~99% mapping. Gff3 generated from GMAP was also checked on genome browser. Those transcripts were used as est input in maker_opts.ctl. These assembled transcripts may have redundancy.
>>
>> est2genome doesn't work with est_gff. You must provide a FASTA file of assembled transcripts. You can revert back to the GFF3 after training if you want.
>>
>>
>>> Proteins-
>>> I used protein (fasta seq) sequences downloaded from uniprot for 5 closely related species and one from in-house sequenced genome (already published). Protein sequences from all 6 organisms are concatenated in one file and used as protein evidence in maker_opts.ctl.
>>
>> Look at the contigs in a browser. Find a contig with protein2genome results in the GFF3 (i.e. the column is marked protein2genome in the GFF3), and look at it specifically. If you don't find any, then the issue is either your pre-masking or the evidence proteins you gave. I'd recommend using UniProt/Swiss-Prot, which contains a broad set of curated and conserved proteins.
>>
>>
>>> atleast=transcripts.fasta (from in-house sequenced genome (already published))
>>
>> These will be ignored until you have a trained HMM (this type of alignment can only be used as hints to the trained predictor).
>>
>> --Carson
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From carsonhh at gmail.com Fri Oct 19 11:09:30 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 19 Oct 2018 11:09:30 -0600
Subject: [maker-devel] Speed up maker annotation on long scaffolds
In-Reply-To: <4361720F-0F1B-43DA-8931-218CCCD71AF4@utexas.edu>
References: <4361720F-0F1B-43DA-8931-218CCCD71AF4@utexas.edu>
Message-ID: <28BAD1D1-77BA-4F50-A54F-7E402589E76F@gmail.com>

You might not have MPI set up correctly. MPI spread across 10 machines (20 cores each) can annotate an entire maize chromosome in ~20 minutes.

A few tests.

# This command should print all the hosts you are running MPI on and how many cores are used on each host. If you don't see multiple hosts, you are not spreading across machines.
mpiexec hostname | sort | uniq -c

# This will let you know if MAKER is running MPI correctly (it should print the help message only once).
mpiexec maker -h

--Carson

> On Oct 16, 2018, at 10:49 PM, Yiyuan Li wrote:
>
> Dear Maker support,
> I have a quick question about annotating chromosome-level scaffolds. I have a new genome assembly from Hi-C data. The top 4 scaffolds are chromosome-level, which are ~100-170M bp long. I tried to use Maker MPI but it runs slow. Each scaffold has been running for weeks. I was wondering if you may have any suggestions on how to make the annotation process faster?
>
> Thank you!
>
> YY
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From carsonhh at gmail.com Fri Oct 19 11:22:12 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 19 Oct 2018 11:22:12 -0600
Subject: [maker-devel] Segfault with OpenMPI
In-Reply-To:
References: <8D151E3B-353F-4FD5-94DB-95C1125A8176@gmail.com>
Message-ID: <78C1AB95-8D23-4D71-939B-5B68666BE5B7@gmail.com>

RepeatMasker does some data prep during installation (it creates new files in the process), and that does not happen with the bioconda RepeatMasker recipe. So it's broken.

To fix it, look at the homebrew recipe for RepeatMasker. It does a good job, and it also preconfigures itself for the free Dfam database rather than RepBase light -->
https://github.com/brewsci/homebrew-bio/blob/master/Formula/repeatmasker.rb

te_proteins is not a RepeatMasker file. It's a RepeatRunner file which has been integrated into MAKER. MAKER just needs to be able to find it. It will look in the .../maker/data/ directory by default and put the location in te_protein= by default.

--Carson

> On Oct 18, 2018, at 7:52 AM, Anthony Bretaudeau wrote:
>
> Hi,
>
> I think I finally found a solution for this segfault. In short: run "export THREADS_DAEMON_MODEL=1" before running maker.
>
> After looking at the debug log, I noticed that the segfault happened the first time the perl system() function was called (usually to launch a "mv" command).
>
> This + the backtrace shows that it has something to do with signal handling when running child process from threads.
>
> After a lot of trials and errors modifying the code, I found this page talking about this env var: https://metacpan.org/pod/forks#Co-existance-with-fork-aware-modules-and-environments
> It seems to be enough to avoid the segfault. I have no idea if it could have any downside, but maker seems to give the same results as in non-mpi mode.
>
> Concerning RepeatMasker not being installed correctly, it seems to be intended as written in the RepeatMasker conda recipe: https://github.com/bioconda/bioconda-recipes/blob/master/recipes/repeatmasker/build.sh#L16
> I use the REPEATMASKER_LIB_DIR env var so it's not really a problem for me, and the galaxy tools is doing the same (https://github.com/galaxyproject/tools-iuc/blob/master/tools/maker/maker.xml#L11 ).
>
> I'm not a RepeatMasker expert, so I don't know if providing the old database would make more sense...
>
> I guess it's the same question for te_proteins.
>
> Cheers
>
> Anthony
>
> On 05/10/2018 at 22:37, Carson Holt wrote:
>> I tried setting this up but there are a number of issues I run into.
>>
>> First RepeatMasker is not being installed correctly. The configuration step should create these files (created by ./configure script during RepeatMasker setup) -->
>> RepeatMasker.lib
>> RepeatMasker.lib.nhr
>> RepeatMasker.lib.nin
>> RepeatMasker.lib.nsq
>> RepeatMaskerLib.embl
>>
>> But they do not exist in the share directory.
>>
>> Also MAKER needs access to the te_proteins file in .../maker/data, and because you have rearranged maker's structure it can't find it.
>>
>> Then for the Segmentation fault, I have seen this a handful of times in the past where users install their own version of perl rather than using the system perl together with their own install of OpenMPI. The issue is some series of flags either in OpenMPI or perl (I'm not sure which). But one way around it is to disable the interpreter threads option when compiling and installing perl for yourself. Most system perl installs have interpreter threads enabled, so I'm not sure why some self-installs generate this segfault and never the system perl. Interestingly, interpreter threads are turned off by default when you install perl manually as they are "officially discouraged". You actually have to enable it during the self-install process, and conda is enabling them on the manual install to match most system perls.
>>
>> Another workaround is to not use OpenMPI. Try MPICH3.
>>
>> --Carson
>>
>>> On Sep 25, 2018, at 6:10 AM, Anthony Bretaudeau wrote:
>>>
>>> Hi,
>>>
>>> I've worked on the Bioconda recipe for Maker (https://github.com/bioconda/bioconda-recipes/tree/master/recipes/maker/ ). It works well, except when using it in MPI mode. I get this segfault error:
>>>
>>> STATUS: Processing and indexing input FASTA files...
>>> [cl1n022:06306] *** Process received signal ***
>>> [cl1n022:06306] Signal: Segmentation fault (11)
>>> [cl1n022:06306] Signal code: Address not mapped (1)
>>> [cl1n022:06306] Failing at address: 0x514
>>> [cl1n022:06306] [ 0] /lib64/libpthread.so.0(+0xf6d0)[0x2b9ce51026d0]
>>> [cl1n022:06306] [ 1] /local/miniconda3/envs/maker-2.31.10/bin/perl(Perl_csighandler+0x1e)[0x4aad4e]
>>> [cl1n022:06306] [ 2] /lib64/libpthread.so.0(+0xf6d0)[0x2b9ce51026d0]
>>> [cl1n022:06306] [ 3] /lib64/libc.so.6(__poll+0x2d)[0x2b9ce5f5cf0d]
>>> [cl1n022:06306] [ 4] /local/miniconda3/envs/maker-2.31.10/perl/lib/auto/Parallel/Application/MPI/../../../../../../lib/./libopen-pal.so.40(+0x869e5)[0x2b9cf05859e5]
>>> [cl1n022:06306] [ 5] /local/miniconda3/envs/maker-2.31.10/perl/lib/auto/Parallel/Application/MPI/../../../../../../lib/./libopen-pal.so.40(opal_libevent2022_event_base_loop+0x242)[0x2b9cf057a73a]
>>> [cl1n022:06306] [ 6] /local/miniconda3/envs/maker-2.31.10/perl/lib/auto/Parallel/Application/MPI/../../../../../../lib/./libopen-pal.so.40(+0x384de)[0x2b9cf05374de]
>>> [cl1n022:06306] [ 7] /lib64/libpthread.so.0(+0x7e25)[0x2b9ce50fae25]
>>> [cl1n022:06306] [ 8] /lib64/libc.so.6(clone+0x6d)[0x2b9ce5f67bad]
>>> [cl1n022:06306] *** End of error message ***
>>> SIGTERM received
>>> SIGTERM received
>>>
>>> As mentioned in older posts, I've tried adding the LD_PRELOAD variable, or running mpirun with the "-mca btl ^openib" option, but it didn't help.
>>>
>>> As this happens with the Bioconda package, I guess it should be pretty reproducible on other setups.
>>>
>>> Bioconda's Maker package uses version 5.26.2 of Perl and version 3.1.2 of OpenMPI, and the OpenMPI recipe is on https://github.com/conda-forge/openmpi-feedstock/tree/master/recipe
>>> Any help would be highly appreciated!
>>>
>>> Anthony Bretaudeau
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From carsonhh at gmail.com Fri Oct 19 11:25:40 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 19 Oct 2018 11:25:40 -0600
Subject: [maker-devel] maker3
In-Reply-To:
References:
Message-ID: <1D30ACCC-1DC4-451E-8553-8AB8ADA269A2@gmail.com>

The MAKER 3 beta is one of the links you get when you register to download MAKER. It will be the link directly under the stable release link -->
http://yandell.topaz.genetics.utah.edu/cgi-bin/maker_license.cgi

Also you can use grep to pull out specific lines of a GFF3 file. Example:

grep -P "\tprotein2genome\t" all.gff > protein2genome.gff

That command will grab all the protein2genome features out of a file.

--Carson

> On Oct 18, 2018, at 2:29 AM, Linnie Linnie wrote:
>
> Dear maker team,
>
> I am trying to run maker and use its input for EVM. From the EVM website, I gather that I need to provide it with .gff files. Maker2 does output one .gff, but I was wondering how to produce .gff files for the proteins and EST data.
>
> Alternatively, I have read that maker3 implements EVM. I would be happy to try this option, but I don't know where I can download maker3 from.
>
> I would appreciate any help. Thank you very much!
>
> Linnie
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From jacques.dainat at nbis.se Tue Oct 23 07:56:09 2018 From: jacques.dainat at nbis.se (Jacques Dainat) Date: Tue, 23 Oct 2018 15:56:09 +0200 Subject: [maker-devel] CIGAR string explanation Message-ID: <8783564C-A8FA-419A-A651-EE53C1563A7F@nbis.se> Hello, Here an example of the cigar string output from exonerate (exactly the same command as launched by MAKER) cigar: P46461.1 3 740 . genome 460484 439594 - 2580 M 84 I 1 D 56 M 154 I 3 M 54 D 1554 M 145 D 3346 M 137 D 120 M 160 D 197 M 182 D 145 M 165 D 415 M 170 D 5037 M 321 D 124 M 158 D 116 M 183 D 1819 M 157 D 5776 M 115 vulgar: P46461.1 3 740 . genome 460484 439594 - 2580 M 28 84 G 1 0 S 0 2 5 0 2 I 0 50 3 0 2 S 1 1 M 51 153 G 3 0 M 18 54 S 0 2 5 0 2 I 0 1548 3 0 2 S 1 1 M 48 144 S 0 1 5 0 2 I 0 3341 3 0 2 S 1 2 M 45 135 S 0 2 5 0 2 I 0 114 3 0 2 S 1 1 M 53 159 S 0 1 5 0 2 I 0 192 3 0 2 S 1 2 M 60 180 5 0 2$ -- completed exonerate analysis and here the result we get in the protein2genome.gff output from MAKER @000426F|arrow|arrow protein2genome protein_match 439595 460484 2580 - . ID=@000426F|arrow|arrow:hit:153696:3.10.0.4;Name=P46461.1;target_length=745;aligned_coverage=98.93;aligned_identity=72.6 @000426F|arrow|arrow protein2genome match_part 460399 460484 2580 - . ID=@000426F|arrow|arrow:hsp:233933:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 4 32;Gap=F2 I1 M28 @000426F|arrow|arrow protein2genome match_part 460135 460344 2580 - . ID=@000426F|arrow|arrow:hsp:233934:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 33 105;Gap=F2 M18 I3 M52 R2 @000426F|arrow|arrow protein2genome match_part 458437 458582 2580 - . ID=@000426F|arrow|arrow:hsp:233935:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 106 154;Gap=F1 M49 R2 @000426F|arrow|arrow protein2genome match_part 454953 455091 2580 - . 
ID=@000426F|arrow|arrow:hsp:233936:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 155 200;Gap=F2 M46 R1 @000426F|arrow|arrow protein2genome match_part 454674 454834 2580 - . ID=@000426F|arrow|arrow:hsp:233937:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 201 254;Gap=F1 M54 R2 @000426F|arrow|arrow protein2genome match_part 454296 454477 2580 - . ID=@000426F|arrow|arrow:hsp:233938:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 255 315;Gap=M61 R1 @000426F|arrow|arrow protein2genome match_part 453985 454150 2580 - . ID=@000426F|arrow|arrow:hsp:233939:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 316 370;Gap=F1 M55 @000426F|arrow|arrow protein2genome match_part 453401 453570 2580 - . ID=@000426F|arrow|arrow:hsp:233940:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 371 427;Gap=M57 R1 @000426F|arrow|arrow protein2genome match_part 448042 448363 2580 - . ID=@000426F|arrow|arrow:hsp:233941:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 428 534;Gap=F1 M107 @000426F|arrow|arrow protein2genome match_part 447761 447918 2580 - . ID=@000426F|arrow|arrow:hsp:233942:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 535 587;Gap=M53 R1 @000426F|arrow|arrow protein2genome match_part 447460 447644 2580 - . ID=@000426F|arrow|arrow:hsp:233943:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 588 648;Gap=F2 M61 @000426F|arrow|arrow protein2genome match_part 445484 445642 2580 - . ID=@000426F|arrow|arrow:hsp:233944:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 649 701;Gap=F2 M53 R2 @000426F|arrow|arrow protein2genome match_part 439595 439709 2580 - . ID=@000426F|arrow|arrow:hsp:233945:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 702 740;Gap=M39 R2 MAKER apparently process the CIGAR string and save it into the Gap attribute. 
The value looks like a CIGAR string but it is different. Here are the different letters we can find (M, D, I, R, F). I guess M=match, D=deletion and I=insertion, but I don't get the meaning of R and F.
Could you explain their meanings?

Best regards,

/Jacques
-------------------------------------------------
Jacques Dainat, Ph.D.
NBIS (National Bioinformatics Infrastructure Sweden)
Genome Annotation Service
http://nbis.se/about/staff/jacques-dainat
http://nbis.se

- Contact -
Address: Uppsala University, Biomedicinska Centrum
Department of Medical Biochemistry Microbiology, Genomics
Husargatan 3, box 582
S-75123 Uppsala Sweden
Phone: +46 18 471 46 25
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From carsonhh at gmail.com Tue Oct 23 09:55:51 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 23 Oct 2018 09:55:51 -0600
Subject: [maker-devel] CIGAR string explanation
In-Reply-To: <8783564C-A8FA-419A-A651-EE53C1563A7F@nbis.se>
References: <8783564C-A8FA-419A-A651-EE53C1563A7F@nbis.se>
Message-ID: <9B7CB8C1-2272-4E2A-A435-73642920B623@gmail.com>

Once upon a time the link in the official GFF3 specification to the cigar string documentation actually worked, and it would bring you to a nice page that explained everything. It described how F and R were to be used on protein space alignments (F is a forward frame shift and R is a reverse frame shift in the alignment). M1 in protein space is actually an amino acid match (it matches 3 bp in nucleotide space); this was previously clear in the now broken link. Likewise, I1 is an amino acid insertion (3 bp in nucleotide space), and D1 is an amino acid deletion (3 bp in nucleotide space). F and R therefore allow for single bp movement either to the left or right within amino acid space.
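
The bookkeeping described above can be checked against the match_part records quoted in this thread. The following is only a minimal sketch, not MAKER's own code: `gap_spans` is a hypothetical helper, and it assumes that M, I and D are counted in amino acids while F and R are counted in base pairs, that I consumes protein residues only, and that R moves the genomic position backwards. Those assumptions are inferred from the records in this thread, not taken from MAKER's source.

```python
import re

def gap_spans(gap):
    """Return (genome_bp, protein_aa) covered by a protein-space Gap string.

    Hypothetical helper; the accounting rules are assumptions inferred
    from the match_part records quoted in this thread.
    """
    genome_bp = 0
    protein_aa = 0
    for op, count in re.findall(r"([MIDFR])(\d+)", gap):
        n = int(count)
        if op == "M":      # amino acid match: 1 residue against 3 bp
            genome_bp += 3 * n
            protein_aa += n
        elif op == "I":    # residues with no genomic counterpart
            protein_aa += n
        elif op == "D":    # codons with no protein counterpart
            genome_bp += 3 * n
        elif op == "F":    # forward frame shift, counted in bp
            genome_bp += n
        elif op == "R":    # reverse frame shift, counted in bp
            genome_bp -= n
    return genome_bp, protein_aa

# Second match_part of the example: genome 460135..460344, Target 33..105
print(gap_spans("F2 M18 I3 M52 R2"))   # (210, 73)
```

For that record the genomic interval 460135..460344 is 210 bp long and the Target interval 33..105 covers 73 residues, which is what the sketch returns; the first record (Gap=F2 I1 M28, 86 bp, 29 residues) and the last one (Gap=M39 R2, 115 bp, 39 residues) agree as well.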

Sometimes this happens in Exonerate where it appears as a slightly shifted codon (codons look stacked), but it also happens when an amino acid is split across a splice site (the first part of a codon is on one exon and the second part on the next exon). The raw exonerate cigar you show below doesn't have this because it's only half the cigar and it's in nucleotide space. The value shown in Gap= has to be in the same space as the Target= feature, which in this case is a protein. So we build the protein cigar string from the vulgar string according to the now broken documentation on Gap attributes. You have 28 amino acid matches, 1 insertion, and then an amino acid split across the intron (1 bp of the codon on one side and 2 bp on the other side), and it's flipped because the alignment happens on the opposite strand.

--Carson

> On Oct 23, 2018, at 7:56 AM, Jacques Dainat wrote:
>
> Hello,
>
> Here an example of the cigar string output from exonerate (exactly the same command as launched by MAKER)
>
> cigar: P46461.1 3 740 . genome 460484 439594 - 2580 M 84 I 1 D 56 M 154 I 3 M 54 D 1554 M 145 D 3346 M 137 D 120 M 160 D 197 M 182 D 145 M 165 D 415 M 170 D 5037 M 321 D 124 M 158 D 116 M 183 D 1819 M 157 D 5776 M 115
> vulgar: P46461.1 3 740 . genome 460484 439594 - 2580 M 28 84 G 1 0 S 0 2 5 0 2 I 0 50 3 0 2 S 1 1 M 51 153 G 3 0 M 18 54 S 0 2 5 0 2 I 0 1548 3 0 2 S 1 1 M 48 144 S 0 1 5 0 2 I 0 3341 3 0 2 S 1 2 M 45 135 S 0 2 5 0 2 I 0 114 3 0 2 S 1 1 M 53 159 S 0 1 5 0 2 I 0 192 3 0 2 S 1 2 M 60 180 5 0 2$
> -- completed exonerate analysis
>
> and here the result we get in the protein2genome.gff output from MAKER
>
> @000426F|arrow|arrow protein2genome protein_match 439595 460484 2580 - . ID=@000426F|arrow|arrow:hit:153696:3.10.0.4;Name=P46461.1;target_length=745;aligned_coverage=98.93;aligned_identity=72.6
> @000426F|arrow|arrow protein2genome match_part 460399 460484 2580 - .
ID=@000426F|arrow|arrow:hsp:233933:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 4 32;Gap=F2 I1 M28 > @000426F|arrow|arrow protein2genome match_part 460135 460344 2580 - . ID=@000426F|arrow|arrow:hsp:233934:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 33 105;Gap=F2 M18 I3 M52 R2 > @000426F|arrow|arrow protein2genome match_part 458437 458582 2580 - . ID=@000426F|arrow|arrow:hsp:233935:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 106 154;Gap=F1 M49 R2 > @000426F|arrow|arrow protein2genome match_part 454953 455091 2580 - . ID=@000426F|arrow|arrow:hsp:233936:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 155 200;Gap=F2 M46 R1 > @000426F|arrow|arrow protein2genome match_part 454674 454834 2580 - . ID=@000426F|arrow|arrow:hsp:233937:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 201 254;Gap=F1 M54 R2 > @000426F|arrow|arrow protein2genome match_part 454296 454477 2580 - . ID=@000426F|arrow|arrow:hsp:233938:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 255 315;Gap=M61 R1 > @000426F|arrow|arrow protein2genome match_part 453985 454150 2580 - . ID=@000426F|arrow|arrow:hsp:233939:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 316 370;Gap=F1 M55 > @000426F|arrow|arrow protein2genome match_part 453401 453570 2580 - . ID=@000426F|arrow|arrow:hsp:233940:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 371 427;Gap=M57 R1 > @000426F|arrow|arrow protein2genome match_part 448042 448363 2580 - . ID=@000426F|arrow|arrow:hsp:233941:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 428 534;Gap=F1 M107 > @000426F|arrow|arrow protein2genome match_part 447761 447918 2580 - . 
ID=@000426F|arrow|arrow:hsp:233942:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 535 587;Gap=M53 R1
> @000426F|arrow|arrow protein2genome match_part 447460 447644 2580 - . ID=@000426F|arrow|arrow:hsp:233943:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 588 648;Gap=F2 M61
> @000426F|arrow|arrow protein2genome match_part 445484 445642 2580 - . ID=@000426F|arrow|arrow:hsp:233944:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 649 701;Gap=F2 M53 R2
> @000426F|arrow|arrow protein2genome match_part 439595 439709 2580 - . ID=@000426F|arrow|arrow:hsp:233945:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 702 740;Gap=M39 R2
>
> MAKER apparently processes the CIGAR string and saves it into the Gap attribute. The value looks like a CIGAR string but it is different. Here are the different letters we can find (M, D, I, R, F). I guess M=match, D=deletion and I=insertion, but I don't get the meaning of R and F.
> Could you explain their meanings?
>
> Best regards,
>
> /Jacques
> -------------------------------------------------
> Jacques Dainat, Ph.D.
> NBIS (National Bioinformatics Infrastructure Sweden)
> Genome Annotation Service
> http://nbis.se/about/staff/jacques-dainat
> http://nbis.se
>
> - Contact -
> Address: Uppsala University, Biomedicinska Centrum
> Department of Medical Biochemistry Microbiology, Genomics
> Husargatan 3, box 582
> S-75123 Uppsala Sweden
> Phone: +46 18 471 46 25
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From peachandolives at gmail.com Wed Oct 24 03:28:52 2018 From: peachandolives at gmail.com (Linnie Linnie) Date: Wed, 24 Oct 2018 05:28:52 -0400 Subject: [maker-devel] EVM control file and est2genome Message-ID: Hi, I am trying to run maker together with EVM. I want to annotate a genome for which there is no evidence data, which is why I am using ESTs and protein data from a closely related species. I am finding two unrelated issues. The first one is the following: I set up the control files passing alt_est with a fasta file of ESTs, protein with protein from a closely related species as well as uniprot-sprot.fa, es2genome=1 and prot2genome=1. I am getting the following error: >ERROR: You must provide some form of EST evidence to use est2genome as a predictor. Does this mean I can only use est2genome with ESTs from the species of interest? The second error relates to EVM: I have passed in the file maker_opts.ctl the option run_evm=1. I have used default parameters in the file maker_evm.ctl. I am getting the following error: >ERROR: You have failed to provide a value for 'evm' in the control files. Does this error relate to the maker_opts.ctl file or the maker_evm.ctl one? How could I fix it? And lastly, a more general but fundamental question. Is my approach sensible? My plan is to run this evidence-based annotation, then perhaps train SNAP, Augustus and GeneMark, and use those output files to re-run maker with ab-initio parameters. I would appreciate any input on any of these issues. Thank you! -------------- next part -------------- An HTML attachment was scrubbed... 
URL:

From anthony.bretaudeau at inria.fr Wed Oct 24 09:07:48 2018
From: anthony.bretaudeau at inria.fr (Anthony Bretaudeau)
Date: Wed, 24 Oct 2018 17:07:48 +0200
Subject: [maker-devel] Segfault with OpenMPI
In-Reply-To: <78C1AB95-8D23-4D71-939B-5B68666BE5B7@gmail.com>
References: <8D151E3B-353F-4FD5-94DB-95C1125A8176@gmail.com> <78C1AB95-8D23-4D71-939B-5B68666BE5B7@gmail.com>
Message-ID:

An HTML attachment was scrubbed...
URL:

From carsonhh at gmail.com Wed Oct 24 09:46:30 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 24 Oct 2018 09:46:30 -0600
Subject: [maker-devel] Segfault with OpenMPI
In-Reply-To:
References: <8D151E3B-353F-4FD5-94DB-95C1125A8176@gmail.com> <78C1AB95-8D23-4D71-939B-5B68666BE5B7@gmail.com>
Message-ID: <62EBFA6C-4194-4D65-8313-F67EFCAEF47A@gmail.com>

It divides up pieces of contigs as well as individual steps. BLAST, exonerate, snap, augustus can each run on separate machines.

--Carson

> On Oct 24, 2018, at 9:07 AM, Anthony Bretaudeau wrote:
>
> Hi,
>
> I'll see if I can improve the conda recipe.
>
> Just one simple question: how does Maker divide the work between worker nodes in mpi mode? Is it supposed to be 1 contig per node, or are the largest contigs split into smaller chunks, each one potentially treated on different nodes? From my tests I have the feeling it is the first answer, but I'm not sure if it's normal or not.
>
> Anthony
>
> On 19/10/2018 at 19:22, Carson Holt wrote:
>> RepeatMasker does some data prep during installation (it creates new files in the process), and that does not happen with the bioconda RepeatMasker recipe. So it's broken.
>>
>> To fix it, look at the homebrew recipe for RepeatMasker. It does a good job, and it also preconfigures itself for the free Dfam database rather than RepBase light -->
>>
>> https://github.com/brewsci/homebrew-bio/blob/master/Formula/repeatmasker.rb
>>
>> te_proteins is not a RepeatMasker file. It's a RepeatRunner file which has been integrated into MAKER.
>>>>>
>>>>> Anthony Bretaudeau
>>>>
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From carsonhh at gmail.com Wed Oct 24 09:50:43 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 24 Oct 2018 09:50:43 -0600
Subject: [maker-devel] EVM control file and est2genome
In-Reply-To:
References:
Message-ID: <3CB0FDB0-8B7D-4CF8-B957-5935166D5305@gmail.com>

est2genome only works with the data given to est=.

For the second error, you must provide the path of the evm executable in maker_exe.ctl. It apparently was not in your PATH, so it didn't get automatically filled out.

Here is an example from the wiki of using est2genome and protein2genome to train SNAP for the next MAKER run -->
http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_WGS_Assembly_and_Annotation_Winter_School_2018#Training_ab_initio_Gene_Predictors

--Carson

> On Oct 24, 2018, at 3:28 AM, Linnie Linnie wrote:
>
> Hi,
>
> I am trying to run maker together with EVM. I want to annotate a genome for which there is no evidence data, which is why I am using ESTs and protein data from a closely related species. I am finding two unrelated issues.
>
> The first one is the following:
> I set up the control files passing alt_est with a fasta file of ESTs, protein with protein from a closely related species as well as uniprot-sprot.fa, es2genome=1 and prot2genome=1. I am getting the following error:
>
> >ERROR: You must provide some form of EST evidence to use est2genome as a predictor.
>
> Does this mean I can only use est2genome with ESTs from the species of interest?
>
> The second error relates to EVM:
> I have passed in the file maker_opts.ctl the option run_evm=1. I have used default parameters in the file maker_evm.ctl.
I am getting the following error: > > >ERROR: You have failed to provide a value for 'evm' in the control files. > > Does this error relate to the maker_opts.ctl file or the maker_evm.ctl one? How could I fix it? > > > And lastly, a more general but fundamental question. Is my approach sensible? My plan is to run this evidence-based annotation, then perhaps train SNAP, Augustus and GeneMark, and use those output files to re-run maker with ab-initio parameters. > > I would appreciate any input on any of these issues. > > Thank you! > > > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From jacques.dainat at nbis.se Wed Oct 24 02:41:05 2018 From: jacques.dainat at nbis.se (Jacques Dainat) Date: Wed, 24 Oct 2018 10:41:05 +0200 Subject: [maker-devel] CIGAR string explanation In-Reply-To: <9B7CB8C1-2272-4E2A-A435-73642920B623@gmail.com> References: <8783564C-A8FA-419A-A651-EE53C1563A7F@nbis.se> <9B7CB8C1-2272-4E2A-A435-73642920B623@gmail.com> Message-ID: Thanks for your response. It?s surprising the link in the Sequence Ontology web site doesn?t work anymore. I will notify them. I was surprise that I was not able finding any resource on internet describing these values. Helped by your answer I have refined my key words and googled again, and I finnaly found old ressources describing that too. from 2004 FlyBase here: http://rice.bio.indiana.edu:7082/annot/gff3.html from 2010 WormBase here: http://wiki.wormbase.org/index.php/GFF3specProposal I put a copy here of the Wormbase description in case those resources also disappear. At that time it sounds it was not yet officialy accepted by the SO. 
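As a sanity check on the F and R codes discussed in this thread, a small Python sketch can recompute the genomic span and the protein span implied by a protein-space Gap string. It assumes the accounting that the match_part lines from the original post appear to follow: M and D each cover 3 bp of genome per unit, I adds a residue on the protein side only, and F and R shift forward or back by single bases. The function name gap_lengths is hypothetical and is not part of MAKER or exonerate.

```python
import re

# One operation is a letter from MIDFR followed by a count, e.g. "M28" or "R2".
_GAP_OP = re.compile(r"([MIDFR])(\d+)")


def gap_lengths(gap):
    """Return (genome_bp, protein_aa) implied by a protein-space Gap string.

    Assumed accounting (inferred from the match_part lines in this thread):
      M n: n residues matched to 3*n genome bases
      I n: n residues present in the protein only
      D n: 3*n genome bases with no matching residue
      F n: frame shift forward by n genome bases
      R n: frame shift back by n genome bases
    """
    genome_bp = 0
    protein_aa = 0
    for op, count in _GAP_OP.findall(gap):
        n = int(count)
        if op == "M":
            genome_bp += 3 * n
            protein_aa += n
        elif op == "I":
            protein_aa += n
        elif op == "D":
            genome_bp += 3 * n
        elif op == "F":
            genome_bp += n
        elif op == "R":
            genome_bp -= n
    return genome_bp, protein_aa


# (genome start, genome end, target start, target end, Gap) copied from the
# first two match_part lines of the original post.
records = [
    (460399, 460484, 4, 32, "F2 I1 M28"),
    (460135, 460344, 33, 105, "F2 M18 I3 M52 R2"),
]

for g_start, g_end, t_start, t_end, gap in records:
    bp, aa = gap_lengths(gap)
    # Compare the implied lengths with the spans given by the coordinates.
    print(gap, bp == g_end - g_start + 1, aa == t_end - t_start + 1)
```

For these two records the implied lengths agree with the GFF coordinates: 460399..460484 is 86 bp for Target 4..32 (29 residues), and 460135..460344 is 210 bp for Target 33..105 (73 residues).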
/Jacques

> On 23 Oct 2018, at 17:55, Carson Holt wrote:
>
> Once upon a time the link in the official GFF3 specification to the CIGAR string documentation actually worked, and it would bring you to a nice page that explained everything. It described how F and R were to be used in protein-space alignments (F is a forward frame shift and R is a reverse frame shift in the alignment). M1 in protein space is actually an amino acid match (it matches 3 bp in nucleotide space); this was previously clear in the now-broken link. Likewise, I1 is an amino acid insertion (3 bp in nucleotide space), and D1 is an amino acid deletion (3 bp in nucleotide space). F and R therefore allow for single-bp movement either to the left or right within amino acid space. Sometimes this happens in Exonerate where it appears as a slightly shifted codon (codons look stacked), but it also happens when an amino acid is split across a splice site (the first part of a codon is on one exon and the second part on the next exon). The raw exonerate cigar you show below doesn't have this because it's only half the cigar and it's in nucleotide space; the value shown in Gap= has to be in the same space as the Target= feature, which in this case is a protein. So we build the protein cigar string from the vulgar string according to the now-broken documentation on Gap attributes. You have 28 amino acid matches, 1 insertion, and then an amino acid split across the intron (1 bp of the codon on one side and 2 bp on the other side), and it's flipped because the alignment happens on the opposite strand.
>
> --Carson
>
>> On Oct 23, 2018, at 7:56 AM, Jacques Dainat wrote:
>>
>> Hello,
>>
>> Here is an example of the cigar string output from exonerate (exactly the same command as launched by MAKER):
>>
>> cigar: P46461.1 3 740 . genome 460484 439594 - 2580 M 84 I 1 D 56 M 154 I 3 M 54 D 1554 M 145 D 3346 M 137 D 120 M 160 D 197 M 182 D 145 M 165 D 415 M 170 D 5037 M 321 D 124 M 158 D 116 M 183 D 1819 M 157 D 5776 M 115
>> vulgar: P46461.1 3 740 . genome 460484 439594 - 2580 M 28 84 G 1 0 S 0 2 5 0 2 I 0 50 3 0 2 S 1 1 M 51 153 G 3 0 M 18 54 S 0 2 5 0 2 I 0 1548 3 0 2 S 1 1 M 48 144 S 0 1 5 0 2 I 0 3341 3 0 2 S 1 2 M 45 135 S 0 2 5 0 2 I 0 114 3 0 2 S 1 1 M 53 159 S 0 1 5 0 2 I 0 192 3 0 2 S 1 2 M 60 180 5 0 2$
>> -- completed exonerate analysis
>>
>> and here is the result we get in the protein2genome.gff output from MAKER:
>>
>> @000426F|arrow|arrow protein2genome protein_match 439595 460484 2580 - . ID=@000426F|arrow|arrow:hit:153696:3.10.0.4;Name=P46461.1;target_length=745;aligned_coverage=98.93;aligned_identity=72.6
>> @000426F|arrow|arrow protein2genome match_part 460399 460484 2580 - . ID=@000426F|arrow|arrow:hsp:233933:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 4 32;Gap=F2 I1 M28
>> @000426F|arrow|arrow protein2genome match_part 460135 460344 2580 - . ID=@000426F|arrow|arrow:hsp:233934:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 33 105;Gap=F2 M18 I3 M52 R2
>> @000426F|arrow|arrow protein2genome match_part 458437 458582 2580 - . ID=@000426F|arrow|arrow:hsp:233935:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 106 154;Gap=F1 M49 R2
>> @000426F|arrow|arrow protein2genome match_part 454953 455091 2580 - . ID=@000426F|arrow|arrow:hsp:233936:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 155 200;Gap=F2 M46 R1
>> @000426F|arrow|arrow protein2genome match_part 454674 454834 2580 - . ID=@000426F|arrow|arrow:hsp:233937:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 201 254;Gap=F1 M54 R2
>> @000426F|arrow|arrow protein2genome match_part 454296 454477 2580 - . ID=@000426F|arrow|arrow:hsp:233938:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 255 315;Gap=M61 R1
>> @000426F|arrow|arrow protein2genome match_part 453985 454150 2580 - . ID=@000426F|arrow|arrow:hsp:233939:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 316 370;Gap=F1 M55
>> @000426F|arrow|arrow protein2genome match_part 453401 453570 2580 - . ID=@000426F|arrow|arrow:hsp:233940:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 371 427;Gap=M57 R1
>> @000426F|arrow|arrow protein2genome match_part 448042 448363 2580 - . ID=@000426F|arrow|arrow:hsp:233941:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 428 534;Gap=F1 M107
>> @000426F|arrow|arrow protein2genome match_part 447761 447918 2580 - . ID=@000426F|arrow|arrow:hsp:233942:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 535 587;Gap=M53 R1
>> @000426F|arrow|arrow protein2genome match_part 447460 447644 2580 - . ID=@000426F|arrow|arrow:hsp:233943:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 588 648;Gap=F2 M61
>> @000426F|arrow|arrow protein2genome match_part 445484 445642 2580 - . ID=@000426F|arrow|arrow:hsp:233944:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 649 701;Gap=F2 M53 R2
>> @000426F|arrow|arrow protein2genome match_part 439595 439709 2580 - . ID=@000426F|arrow|arrow:hsp:233945:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 702 740;Gap=M39 R2
>>
>> MAKER apparently processes the CIGAR string and saves it into the Gap attribute. The value looks like a CIGAR string, but it is different. These are the letters we can find: M, D, I, R, F. I guess M=match, D=deletion and I=insertion, but I don't get the meaning of R and F.
>> Could you explain their meanings?
>>
>> Best regards,
>>
>> /Jacques
>> -------------------------------------------------
>> Jacques Dainat, Ph.D.
>> NBIS (National Bioinformatics Infrastructure Sweden)
>> Genome Annotation Service
>> http://nbis.se/about/staff/jacques-dainat
>> http://nbis.se
>>
>> Contact
>> Address: Uppsala University, Biomedicinska Centrum
>> Department of Medical Biochemistry Microbiology, Genomics
>> Husargatan 3, box 582
>> S-75123 Uppsala Sweden
>> Phone: +46 18 471 46 25

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Screen Shot 2018-10-24 at 10.00.41.png
Type: image/png
Size: 281561 bytes
Desc: not available
URL:

From elyssa_garza at yahoo.com Wed Oct 24 15:27:50 2018
From: elyssa_garza at yahoo.com (Elyssa Garza)
Date: Wed, 24 Oct 2018 21:27:50 +0000 (UTC)
Subject: [maker-devel] Is gene retrieval from gff possible?
In-Reply-To: <1576161756.398305.1540414096080@mail.yahoo.com>
References: <8783564C-A8FA-419A-A651-EE53C1563A7F@nbis.se> <1576161756.398305.1540414096080@mail.yahoo.com>
Message-ID: <1888825059.421524.1540416470195@mail.yahoo.com>

Hello,

I recently annotated my plant genome and am looking at retrieving a particular set of genes from the MAKER results. I have a list of TAIR IDs that I am particularly interested in and was thinking about using the GFF file to help pull out the associated transcripts. Could you advise me on the best or easiest way of obtaining the associated TAIR accession or gene model from the GFF file? I did try looking at the genes (41,779 genes) using CLCbio, but the accessions were not easily identified or found. I also looked at the protein matches (819,805 protein matches) and was able to easily find gene model matches corresponding to my target accessions. Is it wise to do this? Can you explain why I can't find these same protein matches in the gene file? I have some ideas on why this is happening, but I am looking for support for them.

Elyssa

From pallavi.gupta at slu.edu Thu Oct 25 15:22:31 2018
From: pallavi.gupta at slu.edu (Pallavi Gupta)
Date: Thu, 25 Oct 2018 21:22:31 +0000
Subject: [maker-devel] Issue with maker
Message-ID:

Hi Team MAKER,

I am using MAKER for genome annotation in my research, but when I run it I get a weird error. I looked for a workaround on various bioinformatics forums but was unsuccessful. I would really appreciate your help with this. I have attached my nohup.out log. Please let me know if you need anything else.

Thanks,
Pallavi Gupta

-------------- next part --------------
A non-text attachment was scrubbed...
Name: nohup.out
Type: application/octet-stream
Size: 26365432 bytes
Desc: nohup.out
URL:

From 17na34 at queensu.ca Wed Oct 31 08:27:44 2018
From: 17na34 at queensu.ca (Nikolay Alabi)
Date: Wed, 31 Oct 2018 14:27:44 +0000
Subject: [maker-devel] MAKER not running properly after installation, help needed
Message-ID:

Hello,

I am attempting to annotate a garlic mustard genome using MAKER on a cluster at Queen's University. I have been following the tutorial on the wiki and was attempting to use the practice data to see whether the program runs properly and to learn how to train the gene predictors. MAKER is now installed and works to an extent; however, in use it does not work properly and cannot read/annotate a genome.
I suspect two problems are causing this. First, any time any MAKER command is called, it reports that an argument in forks.pm in perl5 is not correct; after trying to fix the problem, I see that the code should be correct, but the error line still occurs. Then, every time a MAKER command is called, another error says there is an error flow occurring somewhere in Perl again. For instance, when I run maker -h, maker -CTL, or anything else to do with maker, the error lines occur. Would you advise me to reinstall Perl and BioPerl? Other than that, I believe everything else is properly installed, and I do not understand why the program is not running properly. I have even tried using different genome data; however, the same problem occurs: the run never finishes, then retries, and ultimately fails. Please let me know if there is another possible source of error.

Best regards,
Nikolay

From liorglic at mail.tau.ac.il Tue Oct 2 00:50:32 2018
From: liorglic at mail.tau.ac.il (Lior Glick)
Date: Tue, 2 Oct 2018 09:50:32 +0300
Subject: [maker-devel] Help debugging a MAKER result
In-Reply-To:
References:
Message-ID:

Hi Xabier, and thanks for your reply.
I forgot to mention it, but I used the annotated repeats derived from the ITAG annotation as the repeat library, so I expect these to be quite appropriate.
I guess my question is about the way MAKER makes decisions: is the fact that some repeats (simple repeats in this case) were predicted enough to change a CDS into a UTR, despite sufficient protein evidence?
I did not train Augustus myself; rather, I used the species (tomato) profile that comes with the Augustus release. Does that make sense?
As for the haploid/diploid issue: fortunately I don't have to deal with that, since cultivated tomato varieties are repeatedly selfed, so they are (almost) completely homozygous.

From xvazquezc at gmail.com Tue Oct 2 22:39:40 2018
From: xvazquezc at gmail.com (=?UTF-8?Q?Xabier_V=C3=A1zquez=2DCampos?=)
Date: Wed, 3 Oct 2018 14:39:40 +1000
Subject: [maker-devel] Help debugging a MAKER result
In-Reply-To:
References:
Message-ID:

Yeah, tomato should be rather well annotated.

I would double-check how good the tomato genome was at the time the gene model was created. Also, creating a new Augustus model based on the first prediction run might improve things.

You have tomato in Repbase. To be sure you are not missing anything, I would still run the advanced repeat library protocol, if it isn't computationally prohibitive.

I don't know how good SNAP is for plant genomes, so it could be worth trying on top of the Augustus predictions.

On top of this, I'd take a look at reference-based annotation tools like RATT. This would annotate all the regions shared with the reference, and you would then curate only the regions that cannot be annotated from the reference, using your MAKER annotation.

--
Xabier Vázquez-Campos, PhD
Research Associate
NSW Systems Biology Initiative
School of Biotechnology and Biomolecular Sciences
The University of New South Wales
Sydney NSW 2052 AUSTRALIA
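The comparison Lior describes at the start of this thread (top all-vs-all BLAST hit per ITAG protein, kept only if it passes an e-value and a coverage threshold) could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the script that was actually used: it assumes BLAST tabular output with the columns qseqid, sseqid, evalue, qstart, qend, qlen in that order, treats the e-value as an upper bound (default 1e-5, which may differ from the value used in the thread), and measures coverage on the single best hit rather than summing HSPs. The function name and the IDs in the demo are made up.

```python
import csv


def covered_queries(tsv_lines, max_evalue=1e-5, min_cov=0.90):
    """Return the set of query IDs whose best hit passes both thresholds.

    Assumed columns per line (tab-separated):
    qseqid, sseqid, evalue, qstart, qend, qlen
    """
    best = {}  # qseqid -> (evalue, coverage) of the lowest-evalue hit
    for row in csv.reader(tsv_lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        qseqid, _sseqid, evalue, qstart, qend, qlen = row[:6]
        evalue = float(evalue)
        # Fraction of the query covered by this one alignment.
        coverage = (abs(int(qend) - int(qstart)) + 1) / int(qlen)
        if qseqid not in best or evalue < best[qseqid][0]:
            best[qseqid] = (evalue, coverage)
    return {
        q for q, (evalue, coverage) in best.items()
        if evalue <= max_evalue and coverage >= min_cov
    }


# Hypothetical rows, for illustration only.
demo = [
    "ITAG_A\tmine_1\t1e-50\t1\t95\t100",
    "ITAG_A\tmine_2\t1e-10\t1\t100\t100",
    "ITAG_B\tmine_3\t1e-30\t10\t60\t100",
    "ITAG_C\tmine_4\t0.01\t1\t100\t100",
]
print(sorted(covered_queries(demo)))
```

In the demo, ITAG_B fails on coverage and ITAG_C fails on e-value, so only ITAG_A is reported as covered. Dividing the size of the returned set by the number of ITAG proteins gives the kind of percentage quoted in the thread.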
From carsonhh at gmail.com Thu Oct 4 17:52:47 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 4 Oct 2018 17:52:47 -0600
Subject: [maker-devel] Help debugging a MAKER result
In-Reply-To:
References:
Message-ID: <597F5B29-71BF-409D-B5E2-E1D4611953C3@gmail.com>

I'd just like to add info on how MAKER builds predictions.

MAKER itself does not generate models. In your case, Augustus produces the models. Augustus will run twice: once on its own (on a repeat-masked version of the assembly), and once more where MAKER provides it with a hints file as part of the command line used to run Augustus. The hints file is generated from the evidence alignments you provided to MAKER. The hints usually get Augustus to perform a little better than it does with training alone on a masked assembly.

Under-masking or over-masking the assembly can both confound Augustus. MAKER hard-masks complex repeats in the assembly (turns them from ATCG into N's) and soft-masks simple repeats (turns ATCG into lower-case atcg). The lower-case "soft-masking" affects BLAST alignment but not Augustus predictions (Augustus ignores it). MAKER also removes the hard-masking when it runs Augustus with the hints file. This is done because we've constrained Augustus to a smaller padded evidence cluster at the locus, and Augustus can no longer see the whole assembly.

If you want to explore how masking affects the models, you can set unmask=1. Then Augustus will run 3 times (one extra run on the unmasked assembly). You can then look at contigs in a browser to see how the masked vs unmasked models compare to each other.

--Carson
> --
> Xabier Vázquez-Campos, PhD
> Research Associate
> NSW Systems Biology Initiative
> School of Biotechnology and Biomolecular Sciences
> The University of New South Wales
> Sydney NSW 2052 AUSTRALIA

From myandell at genetics.utah.edu Thu Oct 4 18:05:04 2018
From: myandell at genetics.utah.edu (Mark Yandell)
Date: Fri, 5 Oct 2018 00:05:04 +0000
Subject: [maker-devel] Help debugging a MAKER result
In-Reply-To: <597F5B29-71BF-409D-B5E2-E1D4611953C3@gmail.com>
References: <597F5B29-71BF-409D-B5E2-E1D4611953C3@gmail.com>
Message-ID:

Cheers!

From carsonhh at gmail.com Thu Oct 4 18:09:58 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 4 Oct 2018 18:09:58 -0600
Subject: [maker-devel] Help debugging a MAKER result
In-Reply-To: <597F5B29-71BF-409D-B5E2-E1D4611953C3@gmail.com>
References: <597F5B29-71BF-409D-B5E2-E1D4611953C3@gmail.com>
Message-ID:

One correction. I meant to say set unmask=1.

--Carson

> On Oct 4, 2018, at 5:52 PM, Carson Holt wrote:
>
> If you want to explore how masking affects the models, you can set unmask=0. Then Augustus will run 3 times (one extra run on the unmasked assembly).

From liorglic at mail.tau.ac.il Fri Oct 5 00:51:41 2018
From: liorglic at mail.tau.ac.il (Lior Glick)
Date: Fri, 5 Oct 2018 09:51:41 +0300
Subject: [maker-devel] Help debugging a MAKER result
In-Reply-To:
References: <597F5B29-71BF-409D-B5E2-E1D4611953C3@gmail.com>
Message-ID:

Thank you both for your helpful ideas. I'm going to give them a try and see how this affects my results. Will update when I have them.
Cheers indeed.
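The hard- versus soft-masking distinction described in this thread can be illustrated with a small sketch. It is not MAKER or RepeatMasker code: the function names, the toy sequence, and the 0-based half-open interval convention are all assumptions made for the example. It only shows why a tool that ignores case still sees soft-masked bases, while hard-masked bases are gone until the original sequence is restored.

```python
def apply_masking(seq, hard_intervals, soft_intervals):
    """Return seq with hard-masked spans as N and soft-masked spans in lower case.

    Intervals are 0-based, half-open (start, end) pairs. This convention is an
    assumption of the sketch, not MAKER's internal representation.
    """
    bases = list(seq.upper())
    # Soft-masking keeps the base identity and only changes the case.
    for start, end in soft_intervals:
        for i in range(start, end):
            bases[i] = bases[i].lower()
    # Hard-masking discards the base identity entirely.
    for start, end in hard_intervals:
        for i in range(start, end):
            bases[i] = "N"
    return "".join(bases)


def remove_soft_masking(seq):
    """Upper-case the sequence: roughly what a case-insensitive tool sees."""
    return seq.upper()


if __name__ == "__main__":
    # Hypothetical 40 bp locus: a simple repeat at 6..16, a complex repeat at 20..35.
    locus = "ATGGCGTATATATATACCGTTAGGCTTAGGCTTAGGCTGA"
    masked = apply_masking(locus, hard_intervals=[(20, 35)], soft_intervals=[(6, 16)])
    print("masked:          ", masked)
    print("case ignored:    ", remove_soft_masking(masked))
    print("original locus:  ", locus)
```

Comparing the three printed lines shows that ignoring case recovers the simple repeat but not the hard-masked span; only going back to the unmasked sequence does that.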
From carsonhh at gmail.com Fri Oct 5 14:37:34 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 5 Oct 2018 14:37:34 -0600
Subject: [maker-devel] Segfault with OpenMPI
In-Reply-To:
References:
Message-ID: <8D151E3B-353F-4FD5-94DB-95C1125A8176@gmail.com>

I tried setting this up, but I ran into a number of issues. First, RepeatMasker is not being installed correctly. The configuration step should create these files (created by the ./configure script during RepeatMasker setup) -->

RepeatMasker.lib
RepeatMasker.lib.nhr
RepeatMasker.lib.nin
RepeatMasker.lib.nsq
RepeatMaskerLib.embl

But they do not exist in the share directory. Also, MAKER needs access to the te_proteins file in .../maker/data, and because you have rearranged MAKER's structure it can't find it.

Then for the segmentation fault: I have seen this a handful of times in the past where users install their own version of perl, rather than using the system perl, together with their own install of OpenMPI. The issue is some series of flags in either OpenMPI or perl (I'm not sure which). But one way around it is to disable the interpreter threads option when compiling and installing perl for yourself. Most system perl installs have interpreter threads enabled, so I'm not sure why some self-installs generate this segfault and never the system perl. Interestingly, interpreter threads are turned off by default when you install perl manually, as they are "officially discouraged". You actually have to enable them during the self-install process, and conda is enabling them on the manual install to match most system perls.

Another workaround is to not use OpenMPI. Try MPICH3.

--Carson

> On Sep 25, 2018, at 6:10 AM, Anthony Bretaudeau wrote:
>
> Hi,
>
> I've worked on the Bioconda recipe for Maker (https://github.com/bioconda/bioconda-recipes/tree/master/recipes/maker/). It works well, except when using it in MPI mode. I get this segfault error:
>
> STATUS: Processing and indexing input FASTA files...
> [cl1n022:06306] *** Process received signal ***
> [cl1n022:06306] Signal: Segmentation fault (11)
> [cl1n022:06306] Signal code: Address not mapped (1)
> [cl1n022:06306] Failing at address: 0x514
> [cl1n022:06306] [ 0] /lib64/libpthread.so.0(+0xf6d0)[0x2b9ce51026d0]
> [cl1n022:06306] [ 1] /local/miniconda3/envs/maker-2.31.10/bin/perl(Perl_csighandler+0x1e)[0x4aad4e]
> [cl1n022:06306] [ 2] /lib64/libpthread.so.0(+0xf6d0)[0x2b9ce51026d0]
> [cl1n022:06306] [ 3] /lib64/libc.so.6(__poll+0x2d)[0x2b9ce5f5cf0d]
> [cl1n022:06306] [ 4] /local/miniconda3/envs/maker-2.31.10/perl/lib/auto/Parallel/Application/MPI/../../../../../../lib/./libopen-pal.so.40(+0x869e5)[0x2b9cf05859e5]
> [cl1n022:06306] [ 5] /local/miniconda3/envs/maker-2.31.10/perl/lib/auto/Parallel/Application/MPI/../../../../../../lib/./libopen-pal.so.40(opal_libevent2022_event_base_loop+0x242)[0x2b9cf057a73a]
> [cl1n022:06306] [ 6] /local/miniconda3/envs/maker-2.31.10/perl/lib/auto/Parallel/Application/MPI/../../../../../../lib/./libopen-pal.so.40(+0x384de)[0x2b9cf05374de]
> [cl1n022:06306] [ 7] /lib64/libpthread.so.0(+0x7e25)[0x2b9ce50fae25]
> [cl1n022:06306] [ 8] /lib64/libc.so.6(clone+0x6d)[0x2b9ce5f67bad]
> [cl1n022:06306] *** End of error message ***
> SIGTERM received
> SIGTERM received
>
> As mentioned in older posts, I've tried adding the LD_PRELOAD variable, or running mpirun with the "-mca btl ^openib" option, but it didn't help.
>
> As this happens with the Bioconda package, I guess it should be pretty reproducible on other setups.
>
> Bioconda's Maker package uses version 5.26.2 of Perl and version 3.1.2 of OpenMPI, and the OpenMPI recipe is on https://github.com/conda-forge/openmpi-feedstock/tree/master/recipe
> Any help would be highly appreciated!
>
> Anthony Bretaudeau

From carson.holt at genetics.utah.edu Mon Oct 8 10:34:22 2018
From: carson.holt at genetics.utah.edu (Carson Hinton Holt)
Date: Mon, 8 Oct 2018 16:34:22 +0000
Subject: [maker-devel] maker problem
In-Reply-To:
References: <189553EC-9D1F-4C2F-8672-3562C8A4A088@oregonstate.edu> <177AD833-BA97-40CD-B500-1AD4531DE41A@gmail.com>
Message-ID: <258E6D0D-6A34-42E2-91F3-F7693ED42E7C@genetics..utah.edu>

GFF3 should have the assembly fasta at the bottom. That is part of the format. Please familiarize yourself with GFF3 here --> https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md

In particular, look at the different kinds of expected features (for example, gene/mRNA/exon/CDS gene models vs match/match_part evidence alignments).

You also need to familiarize yourself with the MAKER documentation, and perhaps follow one of the step-by-step tutorials in the MAKER wiki (http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Main_Page). The 2014 tutorial has a video you can follow along with.
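To make the point about the GFF3 format concrete, here is a minimal sketch of how the annotation section and the trailing FASTA section can be separated. It relies only on the `##FASTA` directive defined in the GFF3 specification; the function names, the toy records, and the source labels in column 2 are hypothetical and are not MAKER output. Note that the sequence after `##FASTA` is the assembly (contig) sequence, not transcripts.

```python
from collections import Counter


def split_gff3(text):
    """Split GFF3 text into (annotation_lines, {sequence_id: sequence}).

    Per the GFF3 specification, a '##FASTA' line ends the feature section and
    everything after it is FASTA-formatted sequence.
    """
    annotation, sequences = [], {}
    in_fasta, current = False, None
    for line in text.splitlines():
        if not in_fasta:
            if line.strip() == "##FASTA":
                in_fasta = True
            else:
                annotation.append(line)
        elif line.startswith(">"):
            fields = line[1:].split()
            current = fields[0] if fields else ""
            sequences[current] = []
        elif current is not None and line.strip():
            sequences[current].append(line.strip())
    return annotation, {name: "".join(parts) for name, parts in sequences.items()}


def count_feature_types(annotation_lines):
    """Count values of column 3 (feature type) over 9-column feature lines."""
    counts = Counter()
    for line in annotation_lines:
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) == 9:
            counts[columns[2]] += 1
    return counts


def _row(*columns):
    """Join columns with tabs, as GFF3 requires."""
    return "\t".join(columns)


# Toy file: one gene model plus one evidence alignment, then the contig sequence.
EXAMPLE_GFF3 = "\n".join([
    "##gff-version 3",
    _row("contig1", "maker", "gene", "1", "12", ".", "+", ".", "ID=gene1"),
    _row("contig1", "maker", "mRNA", "1", "12", ".", "+", ".", "ID=mrna1;Parent=gene1"),
    _row("contig1", "maker", "exon", "1", "12", ".", "+", ".", "ID=exon1;Parent=mrna1"),
    _row("contig1", "maker", "CDS", "1", "12", ".", "+", "0", "ID=cds1;Parent=mrna1"),
    _row("contig1", "est2genome", "match", "1", "12", ".", "+", ".", "ID=hit1"),
    _row("contig1", "est2genome", "match_part", "1", "12", ".", "+", ".", "ID=hsp1;Parent=hit1"),
    "##FASTA",
    ">contig1",
    "ATGGCG",
    "TATTGA",
])

if __name__ == "__main__":
    annotation, sequences = split_gff3(EXAMPLE_GFF3)
    print(dict(count_feature_types(annotation)))
    print(sequences)
```

A file with only match/match_part rows and no gene/mRNA/exon/CDS rows would show up immediately in the feature-type counts, which is a quick way to tell evidence alignments from gene models.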
Output files are described in the documentation and the wiki. Particularly look at the necessary gff3_merge and fasta_merge scripts described in the wiki with multiple examples. Individual contigs will have results like so ->

contig-dpp-500-500.gff
contig-dpp-500-500.maker.proteins.fasta
contig-dpp-500-500.maker.transcripts.fasta

The merge scripts will collect all the individual contig results into merged files.

Example datasets for all of the wiki tutorials are included in the .../maker/data directory as well as the .../maker/MWAS/data/ directory (you can use them to follow along with the wiki pages).

If you follow the tutorial steps for training SNAP on a new genome and you get empty training files, then the issue is the evidence training sets you gave (example from the e-mail list archive) -> https://groups.google.com/forum/#!searchin/maker-devel/maker2zff%7Csort:date/maker-devel/TculOM5oxl4/UWENIGN7EQAJ

You can also browse through the archive for more info on training SNAP and Augustus.

--Carson

On Oct 8, 2018, at 10:12 AM, Gupta, Parul wrote:

Hi Carson,

As per your suggestion, I turned on est2genome=1 and protein2genome=1 but similar results are generated. The gff of each scaffold has fasta (transcripts) sequence at the end instead of generating transcripts.fasta and protein.fasta separately. I don't know how to use such gffs for further processing such as training SNAP (for gene prediction). Need your suggestion.

Is there an option to provide trained data from Augustus (generated from Augustus standalone rather than from maker) instead of the Augustus species in maker_opts.ctl?

Thanks,
Parul

On Oct 4, 2018, at 6:43 PM, Gupta, Parul wrote:

Thank you Carson.

Sent from my iPad

On Oct 4, 2018, at 3:11 PM, Carson Holt wrote:

You must turn on at least 1 prediction method. It can be est2genome=1, protein2genome=1, or a species file to run SNAP/Augustus. The first two options are for building models to train with. If you don't provide a prediction method, MAKER will align evidence, but you won't get any gene models.

Example: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_WGS_Assembly_and_Annotation_Winter_School_2018#Training_ab_initio_Gene_Predictors

--Carson

On Oct 1, 2018, at 1:05 PM, Gupta, Parul wrote:

Hi Carson,

I am a new user of the maker pipeline and wanted to get gene predictions for a new plant genome. I used the following options in the maker_opts.ctl file for the first round:

genome=masked_genome.fasta
est=transcripts.fasta (from same species for which genome fasta is provided)
altest=transcripts.fasta (from alternative organism)
protein=proteins.fasta

Output files are only gff (no fasta); however, the gff for each scaffold has fasta sequences at the bottom. I wonder, is that the correct output I am getting?

In order to train SNAP, I used gff3_merge to concatenate all gffs from datastore_index.log to get all.gff (which also has fasta sequences). Then, all.gff was used for maker2zff and it generated zero size files (genome.ann and genome.dna). I am wondering whether I made a mistake or did not provide all input files.

For repeat masking I used RepeatMasker separately from the maker pipeline. My datastore_index.log file shows many "RETRY" and "FAILED" scaffolds.

FYI, I subscribed to the "maker-devel" google group but the "new topic" button is greyed out.

Your suggestions?

Thanks in advance.
Parul

From Parul.Gupta at oregonstate.edu Mon Oct 8 11:31:04 2018
From: Parul.Gupta at oregonstate.edu (Gupta, Parul)
Date: Mon, 8 Oct 2018 17:31:04 +0000
Subject: [maker-devel] maker problem
In-Reply-To: <258E6D0D-6A34-42E2-91F3-F7693ED42E7C@genetics..utah.edu>
References: <189553EC-9D1F-4C2F-8672-3562C8A4A088@oregonstate.edu> <177AD833-BA97-40CD-B500-1AD4531DE41A@gmail.com> <258E6D0D-6A34-42E2-91F3-F7693ED42E7C@genetics..utah.edu>
Message-ID: <90D768D2-F911-4BA3-A8C5-1DAE79566114@oregonstate.edu>

Alright, I had gone through all those tutorials. But my question is: why is maker generating only gff as output? There is neither transcripts.fasta nor proteins.fasta in the output directory. So I can only use gff3_merge but not fasta_merge, because there are no fasta files. This happened to all scaffolds.

Below is the example of my datastore_index.log file for that scaffold:

ScJhAqd_1;HRSCAF=2 Sh_masked_rd2_datastore/18/62/ScJhAqd_1%3BHRSCAF=2/ STARTED
ScJhAqd_1;HRSCAF=2 Sh_masked_rd2_datastore/18/62/ScJhAqd_1%3BHRSCAF=2/ FINISHED

Output directory of that scaffold looks like:

[Linux at waterman ScJhAqd_1%3BHRSCAF=2]$ ll
total 160
drwxr-xr-x 3 guptapa pi 3 Oct 5 15:51 ../
-rw-r--r-- 1 guptapa pi 27740 Oct 5 15:51 run.log
-rw-r--r-- 1 guptapa pi 34268 Oct 5 15:51 ScJhAqd_1%3BHRSCAF=2.gff
drwxr-xr-x 2 guptapa pi 75 Oct 5 15:51 theVoid.ScJhAqd_1%3BHRSCAF=2/
drwxr-xr-x 3 guptapa pi 5 Oct 5 15:51 ./

gff looks like:

[Linux at waterman ScJhAqd_1%3BHRSCAF=2]$ head ScJhAqd_1%3BHRSCAF=2.gff
##gff-version 3
ScJhAqd_1%3BHRSCAF%3D2 . contig 1 2578 . . . ID=ScJhAqd_1%3BHRSCAF%3D2;Name=ScJhAqd_1%3BHRSCAF%3D2;
ScJhAqd_1%3BHRSCAF%3D2 blastx protein_match 1782 2024 175 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hit:0;Name=Mlong585_29391-RA;
ScJhAqd_1%3BHRSCAF%3D2 blastx match_part 1782 2024 175 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hsp:0;Parent=ScJhAqd_1%3BHRSCAF%3D2:hit:0;Name=Mlong585_29391-RA;Target=Mlong585_29391-RA 132 212;Gap=M81;
ScJhAqd_1%3BHRSCAF%3D2 blastx protein_match 1785 2578 175 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hit:1;Name=Mlong585_37101-RA;
ScJhAqd_1%3BHRSCAF%3D2 blastx match_part 2477 2578 112 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hsp:1;Parent=ScJhAqd_1%3BHRSCAF%3D2:hit:1;Name=Mlong585_37101-RA;Target=Mlong585_37101-RA 28 61;Gap=M34;
ScJhAqd_1%3BHRSCAF%3D2 blastx match_part 1785 2042 175 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hsp:2;Parent=ScJhAqd_1%3BHRSCAF%3D2:hit:1;Name=Mlong585_37101-RA;Target=Mlong585_37101-RA 154 239;Gap=M86;
ScJhAqd_1%3BHRSCAF%3D2 blastx protein_match 1806 2578 128 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hit:2;Name=Mlong585_11451-RA;
ScJhAqd_1%3BHRSCAF%3D2 blastx match_part 2471 2578 132 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hsp:3;Parent=ScJhAqd_1%3BHRSCAF%3D2:hit:2;Name=Mlong585_11451-RA;Target=Mlong585_11451-RA 117 152;Gap=M36;
ScJhAqd_1%3BHRSCAF%3D2 blastx match_part 2299 2379 89 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hsp:4;Parent=ScJhAqd_1%3BHRSCAF%3D2:hit:2;Name=Mlong585_11451-RA;Target=Mlong585_11451-RA 153 179;Gap=M27;

Regards,
Parul
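The datastore_index.log excerpt in this thread lists each scaffold more than once (STARTED, then FINISHED), and an earlier message mentions many RETRY and FAILED entries. A minimal sketch for summarising such a log is given here. It assumes the log is tab-separated with three columns (sequence name, output path, status) and that the last line for a scaffold reflects its final state; both are assumptions for illustration, and the function names are hypothetical rather than part of MAKER.

```python
from collections import Counter


def final_status(lines):
    """Map each sequence name to the status on its last log line.

    Assumes tab-separated columns: name, datastore path, status.
    """
    status = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, state = fields[0], fields[2]
        status[name] = state  # later lines overwrite earlier ones
    return status


def summarise(status):
    """Count how many sequences ended in each state."""
    return Counter(status.values())


# Small in-memory demo using the two lines shown in the thread,
# plus one made-up scaffold ("scaffold_X") that ends in FAILED.
demo = [
    "ScJhAqd_1;HRSCAF=2\tSh_masked_rd2_datastore/18/62/ScJhAqd_1%3BHRSCAF=2/\tSTARTED\n",
    "ScJhAqd_1;HRSCAF=2\tSh_masked_rd2_datastore/18/62/ScJhAqd_1%3BHRSCAF=2/\tFINISHED\n",
    "scaffold_X\tSh_masked_rd2_datastore/00/00/scaffold_X/\tSTARTED\n",
    "scaffold_X\tSh_masked_rd2_datastore/00/00/scaffold_X/\tFAILED\n",
]
for state, n in sorted(summarise(final_status(demo)).items()):
    print(state, n)
```

To run it on a real log, pass an open file handle instead of the demo list. Scaffolds whose final state is not FINISHED are the ones worth inspecting first.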
From carson.holt at genetics.utah.edu Mon Oct 8 11:45:31 2018
From: carson.holt at genetics.utah.edu (Carson Hinton Holt)
Date: Mon, 8 Oct 2018 17:45:31 +0000
Subject: [maker-devel] maker problem
In-Reply-To: <90D768D2-F911-4BA3-A8C5-1DAE79566114@oregonstate.edu>
References: <189553EC-9D1F-4C2F-8672-3562C8A4A088@oregonstate.edu> <177AD833-BA97-40CD-B500-1AD4531DE41A@gmail.com> <258E6D0D-6A34-42E2-91F3-F7693ED42E7C@genetics..utah.edu> <90D768D2-F911-4BA3-A8C5-1DAE79566114@oregonstate.edu>
Message-ID:

Look at the GFF3, particularly gene/mRNA/exon/CDS vs match/match_part features (GFF3 spec). Does your GFF3 contain gene/mRNA/exon/CDS entries? If not, then your GFF3 has no models (it's empty even if it does contain match/match_part entries). This means either

1. no predictor was set during the run (i.e. est2genome=1 or protein2genome=1 not set) or
2. evidence alignments or assembly are so poor that no models can be made.

Look at the results in a browser. Compare what you see on one of your contigs to what you get when running an example from the tutorials. Perhaps you provided unassembled mRNA-seq data (maker does not process raw mRNA-seq, it must be assembled first). Perhaps you did not provide a broad protein dataset (UniProt/Swiss-prot is usually a good one to use for example). Or perhaps your assembly is too fragmented and has too many runs of NNNNNN to generate matching ORFs against evidence alignments (look at results in a browser).

--Carson
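The question "does your GFF3 contain gene/mRNA/exon/CDS entries?" can be answered quickly by tallying the feature-type column. The sketch below is one possible way to do that, not part of MAKER: it assumes a tab-delimited GFF3 (as the spec requires), reads the third column, and stops at the ##FASTA directive so the embedded assembly sequence is not scanned. The function names and the sample records (adapted from the excerpt in this thread, with tabs restored) are illustrative.

```python
from collections import Counter

# Feature types that indicate actual gene models, per the message above.
MODEL_TYPES = ("gene", "mRNA", "exon", "CDS")


def count_feature_types(lines):
    """Tally GFF3 column 3 (feature type), ignoring the FASTA section."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # everything below is sequence, not features
        if not line or line.startswith("#"):
            continue  # blank lines, comments, and directives
        fields = line.split("\t")
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts


def has_gene_models(counts):
    """True if any gene/mRNA/exon/CDS feature is present."""
    return any(counts[t] > 0 for t in MODEL_TYPES)


# In-memory demo with evidence-only records like those posted in the thread.
sample = [
    "##gff-version 3",
    "ScJhAqd_1%3BHRSCAF%3D2\t.\tcontig\t1\t2578\t.\t.\t.\tID=ScJhAqd_1%3BHRSCAF%3D2",
    "ScJhAqd_1%3BHRSCAF%3D2\tblastx\tprotein_match\t1782\t2024\t175\t-\t.\tID=hit0",
    "ScJhAqd_1%3BHRSCAF%3D2\tblastx\tmatch_part\t1782\t2024\t175\t-\t.\tID=hsp0;Parent=hit0",
    "##FASTA",
    ">ScJhAqd_1%3BHRSCAF%3D2",
    "ACGT",
]
counts = count_feature_types(sample)
print(dict(counts))
print("gene models present:", has_gene_models(counts))
```

A file that reports only contig, protein_match and match_part types, as the posted excerpt does, contains evidence alignments but no annotations, which matches the two causes listed in the message.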
From carson.holt at genetics.utah.edu Mon Oct 8 12:08:49 2018
From: carson.holt at genetics.utah.edu (Carson Hinton Holt)
Date: Mon, 8 Oct 2018 18:08:49 +0000
Subject: [maker-devel] maker problem
In-Reply-To:
References: <189553EC-9D1F-4C2F-8672-3562C8A4A088@oregonstate.edu> <177AD833-BA97-40CD-B500-1AD4531DE41A@gmail.com> <258E6D0D-6A34-42E2-91F3-F7693ED42E7C@genetics..utah.edu> <90D768D2-F911-4BA3-A8C5-1DAE79566114@oregonstate.edu>
Message-ID:

Also run BUSCO on your assembly. It will give you an estimate of how complete/incomplete your genome assembly is. Also make sure you are running on a genome assembly and not a transcriptome assembly (MAKER does not annotate transcriptomes).

--Carson
I am wondering whether I did any mistake or not provides all input files. For repeat masking I used Repeatmasker separate from maker pipeline. My datastore_index.log file shows many ?RETRY? and ?FAILED? scaffolds. FYI, I subscribed to "maker-devel" google group but "new topic? button is greyed out. Yours suggestion?? Thanks in advance. Parul -------------- next part -------------- An HTML attachment was scrubbed... URL: From Parul.Gupta at oregonstate.edu Mon Oct 8 13:12:27 2018 From: Parul.Gupta at oregonstate.edu (Gupta, Parul) Date: Mon, 8 Oct 2018 19:12:27 +0000 Subject: [maker-devel] maker problem In-Reply-To: References: <189553EC-9D1F-4C2F-8672-3562C8A4A088@oregonstate.edu> <177AD833-BA97-40CD-B500-1AD4531DE41A@gmail.com> <258E6D0D-6A34-42E2-91F3-F7693ED42E7C@genetics..utah.edu> <90D768D2-F911-4BA3-A8C5-1DAE79566114@oregonstate.edu> Message-ID: ok, let me explain my case. Genome- eukaryote We had run BUSCO and there is no problem in genome assembly. I used RepeatMasker (separately from maker pipeline) for masking the repeats using custom generated library (denovo repeats and repeat library from other species as well). The masked genome was used as input in maker_opts.ctl. Transcripts- We have RNA-Seq data assembled using velvet /oases from the same species as for genome sequenced. I globally aligned transcripts over assembled genome using GMAP with gave ~99% mapping. Gff3 generated from GMAP was also checked on genome browser. Those transcripts were used as est input in maker_opts.ctl. These assembled transcripts may have redundancy. Proteins- I used protein (fasta seq) sequences downloaded from uniprot for 5 closely related species and one from in-house sequenced genome (already published). Protein sequences from all 6 organisms are concatenated in one file and used as protein evidence in maker_opts.ctl. atleast=transcripts.fasta (from in-house sequenced genome (already published)) est2genome=1 protein2genome=1 Sorry for not explaining my case initially. 
What can be other files I can use as est evidence? Can I use Augustus generated hints for gene prediction along with above options? Your thoughts?? Parul On Oct 8, 2018, at 1:08 PM, Carson Hinton Holt > wrote: Also run BUSCO on your assembly. It will give you an estimate of how complete/incomplete your genome assembly is. Also make sure you are running on a genome assembly and not a transcriptome assembly (MAKER does not annotate transcriptomes). ?Carson On Oct 8, 2018, at 11:45 AM, Carson Holt > wrote: Look at the GFF3 particularly gene/mRNA/exon/CDS vs match/match_part features (GFF3 spec). Does your GFF3 contain gene/mRNA/exon/CDS entries? If not, then your GFF3 has no models (it?s empty even if it does contain match/match_part entries). This means either .1. no predictor was set during the run (i.e. est2genome=1 or protein2genome=1 not set) or 2. evidence alignments or assembly are so poor that no models can be made. Look at the results in a browser. Compare what you see on one of your contigs to what you get when running an example from the tutorials. Perhaps you provided unassembled mRNA-seq data (maker does not process raw mRNA-seq, it must be assembled first). Perhaps you did not provide a broad protein dataset (UniProt/Swiss-prot is usually a good one to use for example). Or perhaps your assembly is too fragmented and has too many runs of NNNNNN to generate matching ORFs against evidence alignments (look at results in a browser). ?Carson On Oct 8, 2018, at 11:31 AM, Gupta, Parul > wrote: Alright, I had gone through all those tutorials. But my question is - why maker generating only gff as an output ? there is neither transcripts.fasta nor proteins.fasta in output directory. So I can only use gff3_merge but not fasta_merge because there is no fasta files. This happened to all scaffolds. 
Below is an example from my datastore_index.log file for that scaffold:

ScJhAqd_1;HRSCAF=2 Sh_masked_rd2_datastore/18/62/ScJhAqd_1%3BHRSCAF=2/ STARTED
ScJhAqd_1;HRSCAF=2 Sh_masked_rd2_datastore/18/62/ScJhAqd_1%3BHRSCAF=2/ FINISHED

The output directory of that scaffold looks like:

[Linux at waterman ScJhAqd_1%3BHRSCAF=2]$ ll
total 160
drwxr-xr-x 3 guptapa pi 3 Oct 5 15:51 ../
-rw-r--r-- 1 guptapa pi 27740 Oct 5 15:51 run.log
-rw-r--r-- 1 guptapa pi 34268 Oct 5 15:51 ScJhAqd_1%3BHRSCAF=2.gff
drwxr-xr-x 2 guptapa pi 75 Oct 5 15:51 theVoid.ScJhAqd_1%3BHRSCAF=2/
drwxr-xr-x 3 guptapa pi 5 Oct 5 15:51 ./

The gff looks like:

[Linux at waterman ScJhAqd_1%3BHRSCAF=2]$ head ScJhAqd_1%3BHRSCAF=2.gff
##gff-version 3
ScJhAqd_1%3BHRSCAF%3D2 . contig 1 2578 . . . ID=ScJhAqd_1%3BHRSCAF%3D2;Name=ScJhAqd_1%3BHRSCAF%3D2;
ScJhAqd_1%3BHRSCAF%3D2 blastx protein_match 1782 2024 175 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hit:0;Name=Mlong585_29391-RA;
ScJhAqd_1%3BHRSCAF%3D2 blastx match_part 1782 2024 175 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hsp:0;Parent=ScJhAqd_1%3BHRSCAF%3D2:hit:0;Name=Mlong585_29391-RA;Target=Mlong585_29391-RA 132 212;Gap=M81;
ScJhAqd_1%3BHRSCAF%3D2 blastx protein_match 1785 2578 175 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hit:1;Name=Mlong585_37101-RA;
ScJhAqd_1%3BHRSCAF%3D2 blastx match_part 2477 2578 112 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hsp:1;Parent=ScJhAqd_1%3BHRSCAF%3D2:hit:1;Name=Mlong585_37101-RA;Target=Mlong585_37101-RA 28 61;Gap=M34;
ScJhAqd_1%3BHRSCAF%3D2 blastx match_part 1785 2042 175 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hsp:2;Parent=ScJhAqd_1%3BHRSCAF%3D2:hit:1;Name=Mlong585_37101-RA;Target=Mlong585_37101-RA 154 239;Gap=M86;
ScJhAqd_1%3BHRSCAF%3D2 blastx protein_match 1806 2578 128 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hit:2;Name=Mlong585_11451-RA;
ScJhAqd_1%3BHRSCAF%3D2 blastx match_part 2471 2578 132 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hsp:3;Parent=ScJhAqd_1%3BHRSCAF%3D2:hit:2;Name=Mlong585_11451-RA;Target=Mlong585_11451-RA 117 152;Gap=M36;
ScJhAqd_1%3BHRSCAF%3D2 blastx match_part 2299 2379 89 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hsp:4;Parent=ScJhAqd_1%3BHRSCAF%3D2:hit:2;Name=Mlong585_11451-RA;Target=Mlong585_11451-RA 153 179;Gap=M27;

Regards,
Parul

On Oct 8, 2018, at 11:34 AM, Carson Hinton Holt wrote:

GFF3 should have the assembly fasta at the bottom. That is part of the format. Please familiarize yourself with GFF3 here --> https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md

Particularly look at the different kinds of expected features (for example, gene/mRNA/exon/CDS gene models vs match/match_part evidence alignments).

Also you need to familiarize yourself with the MAKER documentation, and perhaps follow one of the step-by-step tutorials in the MAKER wiki (http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Main_Page). The 2014 tutorial has a video you can follow along with. Output files are described in the documentation and the wiki. Particularly look at the necessary gff3_merge and fasta_merge scripts, described in the wiki with multiple examples.

Individual contigs will have results like so -->
contig-dpp-500-500.gff
contig-dpp-500-500.maker.proteins.fasta
contig-dpp-500-500.maker.transcripts.fasta

The merge scripts will collect all the individual contig results into merged files.

Example datasets for all of the wiki tutorials are included in the .../maker/data directory as well as the .../maker/MWAS/data/ directory (you can use them to follow along with the wiki pages).

If you follow the tutorial steps for training SNAP on a new genome and you get empty training files, then the issue is the evidence training sets you gave (example from the e-mail list archive) --> https://groups.google.com/forum/#!searchin/maker-devel/maker2zff%7Csort:date/maker-devel/TculOM5oxl4/UWENIGN7EQAJ

You can also browse through the archive for more info on training SNAP and Augustus.
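
To make the gene-model vs evidence-alignment distinction above easier to check, a small shell sketch along these lines could tally the feature types (column 3) of a MAKER GFF3, stopping at the embedded ##FASTA section. The function name gff3_feature_counts is made up for illustration and is not a MAKER script; only standard awk and sort are assumed, and the input is assumed to be tab-delimited as the GFF3 spec requires.

```shell
#!/bin/sh
# Hypothetical helper (not part of MAKER): tally GFF3 feature types (column 3).
# Gene models show up as gene/mRNA/exon/CDS; evidence alignments show up as
# match/match_part, protein_match, expressed_sequence_match, and so on.
gff3_feature_counts() {
    awk -F '\t' '
        /^##FASTA/ { exit }        # stop at the embedded assembly FASTA (END still runs)
        /^#/       { next }        # skip directives and comments
        NF >= 9    { n[$3]++ }     # count each 9-column record by feature type
        END        { for (t in n) print t "\t" n[t] }
    ' "$1" | sort
}

# Usage: gff3_feature_counts all.gff
```

If the output lists only match-type features and no gene/mRNA/exon/CDS rows, the file holds evidence alignments but no gene models, which is the situation described in this thread.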

--Carson

On Oct 8, 2018, at 10:12 AM, Gupta, Parul wrote:

Hi Carson,

As per your suggestion, I turned on est2genome=1 and protein2genome=1, but similar results are generated. The gff of each scaffold has fasta (transcripts) sequence at the end instead of generating transcripts.fasta and protein.fasta separately. I don't know how to use such gffs for further processing such as training SNAP (for gene prediction). I need your suggestion. Is there an option to provide trained data from Augustus (generated with standalone Augustus rather than from maker) instead of the Augustus species in maker_opts.ctl?

Thanks,
Parul

On Oct 4, 2018, at 6:43 PM, Gupta, Parul wrote:

Thank you Carson.

Sent from my iPad

On Oct 4, 2018, at 3:11 PM, Carson Holt wrote:

You must turn on at least 1 prediction method. It can be est2genome=1, protein2genome=1, or a species file to run SNAP/Augustus. The first two options are for building models to train with. If you don't provide a prediction method, MAKER will align evidence, but you won't get any gene models.

Example: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_WGS_Assembly_and_Annotation_Winter_School_2018#Training_ab_initio_Gene_Predictors

--Carson

On Oct 1, 2018, at 1:05 PM, Gupta, Parul wrote:

Hi Carson,

I am a new user of the maker pipeline and wanted to get gene predictions for a new plant genome. I used the following options in the maker_opts.ctl file for the first round:

genome=masked_genome.fasta
est=transcripts.fasta (from same species for which genome fasta is provided)
atleast=transcripts.fasta (from alternative organism)
protein=proteins.fasta

Output files are only gff (no fasta); however, the gff for each scaffold has fasta sequences at the bottom. I wonder, is that the correct output? In order to train SNAP, I used gff3_merge to concatenate all gffs from datastore_index.log to get all.gff (which also has fasta sequences). Then all.gff was used for maker2zff, and it generated zero-size files (genome.ann and genome.dna).
I am wondering whether I made a mistake or did not provide all the input files. For repeat masking I used RepeatMasker separately from the maker pipeline. My datastore_index.log file shows many "RETRY" and "FAILED" scaffolds.

FYI, I subscribed to the "maker-devel" google group but the "new topic" button is greyed out.

Your suggestions?

Thanks in advance.
Parul
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From carsonhh at gmail.com Mon Oct 8 14:11:26 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 8 Oct 2018 14:11:26 -0600
Subject: [maker-devel] maker problem
In-Reply-To:
References: <189553EC-9D1F-4C2F-8672-3562C8A4A088@oregonstate.edu> <177AD833-BA97-40CD-B500-1AD4531DE41A@gmail.com> <258E6D0D-6A34-42E2-91F3-F7693ED42E7C@genetics..utah.edu> <90D768D2-F911-4BA3-A8C5-1DAE79566114@oregonstate.edu>
Message-ID: <41ABA575-D58A-4FE0-83CD-9312617AA635@gmail.com>

> We had run BUSCO and there is no problem in genome assembly. I used RepeatMasker (separately from maker pipeline) for masking the repeats using custom generated library (denovo repeats and repeat library from other species as well). The masked genome was used as input in maker_opts.ctl.

Let MAKER run masking if possible. Also BUSCO can be used to train Augustus, which can then become the gene predictor in MAKER.

> Transcripts-
> We have RNA-Seq data assembled using velvet/oases from the same species as for genome sequenced. I globally aligned transcripts over assembled genome using GMAP, which gave ~99% mapping. Gff3 generated from GMAP was also checked on genome browser. Those transcripts were used as est input in maker_opts.ctl. These assembled transcripts may have redundancy.

est2genome doesn't work with est_gff. You must provide a fasta of assembled transcripts. You can revert back to the GFF3 if you want after training.

> Proteins-
> I used protein (fasta seq) sequences downloaded from uniprot for 5 closely related species and one from in-house sequenced genome (already published). Protein sequences from all 6 organisms are concatenated in one file and used as protein evidence in maker_opts.ctl.

Look at the contigs in a browser. Find a contig with protein2genome results in the GFF3 (i.e. the source column is marked protein2genome in the GFF3), and look at it specifically. If you don't find any, then the issue is either your pre-masking or the evidence proteins you gave. I'd recommend using UniProt/Swiss-Prot, which contains a broad set of curated and conserved proteins.

> atleast=transcripts.fasta (from in-house sequenced genome (already published))

These will be ignored until you have a trained HMM (this type of alignment can only be used as hints to the trained predictor).

--Carson

From liorglic at mail.tau.ac.il Wed Oct 17 08:27:06 2018
From: liorglic at mail.tau.ac.il (Lior Glick)
Date: Wed, 17 Oct 2018 17:27:06 +0300
Subject: [maker-devel] Problem compiling MAKER with Intel MPI
Message-ID:

Hello,

I am trying to compile MAKER with Intel MPI. We are using a cluster based on Intel x86_64 architecture and using lmod for environment variables. All required dependencies have already been installed, and the initial 'perl Build.PL' passes without issues (see attached). When running './Build install', it always fails to find 'sys/types.h' and exits (see additional attachment). The Build command probably searches for the '/usr/include/sys/types.h' file, but no matter which variable (INCLUDE, PERL5LIB etc...) I update with the required path (either '/usr/include' or '/usr/include/sys'), it keeps failing.

I would appreciate your input. Thanks a lot!
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Build.PL.out
Type: application/octet-stream
Size: 2033 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Build_install.out Type: application/octet-stream Size: 6313 bytes Desc: not available URL: From anthony.bretaudeau at inria.fr Thu Oct 18 07:52:03 2018 From: anthony.bretaudeau at inria.fr (Anthony Bretaudeau) Date: Thu, 18 Oct 2018 15:52:03 +0200 Subject: [maker-devel] Segfault with OpenMPI In-Reply-To: <8D151E3B-353F-4FD5-94DB-95C1125A8176@gmail.com> References: <8D151E3B-353F-4FD5-94DB-95C1125A8176@gmail.com> Message-ID: An HTML attachment was scrubbed... URL: From Parul.Gupta at oregonstate.edu Mon Oct 8 14:40:06 2018 From: Parul.Gupta at oregonstate.edu (Gupta, Parul) Date: Mon, 8 Oct 2018 20:40:06 +0000 Subject: [maker-devel] maker problem In-Reply-To: <41ABA575-D58A-4FE0-83CD-9312617AA635@gmail.com> References: <189553EC-9D1F-4C2F-8672-3562C8A4A088@oregonstate.edu> <177AD833-BA97-40CD-B500-1AD4531DE41A@gmail.com> <258E6D0D-6A34-42E2-91F3-F7693ED42E7C@genetics..utah.edu> <90D768D2-F911-4BA3-A8C5-1DAE79566114@oregonstate.edu> <41ABA575-D58A-4FE0-83CD-9312617AA635@gmail.com> Message-ID: <20878280-1B0C-4CC5-BD92-20FB57A44662@oregonstate.edu> I used Augustus to generate training set (separately from maker) based on transcripts (fasta) so how I can use that Augustus generated trained data (hints in gff3 format) in maker for gene prediction? I can see only Augustus species option there in maker_opts.ctl. Which option I need to turn on in opts.ctl to put Augustus generated hints file? I have augustus.gff as predicted hints. est2genome doesn't work with est_gff. You must provide fasta of assembled transcripts. You can revert back to the GFF3 if you want after training. I used est_fasta not the est_gff. Find a contig with protein2genome results in the GFF3 yes I can see protein2genome results in gff3: ScJhAqd_2184%3BHRSCAF%3D3164 protein2genome protein_match 31566 32621 1426 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446673;Name=Mlong585_07911-RA; ScJhAqd_2184%3BHRSCAF%3D3164 protein2genome match_part 31566 31775 1426 + . 
ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1532540;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446673;Name=Mlong585_07911-RA;Target=Mlong585_07911-RA 82 154;Gap=M14 I3 M56; ScJhAqd_2184%3BHRSCAF%3D3164 protein2genome match_part 31872 32621 1426 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1532541;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446673;Name=Mlong585_07911-RA;Target=Mlong585_07911-RA 155 409;Gap=M126 I5 M124; ScJhAqd_2184%3BHRSCAF%3D3164 protein2genome protein_match 33816 35829 1394 - . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446674;Name=Mlong585_12901-RA; ScJhAqd_2184%3BHRSCAF%3D3164 protein2genome match_part 34916 35829 1394 - . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1532542;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446674;Name=Mlong585_12901-RA;Target=Mlong585_12901-RA 41 343;Gap=M27 D1 M276 F2; ScJhAqd_2184%3BHRSCAF%3D3164 protein2genome match_part 33816 34182 1394 - . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1532543;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446674;Name=Mlong585_12901-RA;Target=Mlong585_12901-RA 344 466;Gap=R2 M123; ScJhAqd_2184%3BHRSCAF%3D3164 protein2genome protein_match 49636 51466 1091 - . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446675;Name=Mlong585_07901-RA; ScJhAqd_2184%3BHRSCAF%3D3164 protein2genome match_part 51354 51466 1091 - . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1532544;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446675;Name=Mlong585_07901-RA;Target=Mlong585_07901-RA 1 36;Gap=M20 D1 M16 F2; and est2genome in gff3 as well: ScJhAqd_2184%3BHRSCAF%3D3164 est2genome expressed_sequence_match 48887305 48890708 16239 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547163;Name=Sh_Salba_v2_61181; ScJhAqd_2184%3BHRSCAF%3D3164 est2genome match_part 48887305 48889881 16239 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1871792;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547163;Name=Sh_Salba_v2_61181;Target=Sh_Salba_v2_61181 1 2590 +;Gap=M285 D1 M288 I10 M5 I4 M1998; ScJhAqd_2184%3BHRSCAF%3D3164 est2genome match_part 48889982 48890708 16239 + . 
ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1871793;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547163;Name=Sh_Salba_v2_61181;Target=Sh_Salba_v2_61181 2591 3317 +;Gap=M727; ScJhAqd_2184%3BHRSCAF%3D3164 est2genome expressed_sequence_match 48887305 48890708 16412 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547164;Name=Sh_Salba_v2_61182; ScJhAqd_2184%3BHRSCAF%3D3164 est2genome match_part 48887305 48889881 16412 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1871794;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547164;Name=Sh_Salba_v2_61182;Target=Sh_Salba_v2_61182 1 2590 +;Gap=M285 D1 M288 I10 M5 I4 M1998; ScJhAqd_2184%3BHRSCAF%3D3164 est2genome match_part 48889949 48890708 16412 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1871795;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547164;Name=Sh_Salba_v2_61182;Target=Sh_Salba_v2_61182 2591 3350 +;Gap=M760; ScJhAqd_2184%3BHRSCAF%3D3164 est2genome expressed_sequence_match 48895479 48899036 9582 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547165;Name=Sh_Salba_v2_108280; Thanks, Parul On Oct 8, 2018, at 3:11 PM, Carson Holt > wrote: We had run BUSCO and there is no problem in genome assembly. I used RepeatMasker (separately from maker pipeline) for masking the repeats using custom generated library (denovo repeats and repeat library from other species as well). The masked genome was used as input in maker_opts.ctl. Let MAKER run masking if possible. Also BUSCO can be used to train Augustus which can then become the gene predictor in MAKER. Transcripts- We have RNA-Seq data assembled using velvet /oases from the same species as for genome sequenced. I globally aligned transcripts over assembled genome using GMAP with gave ~99% mapping. Gff3 generated from GMAP was also checked on genome browser. Those transcripts were used as est input in maker_opts.ctl. These assembled transcripts may have redundancy. est2genome doesn't work with est_gff. You must provide fasta of assembled transcripts. You can revert back to the GFF3 if you want after training. 

Proteins-
I used protein (fasta seq) sequences downloaded from uniprot for 5 closely related species and one from in-house sequenced genome (already published). Protein sequences from all 6 organisms are concatenated in one file and used as protein evidence in maker_opts.ctl.

Look at the contigs in a browser. Find a contig with protein2genome results in the GFF3 (i.e. the source column is marked protein2genome in the GFF3), and look at it specifically. If you don't find any, then the issue is either your pre-masking or the evidence proteins you gave. I'd recommend using UniProt/Swiss-Prot, which contains a broad set of curated and conserved proteins.

atleast=transcripts.fasta (from in-house sequenced genome (already published))

These will be ignored until you have a trained HMM (this type of alignment can only be used as hints to the trained predictor).

--Carson
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From peachandolives at gmail.com Fri Oct 12 02:23:07 2018
From: peachandolives at gmail.com (Linnie Linnie)
Date: Fri, 12 Oct 2018 10:23:07 +0200
Subject: [maker-devel] maker-level google group
Message-ID:

Dear maker team,

I hope this email finds you well. I am a member of the maker-devel google group, but, somehow, I cannot post questions. Is there anything I can do on my end to fix this?

Also, I was wondering where I can download maker3 (I cannot seem to find it online). I have been using maker2, but I wanted to use EVM, and I have read that maker3 implements it.

Thank you so much for your help,
Linnie
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From yli at utexas.edu Tue Oct 16 22:49:13 2018
From: yli at utexas.edu (Yiyuan Li)
Date: Tue, 16 Oct 2018 23:49:13 -0500
Subject: [maker-devel] Speed up maker annotation on long scaffolds
Message-ID: <4361720F-0F1B-43DA-8931-218CCCD71AF4@utexas.edu>

Dear Maker support,

I have a quick question about annotating chromosome-level scaffolds.
I have a new genome assembly from Hi-C data. The top 4 scaffolds are chromosome-level, ~100-170 Mbp long. I tried to use Maker MPI but it runs slowly. Each scaffold has been running for weeks. Do you have any suggestions on how to make the annotation process faster?

Thank you!

YY

From peachandolives at gmail.com Thu Oct 18 02:29:57 2018
From: peachandolives at gmail.com (Linnie Linnie)
Date: Thu, 18 Oct 2018 10:29:57 +0200
Subject: [maker-devel] maker3
Message-ID:

Dear maker team,

I am trying to run maker and use its output as input for EVM. From the EVM website, I gather that I need to provide it with .gff files. Maker2 does output one .gff, but I was wondering how to produce .gff files for the proteins and EST data.

Alternatively, I have read that maker3 implements EVM. I would be happy to try this option, but I don't know where I can download maker3 from.

I would appreciate any help. Thank you very much!

Linnie
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From carsonhh at gmail.com Fri Oct 19 11:02:22 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 19 Oct 2018 11:02:22 -0600
Subject: [maker-devel] maker problem
In-Reply-To: <20878280-1B0C-4CC5-BD92-20FB57A44662@oregonstate.edu>
References: <189553EC-9D1F-4C2F-8672-3562C8A4A088@oregonstate.edu> <177AD833-BA97-40CD-B500-1AD4531DE41A@gmail.com> <258E6D0D-6A34-42E2-91F3-F7693ED42E7C@genetics..utah.edu> <90D768D2-F911-4BA3-A8C5-1DAE79566114@oregonstate.edu> <41ABA575-D58A-4FE0-83CD-9312617AA635@gmail.com> <20878280-1B0C-4CC5-BD92-20FB57A44662@oregonstate.edu>
Message-ID: <3F78E884-11AF-4291-A8FC-D81F6F55B47D@gmail.com>

Once Augustus is trained it will have a new species directory under .../augustus/config/species/ for the organism you just trained. Or if you trained Augustus elsewhere (website, BUSCO, etc.) you have to copy the species data there.
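
As a rough illustration of that copy step, assuming the trained parameters sit in a directory named after the species and that the Augustus config directory is exposed through the AUGUSTUS_CONFIG_PATH environment variable. The function name install_augustus_species and the example path are hypothetical, not part of Augustus or MAKER.

```shell
#!/bin/sh
# Hypothetical helper: copy a trained Augustus species directory into
# $AUGUSTUS_CONFIG_PATH/species/ so that it can be selected by name.
install_augustus_species() {
    src=$1                                  # e.g. /path/to/my_species
    if [ -z "${AUGUSTUS_CONFIG_PATH:-}" ]; then
        echo "AUGUSTUS_CONFIG_PATH is not set" >&2
        return 1
    fi
    name=$(basename "$src")
    dest="$AUGUSTUS_CONFIG_PATH/species/$name"
    if [ -e "$dest" ]; then
        # Never clobber an existing species model.
        echo "refusing to overwrite existing $dest" >&2
        return 1
    fi
    mkdir -p "$AUGUSTUS_CONFIG_PATH/species" && cp -R "$src" "$dest" && echo "$name"
}

# Usage (prints the species name, which would then go into maker_opts.ctl
# as augustus_species=<name>):
#   install_augustus_species /path/to/my_species
```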
Then you just supply the species name and Augustus automatically finds it (see the Augustus documentation on training).

For est2genome=1 and protein2genome=1, MAKER takes the alignments from exonerate protein2genome and est2genome and, if they are mostly open reading frame, just turns them directly into gene/mRNA/exon/CDS models. If there are none of those in the resulting GFF3 but there are est2genome and protein2genome alignments, then all of them have broken ORFs. That means there are serious issues with your assembly, or with the est fasta or protein fasta file. For a protein fasta, I recommend using UniProt/Swiss-Prot because it is manually curated and contains a broad dataset. But if you cannot get gene models from UniProt/Swiss-Prot protein2genome alignments, then your assembly has issues (either too fragmented, lots of errors inducing random stop codons, or lots of N's interspersed in the sequence).

--Carson

> On Oct 8, 2018, at 2:40 PM, Gupta, Parul wrote:
>
> I used Augustus to generate training set (separately from maker) based on transcripts (fasta) so how I can use that Augustus generated trained data (hints in gff3 format) in maker for gene prediction? I can see only Augustus species option there in maker_opts.ctl. Which option I need to turn on in opts.ctl to put Augustus generated hints file? I have augustus.gff as predicted hints.
>
>> est2genome doesn't work with est_gff. You must provide fasta of assembled transcripts. You can revert back to the GFF3 if you want after training.
>
> I used est_fasta not the est_gff.
>
>> Find a contig with protein2genome results in the GFF3
>
> yes I can see protein2genome results in gff3:
>
> ScJhAqd_2184%3BHRSCAF%3D3164 protein2genome protein_match 31566 32621 1426 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446673;Name=Mlong585_07911-RA;
> ScJhAqd_2184%3BHRSCAF%3D3164 protein2genome match_part 31566 31775 1426 + .
> ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1532540;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446673;Name=Mlong585_07911-RA;Target=Mlong585_07911-RA 82 154;Gap=M14 I3 M56; > ScJhAqd_2184%3BHRSCAF%3D3164 > protein2genome > match_part 31872 > 32621 1426 > + . > ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1532541;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446673;Name=Mlong585_07911-RA;Target=Mlong585_07911-RA 155 409;Gap=M126 I5 M124; > ScJhAqd_2184%3BHRSCAF%3D3164 > protein2genome > protein_match 33816 35829 > 1394 - > . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446674;Name=Mlong585_12901-RA; > ScJhAqd_2184%3BHRSCAF%3D3164 > protein2genome > match_part 34916 > 35829 1394 > - . > ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1532542;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446674;Name=Mlong585_12901-RA;Target=Mlong585_12901-RA 41 343;Gap=M27 D1 M276 F2; > ScJhAqd_2184%3BHRSCAF%3D3164 > protein2genome > match_part 33816 > 34182 1394 > - . > ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1532543;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446674;Name=Mlong585_12901-RA;Target=Mlong585_12901-RA 344 466;Gap=R2 M123; > ScJhAqd_2184%3BHRSCAF%3D3164 > protein2genome > protein_match 49636 51466 > 1091 - > . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446675;Name=Mlong585_07901-RA; > ScJhAqd_2184%3BHRSCAF%3D3164 > protein2genome > match_part 51354 > 51466 1091 > - . > ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1532544;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446675;Name=Mlong585_07901-RA;Target=Mlong585_07901-RA 1 36;Gap=M20 D1 M16 F2; > > and est2genome in gff3 as well: > > ScJhAqd_2184%3BHRSCAF%3D3164 > est2genome > expressed_sequence_match > 48887305 48890708 > 16239 + > . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547163;Name=Sh_Salba_v2_61181; > ScJhAqd_2184%3BHRSCAF%3D3164 > est2genome > match_part 48887305 > 48889881 16239 > + . 
> ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1871792;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547163;Name=Sh_Salba_v2_61181;Target=Sh_Salba_v2_61181 1 2590 +;Gap=M285 D1 M288 I10 M5 I4 M1998; > ScJhAqd_2184%3BHRSCAF%3D3164 > est2genome > match_part 48889982 > 48890708 16239 > + . > ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1871793;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547163;Name=Sh_Salba_v2_61181;Target=Sh_Salba_v2_61181 2591 3317 +;Gap=M727; > ScJhAqd_2184%3BHRSCAF%3D3164 > est2genome > expressed_sequence_match > 48887305 48890708 > 16412 + > . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547164;Name=Sh_Salba_v2_61182; > ScJhAqd_2184%3BHRSCAF%3D3164 > est2genome > match_part 48887305 > 48889881 16412 > + . > ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1871794;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547164;Name=Sh_Salba_v2_61182;Target=Sh_Salba_v2_61182 1 2590 +;Gap=M285 D1 M288 I10 M5 I4 M1998; > ScJhAqd_2184%3BHRSCAF%3D3164 > est2genome > match_part 48889949 > 48890708 16412 > + . > ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1871795;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547164;Name=Sh_Salba_v2_61182;Target=Sh_Salba_v2_61182 2591 3350 +;Gap=M760; > ScJhAqd_2184%3BHRSCAF%3D3164 > est2genome > expressed_sequence_match > 48895479 48899036 > 9582 + > . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547165;Name=Sh_Salba_v2_108280; > > Thanks, > Parul > >> On Oct 8, 2018, at 3:11 PM, Carson Holt > wrote: >> >> >>> We had run BUSCO and there is no problem in genome assembly. I used RepeatMasker (separately from maker pipeline) for masking the repeats using custom generated library (denovo repeats and repeat library from other species as well). The masked genome was used as input in maker_opts.ctl. >> >> Let MAKER run masking if possible. Also BUSCO can be used to train Augustus which can then become the gene predictor in MAKER. >> >> >>> Transcripts- >>> We have RNA-Seq data assembled using velvet /oases from the same species as for genome sequenced. 
>>> I globally aligned transcripts over assembled genome using GMAP with gave ~99% mapping. Gff3 generated from GMAP was also checked on genome browser. Those transcripts were used as est input in maker_opts.ctl. These assembled transcripts may have redundancy.
>>
>> est2genome doesn't work with est_gff. You must provide fasta of assembled transcripts. You can revert back to the GFF3 if you want after training.
>>
>>> Proteins-
>>> I used protein (fasta seq) sequences downloaded from uniprot for 5 closely related species and one from in-house sequenced genome (already published). Protein sequences from all 6 organisms are concatenated in one file and used as protein evidence in maker_opts.ctl.
>>
>> Look at the contigs in a browser. Find a contig with protein2genome results in the GFF3 (i.e. the column is marked protein2genome in the GFF3), and look at it specifically. If you don't find any, then the issue is either your pre-masking or the evidence proteins you gave. I'd recommend using UniProt/Swiss-Prot, which contains a broad set of curated and conserved proteins.
>>
>>> atleast=transcripts.fasta (from in-house sequenced genome (already published))
>>
>> These will be ignored until you have a trained HMM (this type of alignment can only be used as hints to the trained predictor).
>>
>> --Carson
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From carsonhh at gmail.com Fri Oct 19 11:09:30 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 19 Oct 2018 11:09:30 -0600
Subject: [maker-devel] Speed up maker annotation on long scaffolds
In-Reply-To: <4361720F-0F1B-43DA-8931-218CCCD71AF4@utexas.edu>
References: <4361720F-0F1B-43DA-8931-218CCCD71AF4@utexas.edu>
Message-ID: <28BAD1D1-77BA-4F50-A54F-7E402589E76F@gmail.com>

You might not have MPI set up correctly. MPI spread across 10 machines (20 cores each) can annotate an entire maize chromosome in ~20 minutes. A few tests.

# This command should print all the hosts you are running MPI on and how many cores are used on each host. If you don't see multiple hosts, you are not spreading across machines.
mpiexec hostname | sort | uniq -c

# This will let you know if maker is running MPI correctly (it should print the help message only once).
mpiexec maker -h

--Carson

> On Oct 16, 2018, at 10:49 PM, Yiyuan Li wrote:
>
> Dear Maker support,
> I have a quick question about annotating chromosome-level scaffolds. I have a new genome assembly from Hi-C data. The top 4 scaffolds are chromosome-level, which are ~100-170M bp long. I tried to use Maker MPI but it runs slow. Each scaffold has been running for weeks. I was wondering if you may have any suggestions on how to make the annotation process faster?
>
> Thank you!
>
> YY
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From carsonhh at gmail.com Fri Oct 19 11:22:12 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 19 Oct 2018 11:22:12 -0600
Subject: [maker-devel] Segfault with OpenMPI
In-Reply-To:
References: <8D151E3B-353F-4FD5-94DB-95C1125A8176@gmail.com>
Message-ID: <78C1AB95-8D23-4D71-939B-5B68666BE5B7@gmail.com>

RepeatMasker does some data prep during installation (it creates new files in the process), and that does not happen for the bioconda RepeatMasker recipe. So it's broken. For fixing it, look at the homebrew recipe for RepeatMasker. It does a good job, and it also preconfigures itself for the free Dfam database rather than RepBase light --> https://github.com/brewsci/homebrew-bio/blob/master/Formula/repeatmasker.rb

te_proteins is not a RepeatMasker file. It's a RepeatRunner file which has been integrated into MAKER. MAKER just needs to be able to find it. It will look in the .../maker/data/ directory by default and put the location in te_protein= by default.
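
One way to check whether the RepeatMasker data prep mentioned in this reply actually happened is to look for the files the configuration step is expected to create (the five names listed in the quoted message below). A minimal sketch: the function name check_repeatmasker_libs is made up, and the directory is passed as an argument because its location depends on the install.

```shell
#!/bin/sh
# Hypothetical check: list which of the RepeatMasker library files named in
# this thread are missing from a directory. Returns non-zero if any is absent.
check_repeatmasker_libs() {
    dir=$1
    missing=0
    for f in RepeatMasker.lib RepeatMasker.lib.nhr RepeatMasker.lib.nin \
             RepeatMasker.lib.nsq RepeatMaskerLib.embl; do
        if [ ! -e "$dir/$f" ]; then
            echo "missing: $f"
            missing=1
        fi
    done
    return "$missing"
}

# Usage (the path is only an example; point it at wherever your install
# keeps its RepeatMasker libraries):
#   check_repeatmasker_libs /path/to/RepeatMasker/Libraries
```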

--Carson

> On Oct 18, 2018, at 7:52 AM, Anthony Bretaudeau wrote:
>
> Hi,
>
> I think I finally found a solution for this segfault. In short: run "export THREADS_DAEMON_MODEL=1" before running maker.
>
> After looking at the debug log, I noticed that the segfault happened the first time the perl system() function was called (usually to launch a "mv" command).
>
> This + the backtrace shows that it has something to do with signal handling when running child processes from threads.
>
> After a lot of trial and error modifying the code, I found this page talking about this env var: https://metacpan.org/pod/forks#Co-existance-with-fork-aware-modules-and-environments
> It seems to be enough to avoid the segfault. I have no idea if it could have any downside, but maker seems to give the same results as in non-MPI mode.
>
> Concerning RepeatMasker not being installed correctly, it seems to be intended, as written in the RepeatMasker conda recipe: https://github.com/bioconda/bioconda-recipes/blob/master/recipes/repeatmasker/build.sh#L16
> I use the REPEATMASKER_LIB_DIR env var so it's not really a problem for me, and the galaxy tool is doing the same (https://github.com/galaxyproject/tools-iuc/blob/master/tools/maker/maker.xml#L11).
>
> I'm not a RepeatMasker expert, so I don't know if providing the old database would make more sense...
>
> I guess it's the same question for te_proteins.
>
> Cheers
>
> Anthony
>
> On 05/10/2018 at 22:37, Carson Holt wrote:
>> I tried setting this up but there are a number of issues I run into.
>>
>> First, RepeatMasker is not being installed correctly. The configuration step should create these files (created by the ./configure script during RepeatMasker setup) -->
>> RepeatMasker.lib
>> RepeatMasker.lib.nhr
>> RepeatMasker.lib.nin
>> RepeatMasker.lib.nsq
>> RepeatMaskerLib.embl
>>
>> But they do not exist in the share directory.
>>
>> Also MAKER needs access to the te_proteins file in .../maker/data, and because you have rearranged maker's structure it can't find it.
>>
>> Then for the segmentation fault, I have seen this a handful of times in the past where users install their own version of perl rather than using the system perl together with their own install of OpenMPI. The issue is some series of flags either in OpenMPI or perl (I'm not sure which). But one way around it is to disable the interpreter threads option when compiling and installing perl for yourself. Most system perl installs have interpreter threads enabled, so I'm not sure why some self-installs generate this segfault and never the system perl. Interestingly, interpreter threads are turned off by default when you install perl manually, as they are "officially discouraged". You actually have to enable them during the self-install process, and conda is enabling them on the manual install to match most system perls.
>>
>> Another workaround is to not use OpenMPI. Try MPICH3.
>>
>> --Carson
>>
>>> On Sep 25, 2018, at 6:10 AM, Anthony Bretaudeau wrote:
>>>
>>> Hi,
>>>
>>> I've worked on the Bioconda recipe for Maker (https://github.com/bioconda/bioconda-recipes/tree/master/recipes/maker/). It works well, except when using it in MPI mode. I get this segfault error:
>>>
>>> STATUS: Processing and indexing input FASTA files...
>>> [cl1n022:06306] *** Process received signal *** >>> [cl1n022:06306] Signal: Segmentation fault (11) >>> [cl1n022:06306] Signal code: Address not mapped (1) >>> [cl1n022:06306] Failing at address: 0x514 >>> [cl1n022:06306] [ 0] /lib64/libpthread.so.0(+0xf6d0)[0x2b9ce51026d0] >>> [cl1n022:06306] [ 1] /local/miniconda3/envs/maker-2.31.10/bin/perl(Perl_csighandler+0x1e)[0x4aad4e] >>> [cl1n022:06306] [ 2] /lib64/libpthread.so.0(+0xf6d0)[0x2b9ce51026d0] >>> [cl1n022:06306] [ 3] /lib64/libc.so.6(__poll+0x2d)[0x2b9ce5f5cf0d] >>> [cl1n022:06306] [ 4] /local/miniconda3/envs/maker-2.31.10/perl/lib/auto/Parallel/Application/MPI/../../../../../../lib/./libopen-pal.so.40(+0x869e5)[0x2b9cf05859e5] >>> [cl1n022:06306] [ 5] /local/miniconda3/envs/maker-2.31.10/perl/lib/auto/Parallel/Application/MPI/../../../../../../lib/./libopen-pal.so.40(opal_libevent2022_event_base_loop+0x242)[0x2b9cf057a73a] >>> [cl1n022:06306] [ 6] /local/miniconda3/envs/maker-2.31.10/perl/lib/auto/Parallel/Application/MPI/../../../../../../lib/./libopen-pal.so.40(+0x384de)[0x2b9cf05374de] >>> [cl1n022:06306] [ 7] /lib64/libpthread.so.0(+0x7e25)[0x2b9ce50fae25] >>> [cl1n022:06306] [ 8] /lib64/libc.so.6(clone+0x6d)[0x2b9ce5f67bad] >>> [cl1n022:06306] *** End of error message *** >>> SIGTERM received >>> SIGTERM received >>> >>> >>> As mentioned in older posts, I've tried adding the LD_PRELOAD variable, or running mpirun with the "-mca btl ^openib" option, but it didn't help. >>> >>> As this happens with the Bioconda package, I guess it should be pretty reproducible on other setups. >>> >>> Bioconda's Maker package uses version 5.26.2 of Perl and version 3.1.2 of OpenMPI, and the OpenMPI recipe is on https://github.com/conda-forge/openmpi-feedstock/tree/master/recipe >>> Any help would be highly appreciated! 
>>>
>>> Anthony Bretaudeau
>>>
>>> _______________________________________________
>>> maker-devel mailing list
>>> maker-devel at box290.bluehost.com
>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From carsonhh at gmail.com Fri Oct 19 11:25:40 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 19 Oct 2018 11:25:40 -0600
Subject: [maker-devel] maker3
In-Reply-To:
References:
Message-ID: <1D30ACCC-1DC4-451E-8553-8AB8ADA269A2@gmail.com>

The maker 3 beta is one of the links you get when you register to download maker. It will be the link directly under the stable release link -> http://yandell.topaz.genetics.utah.edu/cgi-bin/maker_license.cgi

Also, you can use grep to pull specific lines out of a gff3 file. Example:

    grep -P "\tprotein2genome\t" all.gff > protein2genome.gff

That command will grab all the protein2genome features out of a file.

--Carson

> On Oct 18, 2018, at 2:29 AM, Linnie Linnie wrote:
>
> Dear maker team,
>
> I am trying to run maker and use its input for EVM. From the EVM website, I gather that I need to provide it with .gff files. Maker2 does output one .gff, but I was wondering how to produce .gff files for the proteins and ETS data.
>
> Alternatively, I have read that maker3 implements EVM. I would be happy to try this option, but I don't know where can I download maker3 from.
>
> I would appreciate any help. Thank you very much!
>
> Linnie
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
-------------- next part --------------
An HTML attachment was scrubbed...
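The grep one-liner in the maker3 reply relies on `grep -P`, which not every grep supports. The same split can be sketched in Python. This is only a minimal sketch and not part of MAKER: the function name `split_by_source` and the sample records are made up for illustration, and the only thing taken from the thread is that the evidence type (e.g. protein2genome) sits in the second, tab-separated column of the GFF3.

```python
from collections import defaultdict

def split_by_source(gff_lines, wanted=None):
    """Group GFF3 feature lines by their source column (column 2)."""
    groups = defaultdict(list)
    for line in gff_lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # everything after this directive is sequence, not features
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and other directives
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        if wanted is None or cols[1] in wanted:
            groups[cols[1]].append(line)
    return dict(groups)

# Hypothetical records, shaped like MAKER output but invented for this example.
sample = [
    "##gff-version 3",
    "contig1\tmaker\tgene\t100\t900\t.\t+\t.\tID=gene1",
    "contig1\tprotein2genome\tprotein_match\t120\t880\t2580\t+\t.\tID=hit1;Name=P46461.1",
    "contig1\test2genome\texpressed_sequence_match\t100\t900\t500\t+\t.\tID=hit2",
    "##FASTA",
    ">contig1",
    "ACGT",
]

evidence = split_by_source(sample, wanted={"protein2genome", "est2genome"})
for source, records in evidence.items():
    print(source, len(records))
```

Each list in `evidence` can then be written to its own file (one per evidence type), which is the shape of input the question about EVM was asking for.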
URL:

From jacques.dainat at nbis.se Tue Oct 23 07:56:09 2018
From: jacques.dainat at nbis.se (Jacques Dainat)
Date: Tue, 23 Oct 2018 15:56:09 +0200
Subject: [maker-devel] CIGAR string explanation
Message-ID: <8783564C-A8FA-419A-A651-EE53C1563A7F@nbis.se>

Hello,

Here is an example of the cigar string output from exonerate (exactly the same command as launched by MAKER):

cigar: P46461.1 3 740 . genome 460484 439594 - 2580 M 84 I 1 D 56 M 154 I 3 M 54 D 1554 M 145 D 3346 M 137 D 120 M 160 D 197 M 182 D 145 M 165 D 415 M 170 D 5037 M 321 D 124 M 158 D 116 M 183 D 1819 M 157 D 5776 M 115
vulgar: P46461.1 3 740 . genome 460484 439594 - 2580 M 28 84 G 1 0 S 0 2 5 0 2 I 0 50 3 0 2 S 1 1 M 51 153 G 3 0 M 18 54 S 0 2 5 0 2 I 0 1548 3 0 2 S 1 1 M 48 144 S 0 1 5 0 2 I 0 3341 3 0 2 S 1 2 M 45 135 S 0 2 5 0 2 I 0 114 3 0 2 S 1 1 M 53 159 S 0 1 5 0 2 I 0 192 3 0 2 S 1 2 M 60 180 5 0 2$
-- completed exonerate analysis

and here is the result we get in the protein2genome.gff output from MAKER:

@000426F|arrow|arrow protein2genome protein_match 439595 460484 2580 - . ID=@000426F|arrow|arrow:hit:153696:3.10.0.4;Name=P46461.1;target_length=745;aligned_coverage=98.93;aligned_identity=72.6
@000426F|arrow|arrow protein2genome match_part 460399 460484 2580 - . ID=@000426F|arrow|arrow:hsp:233933:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 4 32;Gap=F2 I1 M28
@000426F|arrow|arrow protein2genome match_part 460135 460344 2580 - . ID=@000426F|arrow|arrow:hsp:233934:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 33 105;Gap=F2 M18 I3 M52 R2
@000426F|arrow|arrow protein2genome match_part 458437 458582 2580 - . ID=@000426F|arrow|arrow:hsp:233935:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 106 154;Gap=F1 M49 R2
@000426F|arrow|arrow protein2genome match_part 454953 455091 2580 - . ID=@000426F|arrow|arrow:hsp:233936:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 155 200;Gap=F2 M46 R1
@000426F|arrow|arrow protein2genome match_part 454674 454834 2580 - . ID=@000426F|arrow|arrow:hsp:233937:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 201 254;Gap=F1 M54 R2
@000426F|arrow|arrow protein2genome match_part 454296 454477 2580 - . ID=@000426F|arrow|arrow:hsp:233938:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 255 315;Gap=M61 R1
@000426F|arrow|arrow protein2genome match_part 453985 454150 2580 - . ID=@000426F|arrow|arrow:hsp:233939:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 316 370;Gap=F1 M55
@000426F|arrow|arrow protein2genome match_part 453401 453570 2580 - . ID=@000426F|arrow|arrow:hsp:233940:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 371 427;Gap=M57 R1
@000426F|arrow|arrow protein2genome match_part 448042 448363 2580 - . ID=@000426F|arrow|arrow:hsp:233941:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 428 534;Gap=F1 M107
@000426F|arrow|arrow protein2genome match_part 447761 447918 2580 - . ID=@000426F|arrow|arrow:hsp:233942:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 535 587;Gap=M53 R1
@000426F|arrow|arrow protein2genome match_part 447460 447644 2580 - . ID=@000426F|arrow|arrow:hsp:233943:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 588 648;Gap=F2 M61
@000426F|arrow|arrow protein2genome match_part 445484 445642 2580 - . ID=@000426F|arrow|arrow:hsp:233944:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 649 701;Gap=F2 M53 R2
@000426F|arrow|arrow protein2genome match_part 439595 439709 2580 - . ID=@000426F|arrow|arrow:hsp:233945:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 702 740;Gap=M39 R2

MAKER apparently processes the CIGAR string and saves it into the Gap attribute.
The value looks like a CIGAR string, but it is different. Here are the different letters we can find (M, D, I, R, F). I guess M=match, D=deletion and I=insertion, but I don't get the meaning of R and F.
Could you explain their meanings?

Best regards,

/Jacques
-------------------------------------------------
Jacques Dainat, Ph.D.
NBIS (National Bioinformatics Infrastructure Sweden)
Genome Annotation Service
http://nbis.se/about/staff/jacques-dainat
http://nbis.se

Contact
Address: Uppsala University, Biomedicinska Centrum
Department of Medical Biochemistry Microbiology, Genomics
Husargatan 3, box 582
S-75123 Uppsala Sweden
Phone: +46 18 471 46 25
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From carsonhh at gmail.com Tue Oct 23 09:55:51 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 23 Oct 2018 09:55:51 -0600
Subject: [maker-devel] CIGAR string explanation
In-Reply-To: <8783564C-A8FA-419A-A651-EE53C1563A7F@nbis.se>
References: <8783564C-A8FA-419A-A651-EE53C1563A7F@nbis.se>
Message-ID: <9B7CB8C1-2272-4E2A-A435-73642920B623@gmail.com>

Once upon a time the link in the official GFF3 specification to the cigar string documentation actually worked, and it would bring you to a nice page that explained everything. It described how F and R were to be used on protein-space alignments (F is a forward frameshift and R is a reverse frameshift in the alignment). M1 in protein space is actually an amino acid match (it matches 3 bp in nucleotide space); this was previously clear in the now broken link. In the same way, I1 is an amino acid insertion (3 bp in nucleotide space), and D1 is an amino acid deletion (3 bp in nucleotide space). F and R therefore allow for single-bp movement either to the left or right within amino acid space. Sometimes this happens in Exonerate where it appears as a slightly shifted codon (codons look stacked), but it also happens when an amino acid is split across a splice site (the first part of a codon is on one exon and the second part on the next exon).

The raw exonerate cigar you show doesn't have this because it's only half the cigar and it's in nucleotide space. The value shown in Gap= has to be in the same space as the Target= feature, which in this case is a protein. So we build the protein cigar string from the vulgar string according to the now broken documentation on Gap attributes. You have 28 amino acid matches, 1 insertion, and then an amino acid split across the intron (1 bp of the codon on one side and 2 bp on the other side), and it's flipped because the alignment happens on the opposite strand.

--Carson
-------------- next part --------------
An HTML attachment was scrubbed...
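The protein-space reading of the Gap attribute described in the CIGAR thread can be checked against the match_part records posted there. The sketch below is a hedged illustration, not MAKER's code: the function name `gap_spans` and the two lookup tables are made up, and it assumes that M and D each cover 3 genome bases per unit, that F moves forward and R moves back one base per unit, and that I consumes a protein residue without consuming genome bases. That last point is an inference from the posted coordinates (and from the `G 1 0` step in the vulgar line); it is not confirmed in the thread.

```python
import re

# Genome bases and protein residues consumed per unit of each Gap operation,
# under the assumptions stated in the lead-in.
_REF_PER_UNIT = {"M": 3, "D": 3, "I": 0, "F": 1, "R": -1}
_TARGET_PER_UNIT = {"M": 1, "I": 1, "D": 0, "F": 0, "R": 0}

def gap_spans(gap):
    """Return (genome_bp, protein_residues) covered by a Gap attribute value."""
    ref = target = 0
    for token in gap.split():
        m = re.fullmatch(r"([MIDFR])(\d+)", token)
        if m is None:
            raise ValueError(f"unexpected Gap token: {token!r}")
        op, n = m.group(1), int(m.group(2))
        ref += _REF_PER_UNIT[op] * n
        target += _TARGET_PER_UNIT[op] * n
    return ref, target

# (start, end, target_start, target_end, Gap) copied from three of the
# match_part records in the original message.
records = [
    (460399, 460484, 4, 32, "F2 I1 M28"),
    (460135, 460344, 33, 105, "F2 M18 I3 M52 R2"),
    (439595, 439709, 702, 740, "M39 R2"),
]

for start, end, t_start, t_end, gap in records:
    ref, target = gap_spans(gap)
    # Both comparisons should hold if the reading above is right.
    print(gap, ref == end - start + 1, target == t_end - t_start + 1)
```

For the first record this gives 2 + 0 + 84 = 86 genome bases (460399 to 460484) and 1 + 28 = 29 residues (4 to 32), so the Gap string accounts for both coordinate ranges.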
URL:

From peachandolives at gmail.com Wed Oct 24 03:28:52 2018
From: peachandolives at gmail.com (Linnie Linnie)
Date: Wed, 24 Oct 2018 05:28:52 -0400
Subject: [maker-devel] EVM control file and est2genome
Message-ID:

Hi,

I am trying to run maker together with EVM. I want to annotate a genome for which there is no evidence data, which is why I am using ESTs and protein data from a closely related species. I am running into two unrelated issues.

The first one is the following:
I set up the control files passing alt_est with a fasta file of ESTs, protein with proteins from a closely related species as well as uniprot-sprot.fa, est2genome=1 and protein2genome=1. I am getting the following error:

>ERROR: You must provide some form of EST evidence to use est2genome as a predictor.

Does this mean I can only use est2genome with ESTs from the species of interest?

The second error relates to EVM:
In the file maker_opts.ctl I have set the option run_evm=1. I have used the default parameters in the file maker_evm.ctl. I am getting the following error:

>ERROR: You have failed to provide a value for 'evm' in the control files.

Does this error relate to the maker_opts.ctl file or the maker_evm.ctl one? How could I fix it?

And lastly, a more general but fundamental question: is my approach sensible? My plan is to run this evidence-based annotation, then perhaps train SNAP, Augustus and GeneMark, and use those output files to re-run maker with ab-initio parameters.

I would appreciate any input on any of these issues.

Thank you!
-------------- next part --------------
An HTML attachment was scrubbed...
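The first error in the EVM message reads as if est2genome expects input in est= and does not use the alternate-organism EST option. A small pre-flight check along those lines might look like the sketch below. It is only a sketch under stated assumptions: `parse_ctl` and `check_opts` are hypothetical helpers, not part of MAKER. The control file fragment is invented, and the `altest` key is my understanding of the option the message calls alt_est. The rule itself (est2genome=1 needs a non-empty est=) may be stricter than what MAKER actually enforces, since other EST inputs are not considered here.

```python
def parse_ctl(text):
    """Parse 'key=value #comment' lines of a MAKER-style control file."""
    opts = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        opts[key.strip()] = value.strip()
    return opts

def check_opts(opts):
    """Return a list of likely problems (assumed rule, see lead-in)."""
    problems = []
    if opts.get("est2genome") == "1" and not opts.get("est"):
        problems.append("est2genome=1 but est= is empty")
    return problems

# Invented fragment mirroring the setup described in the message:
# ESTs only from a related species, est2genome switched on.
example = """\
#-----EST Evidence
est= #ESTs from the species being annotated
altest=related_species_ests.fasta #ESTs from an alternate organism
#-----Protein Homology Evidence
protein=related_proteins.fasta,uniprot-sprot.fa
est2genome=1
protein2genome=1
"""

problems = check_opts(parse_ctl(example))
print(problems)
```

With the fragment above the check reports the est2genome setting, which matches the situation the error message describes.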
URL:

From anthony.bretaudeau at inria.fr Wed Oct 24 09:07:48 2018
From: anthony.bretaudeau at inria.fr (Anthony Bretaudeau)
Date: Wed, 24 Oct 2018 17:07:48 +0200
Subject: [maker-devel] Segfault with OpenMPI
In-Reply-To: <78C1AB95-8D23-4D71-939B-5B68666BE5B7@gmail.com>
References: <8D151E3B-353F-4FD5-94DB-95C1125A8176@gmail.com> <78C1AB95-8D23-4D71-939B-5B68666BE5B7@gmail.com>
Message-ID:

An HTML attachment was scrubbed...
URL:

From carsonhh at gmail.com Wed Oct 24 09:46:30 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 24 Oct 2018 09:46:30 -0600
Subject: [maker-devel] Segfault with OpenMPI
In-Reply-To:
References: <8D151E3B-353F-4FD5-94DB-95C1125A8176@gmail.com> <78C1AB95-8D23-4D71-939B-5B68666BE5B7@gmail.com>
Message-ID: <62EBFA6C-4194-4D65-8313-F67EFCAEF47A@gmail.com>

It divides up pieces of contigs as well as individual steps. BLAST, exonerate, snap, augustus can each run on separate machines.

--Carson

> On Oct 24, 2018, at 9:07 AM, Anthony Bretaudeau wrote:
>
> Hi,
>
> I'll see if I can improve the conda recipe.
>
> Just one simple question: how does Maker divide the work between worker nodes in mpi mode? Is it supposed to be 1 contig per node, or are the largest contigs split into smaller chunks, each one potentially treated on different nodes? From my tests I have the feeling it is the first answer, but I'm not sure if it's normal or not.
>
> Anthony
>
> On 19/10/2018 at 19:22, Carson Holt wrote:
>> RepeatMasker does some data prep during installation (it creates new files in the process), and that does not happen for the bioconda RepeatMasker recipe. So it's broken.
>>
>> For fixing it, look at the homebrew recipe for RepeatMasker. It does a good job, and they also have it preconfigure itself for the free Dfam database rather than RepBase light ->
>>
>> https://github.com/brewsci/homebrew-bio/blob/master/Formula/repeatmasker.rb
>>
>> te_proteins is not a RepeatMasker file. It's a RepeatRunner file which has been integrated into MAKER. MAKER just needs to be able to find it. It will look in the .../maker/data/ directory by default and put the location in te_protein= by default.
>>
>> --Carson
>>
>>
>>> On Oct 18, 2018, at 7:52 AM, Anthony Bretaudeau wrote:
>>>
>>> Hi,
>>>
>>> I think I finally found a solution for this segfault. In short: run "export THREADS_DAEMON_MODEL=1" before running maker.
>>>
>>> After looking at the debug log, I noticed that the segfault happened the first time the perl system() function was called (usually to launch a "mv" command).
>>>
>>> This, plus the backtrace, shows that it has something to do with signal handling when running child processes from threads.
>>>
>>> After a lot of trial and error modifying the code, I found this page talking about this env var: https://metacpan.org/pod/forks#Co-existance-with-fork-aware-modules-and-environments
>>> It seems to be enough to avoid the segfault. I have no idea if it could have any downside, but maker seems to give the same results as in non-mpi mode.
>>>
>>> Concerning RepeatMasker not being installed correctly, it seems to be intended, as written in the RepeatMasker conda recipe: https://github.com/bioconda/bioconda-recipes/blob/master/recipes/repeatmasker/build.sh#L16
>>> I use the REPEATMASKER_LIB_DIR env var so it's not really a problem for me, and the galaxy tool is doing the same (https://github.com/galaxyproject/tools-iuc/blob/master/tools/maker/maker.xml#L11 ).
>>>
>>> I'm not a RepeatMasker expert, so I don't know if providing the old database would make more sense...
>>>
>>> I guess it's the same question for te_proteins.
>>>
>>> Cheers
>>>
>>> Anthony
>>>
>>> On 05/10/2018 at 22:37, Carson Holt wrote:
>>>> I tried setting this up but there are a number of issues I run into.
>>>>
>>>> First RepeatMasker is not being installed correctly.
The configuration step should create these files (created by ./configure script during RepeatMasker setup) ?> >>>> RepeatMasker.lib >>>> RepeatMasker.lib.nhr >>>> RepeatMasker.lib.nin >>>> RepeatMasker.lib.nsq >>>> RepeatMaskerLib.embl >>>> >>>> But they do not exist in the share directory. >>>> >>>> Also MAKER needs access to the te_proteins file in ?/maker/data, and because you have rearranged maker?s structure it can?t find it. >>>> >>>> >>>> Then for the Segmentation fault, I have seen this a handful of times in the past where users install their own version of perl rather than using the system perl together with their own install of OpenMPI. The issue is some series of flags either in OpenMPi or perl (I?m not sure which). But one way around it is to disable the interpreter threads option when compiling and installing perl for yourself. Most system perl installs have interpreter threads enabled, so I?m not sure why some self-installs generate this segfault and never the system perl. Interestingly interpreter threads are turned off by default when you install perl manually as they are ?officially discouraged". You actually have to enable it during the self-install process, and conda is enabling them on the manual install to match most system perls. >>>> >>>> Another work around is don?t use OpenMPI. Try MPICH3. >>>> >>>> >>>> ?Carson >>>> >>>> >>>> >>>> >>>> >>>>> On Sep 25, 2018, at 6:10 AM, Anthony Bretaudeau > wrote: >>>>> >>>>> Hi, >>>>> >>>>> I've worked on the Bioconda recipe for Maker (https://github.com/bioconda/bioconda-recipes/tree/master/recipes/maker/ ). It works well, except when using it in MPI mode. I get this segfault error: >>>>> >>>>> STATUS: Processing and indexing input FASTA files... 
>>>>> [cl1n022:06306] *** Process received signal *** >>>>> [cl1n022:06306] Signal: Segmentation fault (11) >>>>> [cl1n022:06306] Signal code: Address not mapped (1) >>>>> [cl1n022:06306] Failing at address: 0x514 >>>>> [cl1n022:06306] [ 0] /lib64/libpthread.so.0(+0xf6d0)[0x2b9ce51026d0] >>>>> [cl1n022:06306] [ 1] /local/miniconda3/envs/maker-2.31.10/bin/perl(Perl_csighandler+0x1e)[0x4aad4e] >>>>> [cl1n022:06306] [ 2] /lib64/libpthread.so.0(+0xf6d0)[0x2b9ce51026d0] >>>>> [cl1n022:06306] [ 3] /lib64/libc.so.6(__poll+0x2d)[0x2b9ce5f5cf0d] >>>>> [cl1n022:06306] [ 4] /local/miniconda3/envs/maker-2.31.10/perl/lib/auto/Parallel/Application/MPI/../../../../../../lib/./libopen-pal.so.40(+0x869e5)[0x2b9cf05859e5] >>>>> [cl1n022:06306] [ 5] /local/miniconda3/envs/maker-2.31.10/perl/lib/auto/Parallel/Application/MPI/../../../../../../lib/./libopen-pal.so.40(opal_libevent2022_event_base_loop+0x242)[0x2b9cf057a73a] >>>>> [cl1n022:06306] [ 6] /local/miniconda3/envs/maker-2.31.10/perl/lib/auto/Parallel/Application/MPI/../../../../../../lib/./libopen-pal.so.40(+0x384de)[0x2b9cf05374de] >>>>> [cl1n022:06306] [ 7] /lib64/libpthread.so.0(+0x7e25)[0x2b9ce50fae25] >>>>> [cl1n022:06306] [ 8] /lib64/libc.so.6(clone+0x6d)[0x2b9ce5f67bad] >>>>> [cl1n022:06306] *** End of error message *** >>>>> SIGTERM received >>>>> SIGTERM received >>>>> >>>>> >>>>> As mentioned in older posts, I've tried adding the LD_PRELOAD variable, or running mpirun with the "-mca btl ^openib" option, but it didn't help. >>>>> >>>>> As this happens with the Bioconda package, I guess it should be pretty reproducible on other setups. >>>>> >>>>> Bioconda's Maker package uses version 5.26.2 of Perl and version 3.1.2 of OpenMPI, and the OpenMPI recipe is on https://github.com/conda-forge/openmpi-feedstock/tree/master/recipe >>>>> Any help would be highly appreciated! 
>>>>>
>>>>> Anthony Bretaudeau
>>>>>
>>>>> _______________________________________________
>>>>> maker-devel mailing list
>>>>> maker-devel at box290.bluehost.com
>>>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>>>
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From carsonhh at gmail.com Wed Oct 24 09:50:43 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 24 Oct 2018 09:50:43 -0600
Subject: [maker-devel] EVM control file and est2genome
In-Reply-To:
References:
Message-ID: <3CB0FDB0-8B7D-4CF8-B957-5935166D5305@gmail.com>

est2genome only works with the data given to est=. For the second error, you must provide the path of the evm executable in maker_exe.ctl. It apparently was not in your PATH, so it didn't get filled out automatically.

Here is an example from the wiki of using est2genome and protein2genome to train SNAP for the next MAKER run -> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_WGS_Assembly_and_Annotation_Winter_School_2018#Training_ab_initio_Gene_Predictors

--Carson
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From jacques.dainat at nbis.se Wed Oct 24 02:41:05 2018
From: jacques.dainat at nbis.se (Jacques Dainat)
Date: Wed, 24 Oct 2018 10:41:05 +0200
Subject: [maker-devel] CIGAR string explanation
In-Reply-To: <9B7CB8C1-2272-4E2A-A435-73642920B623@gmail.com>
References: <8783564C-A8FA-419A-A651-EE53C1563A7F@nbis.se> <9B7CB8C1-2272-4E2A-A435-73642920B623@gmail.com>
Message-ID:

Thanks for your response. It's surprising that the link on the Sequence Ontology web site doesn't work anymore. I will notify them. I was surprised that I could not find any resource on the internet describing these values. Helped by your answer, I refined my keywords and googled again, and I finally found old resources describing them:

from 2004 FlyBase here: http://rice.bio.indiana.edu:7082/annot/gff3.html
from 2010 WormBase here: http://wiki.wormbase.org/index.php/GFF3specProposal

I put a copy here of the WormBase description in case those resources also disappear. At that time it seems it was not yet officially accepted by the SO.
/Jacques
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Screen Shot 2018-10-24 at 10.00.41.png
Type: image/png
Size: 281561 bytes
Desc: not available
URL:

From elyssa_garza at yahoo.com Wed Oct 24 15:27:50 2018
From: elyssa_garza at yahoo.com (Elyssa Garza)
Date: Wed, 24 Oct 2018 21:27:50 +0000 (UTC)
Subject: [maker-devel] Is gene retrieval from gff possible?
In-Reply-To: <1576161756.398305.1540414096080@mail.yahoo.com>
References: <8783564C-A8FA-419A-A651-EE53C1563A7F@nbis.se> <1576161756.398305.1540414096080@mail.yahoo.com>
Message-ID: <1888825059.421524.1540416470195@mail.yahoo.com>

Hello,

I recently annotated my plant genome and am looking at retrieving a particular set of genes from the maker results. I have a list of TAIR IDs that I am particularly interested in and was thinking about using the gff file to help pull out the associated transcripts. I was wondering if you could advise me on the best or easiest way of obtaining the associated TAIR accession or gene model from the gff file.

I did try looking at the genes (41,779 genes) using CLCbio, but the accessions were not easily identified or found. I also looked at the protein matches (819,805 protein matches) and was able to easily find gene model matches corresponding to my target accessions. Is it wise to do this?
Can you explain why I can't find these same protein matches in the gene file? I have some ideas on why this is happening but I am looking for support for them.

Elyssa
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From pallavi.gupta at slu.edu Thu Oct 25 15:22:31 2018
From: pallavi.gupta at slu.edu (Pallavi Gupta)
Date: Thu, 25 Oct 2018 21:22:31 +0000
Subject: [maker-devel] Issue with maker
Message-ID:

Hi Team MAKER,

I am using MAKER for genome annotation in my research, but when I run it I get a strange error. I looked for a workaround on various bioinformatics forums without success. I would really appreciate your help with this. I have attached my nohup.out log. Please let me know if you need anything else.

Thanks,
Pallavi Gupta
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: nohup.out
Type: application/octet-stream
Size: 26365432 bytes
Desc: nohup.out
URL:

From 17na34 at queensu.ca Wed Oct 31 08:27:44 2018
From: 17na34 at queensu.ca (Nikolay Alabi)
Date: Wed, 31 Oct 2018 14:27:44 +0000
Subject: [maker-devel] MAKER not running properly after installation, help needed
Message-ID:

Hello,

I am attempting to annotate a garlic mustard genome using MAKER on a cluster at Queen's University. I have been following the tutorial on the wiki and was using the practice data to check that the program runs properly and to learn how to train the gene predictors. MAKER is now installed and works to an extent, but in use it does not run properly and cannot read or annotate a genome.
I suspect two problems are causing this. First, whenever any maker command is called, it reports that an argument in forks.pm in perl5 is not correct; after trying to fix the problem, I see that the code should be correct, but the error line still occurs. Second, every time a maker command is called, another error appears saying there is an error flow occurring somewhere in perl. For instance, when I run maker -h, maker -CTL, or anything else to do with maker, the error lines occur. Would you advise me to reinstall perl and bioperl? Other than that, I believe everything else is properly installed, and I do not understand why the program is not running properly. I have even tried different genome data sets, but the same problem occurs: the run never finishes, retries, and ultimately fails. Please let me know if there is another possible source of the error.

Best regards,
Nikolay
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From xvazquezc at gmail.com Mon Oct 1 18:00:43 2018
From: xvazquezc at gmail.com (Xabier Vázquez-Campos)
Date: Tue, 2 Oct 2018 10:00:43 +1000
Subject: [maker-devel] Help debugging a MAKER result
In-Reply-To:
References:
Message-ID:

Hi Lior,

Without going into a lot of detail, a good model covering the repeats in your genome is extremely important, especially in genomes with a lot of repeats. If the repeat library does not have appropriate coverage, anything based on the masked genome will be affected.

The evidence you pass into Augustus to generate the gene model can have a huge impact. Aside from the repeats, BUSCO-generated gene models can under-predict:
https://groups.google.com/forum/?hl=en-GB#!topic/maker-devel/ocnDG4nq1A8
And we have seen in our lab that the gene models generated by Augustus can be very different if you provide a haploid assembly vs haploid + alternate contigs vs diploid.
In general, a purely haploid assembly generates a less biased model as it has lower number of duplicated conserved genes present, that will unbalance the gene model towards them. (at least in BUSCO-based models, but it should be extensible to any Augustus model) Note that in the end the generated annotation is just a model/hypothesis and may require more than a bit of curation... usually increasing with more complex genomes. Cheers, Xabi On Tue, 2 Oct 2018 at 05:23, Lior Glick wrote: > Hi MAKER users, > I am new to Maker and had just finished running my first annotations. > Although the results make sense in general, I have reasons to suspect some > gene models are wrong and would like your help in understanding and > optimizing the results. > My research project involves the annotation of multiple tomato varieties > (individuals) which are a bit different from the published reference > genome. To this end, I created de-novo assemblies of these genomes and also > generated an evidence set to be used as input for Maker. Evidence consist > of a large set of transcripts from various tomato varieties and conditions, > as well as full protein sets from 6 plant species, including the proteins > derived from the annotation of the reference - called ITAG. > For an initial QA, I tried annotating the reference genome using my > evidence data and Augustus as gene predictor. This should allow me to > compare my result to the ITAG annotation, which I assume to be the > "correct" answer, and see how well I'm doing. I should mention that ITAG > annotation was also created using Maker, followed by manual curation. > I started by comparing the protein sets from my result and the ITAT set. > Specifically, I ran an all-vs-all blast and took the top hits. I discovered > that only about 70% of the ITAG proteins are covered by a protein from my > result with a high quality alignment (evalue > 10e-5, coverage > 90%). 
I > further investigated by running BUSCO on both protein sets and looking at > BUSCOs found in ITAG but missing in my result. Attached is a screenshot > from a genome browser where you can see such a case. Top track is the ITAG > gene model, below is my result. Third track is the protein evidence > alignments (i.e blastx and protein2genome features), and bottom track are > masked repeats. > As you can see, there seems to be two issues with my result: > 1. The two genes in ITAG were fused into one. I guess this is a difficult > case as the genes are really close together. > 2. The last (3') CDS of the ITAG gene was predicted to be the 3' UTR in my > result. This is in fact the reason I ended up with a truncated protein and > a missing BUSCO. > This is a bit surprising to me, since there seems to be quite a lot of > protein evidence supporting this region as a CDS. Can you help me figure > out why is the result so? Could it be due to the small repeats detected in > this region? > Any ideas on how my result can be improved without manual curation? > > Many thanks! > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > -- Xabier V?zquez-Campos, *PhD* *Research Associate* NSW Systems Biology Initiative School of Biotechnology and Biomolecular Sciences The University of New South Wales Sydney NSW 2052 AUSTRALIA -------------- next part -------------- An HTML attachment was scrubbed... URL: From liorglic at mail.tau.ac.il Tue Oct 2 00:50:32 2018 From: liorglic at mail.tau.ac.il (Lior Glick) Date: Tue, 2 Oct 2018 09:50:32 +0300 Subject: [maker-devel] Help debugging a MAKER result In-Reply-To: References: Message-ID: Hi Xabier, and thanks for your reply. I forgot to mention it, but I used the annotated repeats derived from the ITAG annotation as repeats library, so I expect these to be quite appropriate. 
I guess my question is about the way Maker makes decisions: is the fact that some repeats (simple repeats in this case) were predicted enough to change a CDS into a UTR, despite sufficient protein evidence?
I did not train Augustus myself; rather, I used the species (tomato) profile that comes with the Augustus release. Does that make sense?
As for the haploid/diploid issue: fortunately I don't have to deal with that, since cultivated tomato varieties are repeatedly selfed, so they are (almost) completely homozygous.

On Tue, 2 Oct 2018 at 3:01, Xabier Vázquez-Campos wrote:

> Hi Lior,
>
> without getting in a lot of detail a good model covering the repeats in
> your genome is extremely important, specially in genomes with a lot of
> repeats. If the repeat library does not have an appropriate coverage,
> anything based on the masked genome will be affected
>
> The evidence you pass into Augustus to generate the gene model can have a
> huge impact. Aside of the repeats, BUSCO-generated gene models can
> under-predict
> https://groups.google.com/forum/?hl=en-GB#!topic/maker-devel/ocnDG4nq1A8
> And we have seen in our lab that the gene models generated by Augustus can
> be very different if you provide an haploid assembly vs haploid + alternate
> contigs vs diploid. In general, a purely haploid assembly generates a less
> biased model as it has lower number of duplicated conserved genes present,
> that will unbalance the gene model towards them. (at least in BUSCO-based
> models, but it should be extensible to any Augustus model)
>
> Note that in the end the generated annotation is just a model/hypothesis
> and may require more than a bit of curation... usually increasing with more
> complex genomes.
>
> Cheers,
> Xabi
>
> On Tue, 2 Oct 2018 at 05:23, Lior Glick wrote:
>
>> Hi MAKER users,
>> I am new to Maker and had just finished running my first annotations.
>> Although the results make sense in general, I have reasons to suspect some >> gene models are wrong and would like your help in understanding and >> optimizing the results. >> My research project involves the annotation of multiple tomato varieties >> (individuals) which are a bit different from the published reference >> genome. To this end, I created de-novo assemblies of these genomes and also >> generated an evidence set to be used as input for Maker. Evidence consist >> of a large set of transcripts from various tomato varieties and conditions, >> as well as full protein sets from 6 plant species, including the proteins >> derived from the annotation of the reference - called ITAG. >> For an initial QA, I tried annotating the reference genome using my >> evidence data and Augustus as gene predictor. This should allow me to >> compare my result to the ITAG annotation, which I assume to be the >> "correct" answer, and see how well I'm doing. I should mention that ITAG >> annotation was also created using Maker, followed by manual curation. >> I started by comparing the protein sets from my result and the ITAT set. >> Specifically, I ran an all-vs-all blast and took the top hits. I discovered >> that only about 70% of the ITAG proteins are covered by a protein from my >> result with a high quality alignment (evalue > 10e-5, coverage > 90%). I >> further investigated by running BUSCO on both protein sets and looking at >> BUSCOs found in ITAG but missing in my result. Attached is a screenshot >> from a genome browser where you can see such a case. Top track is the ITAG >> gene model, below is my result. Third track is the protein evidence >> alignments (i.e blastx and protein2genome features), and bottom track are >> masked repeats. >> As you can see, there seems to be two issues with my result: >> 1. The two genes in ITAG were fused into one. I guess this is a difficult >> case as the genes are really close together. >> 2. 
The last (3') CDS of the ITAG gene was predicted to be the 3' UTR in
>> my result. This is in fact the reason I ended up with a truncated protein
>> and a missing BUSCO.
>> This is a bit surprising to me, since there seems to be quite a lot of
>> protein evidence supporting this region as a CDS. Can you help me figure
>> out why is the result so? Could it be due to the small repeats detected in
>> this region?
>> Any ideas on how my result can be improved without manual curation?
>>
>> Many thanks!
>> _______________________________________________
>> maker-devel mailing list
>> maker-devel at box290.bluehost.com
>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>
>
>
> --
> Xabier Vázquez-Campos, *PhD*
> *Research Associate*
> NSW Systems Biology Initiative
> School of Biotechnology and Biomolecular Sciences
> The University of New South Wales
> Sydney NSW 2052 AUSTRALIA
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From xvazquezc at gmail.com Tue Oct 2 22:39:40 2018
From: xvazquezc at gmail.com (Xabier Vázquez-Campos)
Date: Wed, 3 Oct 2018 14:39:40 +1000
Subject: [maker-devel] Help debugging a MAKER result
In-Reply-To:
References:
Message-ID:

Yeah, tomato should be rather well annotated.

I would double-check how good the tomato genome was at the time the gene model was created. Also, creating a new Augustus model based on the first prediction run might improve things.

You have tomato in Repbase. To be sure you are not missing anything, I would still run the advanced repeat library protocol, if it isn't computationally prohibitive.

I don't know how good SNAP is for plant genomes, so it could be worth trying on top of the Augustus predictions.

On top of this, I'd take a look at reference-based annotation tools like RATT.
This would annotate all the common regions with the reference and then curate only on the regions that cannot be annotated from the reference using your Maker annotation On Tue, 2 Oct 2018 at 16:50, Lior Glick wrote: > Hi Xabier, and thanks for your reply. > I forgot to mention it, but I used the annotated repeats derived from the > ITAG annotation as repeats library, so I expect these to be quite > appropriate. I guess my question is regarding the way Maker makes > decisions: Is the fact that some repeats (simple repeats in this case) were > predicted is enough to change a CDS into a UTR, despite sufficient protein > evidence? > I did not train Augustus myself, rather I used the species (tomato) > profile that comes with the Augustus release. Does that make sense? > As for the haploid/diploid issue - fortunately I don't have to deal with > that since cultivated tomato varieties are repeatedly selfed, so they are > (almost) completely homozygous. > > ??????? ??? ??, 2 ????? 2018 ?-3:01 ??? ?Xabier V?zquez-Campos?? xvazquezc at gmail.com??>:? > >> Hi Lior, >> >> without getting in a lot of detail a good model covering the repeats in >> your genome is extremely important, specially in genomes with a lot of >> repeats. If the repeat library does not have an appropriate coverage, >> anything based on the masked genome will be affected >> >> The evidence you pass into Augustus to generate the gene model can have a >> huge impact. Aside of the repeats, BUSCO-generated gene models can >> under-predict >> https://groups.google.com/forum/?hl=en-GB#!topic/maker-devel/ocnDG4nq1A8 >> And we have seen in our lab that the gene models generated by Augustus >> can be very different if you provide an haploid assembly vs haploid + >> alternate contigs vs diploid. In general, a purely haploid assembly >> generates a less biased model as it has lower number of duplicated >> conserved genes present, that will unbalance the gene model towards them. 
>> (at least in BUSCO-based models, but it should be extensible to any >> Augustus model) >> >> Note that in the end the generated annotation is just a model/hypothesis >> and may require more than a bit of curation... usually increasing with more >> complex genomes. >> >> Cheers, >> Xabi >> >> On Tue, 2 Oct 2018 at 05:23, Lior Glick wrote: >> >>> Hi MAKER users, >>> I am new to Maker and had just finished running my first annotations. >>> Although the results make sense in general, I have reasons to suspect some >>> gene models are wrong and would like your help in understanding and >>> optimizing the results. >>> My research project involves the annotation of multiple tomato varieties >>> (individuals) which are a bit different from the published reference >>> genome. To this end, I created de-novo assemblies of these genomes and also >>> generated an evidence set to be used as input for Maker. Evidence consist >>> of a large set of transcripts from various tomato varieties and conditions, >>> as well as full protein sets from 6 plant species, including the proteins >>> derived from the annotation of the reference - called ITAG. >>> For an initial QA, I tried annotating the reference genome using my >>> evidence data and Augustus as gene predictor. This should allow me to >>> compare my result to the ITAG annotation, which I assume to be the >>> "correct" answer, and see how well I'm doing. I should mention that ITAG >>> annotation was also created using Maker, followed by manual curation. >>> I started by comparing the protein sets from my result and the ITAT set. >>> Specifically, I ran an all-vs-all blast and took the top hits. I discovered >>> that only about 70% of the ITAG proteins are covered by a protein from my >>> result with a high quality alignment (evalue > 10e-5, coverage > 90%). I >>> further investigated by running BUSCO on both protein sets and looking at >>> BUSCOs found in ITAG but missing in my result. 
Attached is a screenshot >>> from a genome browser where you can see such a case. Top track is the ITAG >>> gene model, below is my result. Third track is the protein evidence >>> alignments (i.e blastx and protein2genome features), and bottom track are >>> masked repeats. >>> As you can see, there seems to be two issues with my result: >>> 1. The two genes in ITAG were fused into one. I guess this is a >>> difficult case as the genes are really close together. >>> 2. The last (3') CDS of the ITAG gene was predicted to be the 3' UTR in >>> my result. This is in fact the reason I ended up with a truncated protein >>> and a missing BUSCO. >>> This is a bit surprising to me, since there seems to be quite a lot of >>> protein evidence supporting this region as a CDS. Can you help me figure >>> out why is the result so? Could it be due to the small repeats detected in >>> this region? >>> Any ideas on how my result can be improved without manual curation? >>> >>> Many thanks! >>> _______________________________________________ >>> maker-devel mailing list >>> maker-devel at box290.bluehost.com >>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org >>> >> >> >> -- >> Xabier V?zquez-Campos, *PhD* >> *Research Associate* >> NSW Systems Biology Initiative >> School of Biotechnology and Biomolecular Sciences >> The University of New South Wales >> Sydney NSW 2052 AUSTRALIA >> > -- Xabier V?zquez-Campos, *PhD* *Research Associate* NSW Systems Biology Initiative School of Biotechnology and Biomolecular Sciences The University of New South Wales Sydney NSW 2052 AUSTRALIA -------------- next part -------------- An HTML attachment was scrubbed... 
URL:

From carsonhh at gmail.com Thu Oct 4 17:52:47 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 4 Oct 2018 17:52:47 -0600
Subject: [maker-devel] Help debugging a MAKER result
In-Reply-To:
References:
Message-ID: <597F5B29-71BF-409D-B5E2-E1D4611953C3@gmail.com>

I'd just like to add info on how MAKER builds predictions. MAKER itself does not generate models. In your case, Augustus produces the models. Augustus will run twice. Once on its own (this will be on a repeat masked version of the assembly), and once again where MAKER provides it with a hints file as part of the command line used to run Augustus. The hints file is generated from the evidence alignments you provided to MAKER. The hints usually get Augustus to perform a little better than it does with training alone on a masked assembly.

Under-masking or overmasking the assembly can both confound Augustus. MAKER hard masks complex repeats in the assembly (turns them from ATCG into N's), and soft-masks simple repeats (turns ATCG into lower case atcg). The lower case "soft-masking" affects BLAST alignment but not Augustus predictions (Augustus ignores it). MAKER also removes the hard-masking when it runs Augustus with the hints file. This is done because we've constrained Augustus to a smaller padded evidence cluster at the locus, and Augustus can no longer see the whole assembly.

If you want to explore how masking affects the models, you can set unmask=0. Then Augustus will run 3 times (one extra run on the unmasked assembly). You can then look at contigs in a browser to see how the masked vs unmasked models compare to each other.

--Carson

> On Oct 2, 2018, at 10:39 PM, Xabier Vázquez-Campos wrote:
>
> Yeah, tomato should be rather well annotated.
>
> I would double check how good was the tomato genome at the time of the creation of the gene model. Also, creating a new Augustus model based on the first prediction run might improve things
>
> You have tomato on repbase.
To be sure you are not missing anything, I would still run the advanced repeat library protocol, if it isn't computationally prohibitive. > > I don't know how good is SNAP for plant genomes, so it could be worth to try on top of the Augustus predictions. > > On top of this, I'd take a look into reference-based annotation tools like RATT. This would annotate all the common regions with the reference and then curate only on the regions that cannot be annotated from the reference using your Maker annotation > > > On Tue, 2 Oct 2018 at 16:50, Lior Glick > wrote: > Hi Xabier, and thanks for your reply. > I forgot to mention it, but I used the annotated repeats derived from the ITAG annotation as repeats library, so I expect these to be quite appropriate. I guess my question is regarding the way Maker makes decisions: Is the fact that some repeats (simple repeats in this case) were predicted is enough to change a CDS into a UTR, despite sufficient protein evidence? > I did not train Augustus myself, rather I used the species (tomato) profile that comes with the Augustus release. Does that make sense? > As for the haploid/diploid issue - fortunately I don't have to deal with that since cultivated tomato varieties are repeatedly selfed, so they are (almost) completely homozygous. > > ??????? ??? ??, 2 ????? 2018 ?-3:01 ??? ?Xabier V?zquez-Campos?? ??>:? > Hi Lior, > > without getting in a lot of detail a good model covering the repeats in your genome is extremely important, specially in genomes with a lot of repeats. If the repeat library does not have an appropriate coverage, anything based on the masked genome will be affected > > The evidence you pass into Augustus to generate the gene model can have a huge impact. 
Aside of the repeats, BUSCO-generated gene models can under-predict > https://groups.google.com/forum/?hl=en-GB#!topic/maker-devel/ocnDG4nq1A8 > And we have seen in our lab that the gene models generated by Augustus can be very different if you provide an haploid assembly vs haploid + alternate contigs vs diploid. In general, a purely haploid assembly generates a less biased model as it has lower number of duplicated conserved genes present, that will unbalance the gene model towards them. (at least in BUSCO-based models, but it should be extensible to any Augustus model) > > Note that in the end the generated annotation is just a model/hypothesis and may require more than a bit of curation... usually increasing with more complex genomes. > > Cheers, > Xabi > > On Tue, 2 Oct 2018 at 05:23, Lior Glick > wrote: > Hi MAKER users, > I am new to Maker and had just finished running my first annotations. Although the results make sense in general, I have reasons to suspect some gene models are wrong and would like your help in understanding and optimizing the results. > My research project involves the annotation of multiple tomato varieties (individuals) which are a bit different from the published reference genome. To this end, I created de-novo assemblies of these genomes and also generated an evidence set to be used as input for Maker. Evidence consist of a large set of transcripts from various tomato varieties and conditions, as well as full protein sets from 6 plant species, including the proteins derived from the annotation of the reference - called ITAG. > For an initial QA, I tried annotating the reference genome using my evidence data and Augustus as gene predictor. This should allow me to compare my result to the ITAG annotation, which I assume to be the "correct" answer, and see how well I'm doing. I should mention that ITAG annotation was also created using Maker, followed by manual curation. 
> I started by comparing the protein sets from my result and the ITAT set. Specifically, I ran an all-vs-all blast and took the top hits. I discovered that only about 70% of the ITAG proteins are covered by a protein from my result with a high quality alignment (evalue > 10e-5, coverage > 90%). I further investigated by running BUSCO on both protein sets and looking at BUSCOs found in ITAG but missing in my result. Attached is a screenshot from a genome browser where you can see such a case. Top track is the ITAG gene model, below is my result. Third track is the protein evidence alignments (i.e blastx and protein2genome features), and bottom track are masked repeats. > As you can see, there seems to be two issues with my result: > 1. The two genes in ITAG were fused into one. I guess this is a difficult case as the genes are really close together. > 2. The last (3') CDS of the ITAG gene was predicted to be the 3' UTR in my result. This is in fact the reason I ended up with a truncated protein and a missing BUSCO. > This is a bit surprising to me, since there seems to be quite a lot of protein evidence supporting this region as a CDS. Can you help me figure out why is the result so? Could it be due to the small repeats detected in this region? > Any ideas on how my result can be improved without manual curation? > > Many thanks! 
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>
>
> --
> Xabier Vázquez-Campos, PhD
> Research Associate
> NSW Systems Biology Initiative
> School of Biotechnology and Biomolecular Sciences
> The University of New South Wales
> Sydney NSW 2052 AUSTRALIA
>
>
> --
> Xabier Vázquez-Campos, PhD
> Research Associate
> NSW Systems Biology Initiative
> School of Biotechnology and Biomolecular Sciences
> The University of New South Wales
> Sydney NSW 2052 AUSTRALIA
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From myandell at genetics.utah.edu Thu Oct 4 18:05:04 2018
From: myandell at genetics.utah.edu (Mark Yandell)
Date: Fri, 5 Oct 2018 00:05:04 +0000
Subject: [maker-devel] Help debugging a MAKER result
In-Reply-To: <597F5B29-71BF-409D-B5E2-E1D4611953C3@gmail.com>
References: <597F5B29-71BF-409D-B5E2-E1D4611953C3@gmail.com>
Message-ID:

Cheers!

From: maker-devel on behalf of Carson Holt
Date: Thursday, October 4, 2018 at 5:52 PM
To: Lior Glick
Cc: Maker Mailing List
Subject: Re: [maker-devel] Help debugging a MAKER result

I'd just like to add info on how MAKER builds predictions. MAKER itself does not generate models. In your case, Augustus produces the models. Augustus will run twice. Once on its own (this will be on a repeat masked version of the assembly), and once again where MAKER provides it with a hints file as part of the command line used to run Augustus. The hints file is generated from the evidence alignments you provided to MAKER. The hints usually get Augustus to perform a little better than it does with training alone on a masked assembly.
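
For readers following the masking discussion in Carson's message in this thread, here is a minimal Python sketch of the difference between hard-masking (bases replaced by N) and soft-masking (bases lower-cased). It is not MAKER's code: the function names, the toy contig, and the repeat coordinates are all assumptions made up for illustration, and intervals are taken to be 0-based and half-open.

```python
def hard_mask(seq, intervals):
    """Replace every base inside the given intervals with 'N'.

    Intervals are hypothetical (start, end) pairs, 0-based and half-open.
    """
    bases = list(seq)
    for start, end in intervals:
        for i in range(max(start, 0), min(end, len(bases))):
            bases[i] = "N"
    return "".join(bases)


def soft_mask(seq, intervals):
    """Lower-case every base inside the given intervals, keeping the sequence."""
    bases = list(seq)
    for start, end in intervals:
        for i in range(max(start, 0), min(end, len(bases))):
            bases[i] = bases[i].lower()
    return "".join(bases)


def masked_fraction(seq):
    """Return (hard, soft) fractions: share of 'N' and of lower-case bases."""
    if not seq:
        return (0.0, 0.0)
    hard = seq.count("N")
    soft = sum(1 for base in seq if base.islower())
    return (hard / len(seq), soft / len(seq))


if __name__ == "__main__":
    contig = "ATCGATCGATCGATCGATCG"   # toy sequence, not real data
    complex_repeats = [(2, 6)]        # assumed complex-repeat hit: hard-masked
    simple_repeats = [(10, 14)]       # assumed simple-repeat hit: soft-masked
    masked = soft_mask(hard_mask(contig, complex_repeats), simple_repeats)
    print(masked)
    print(masked_fraction(masked))
```

Running a helper like `masked_fraction` over a region of interest is one rough way to judge whether a locus may be over-masked before comparing gene models in a browser. A tool that ignores soft-masking still sees the lower-case bases as sequence, while hard-masked N's carry no sequence information at all.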
Under-masking or overmasking the assembly can both confound Augustus. MAKER hard masks complex repeats in the assembly (turns them from ATCG into N's), and soft-masks simple repeats (turns ATCG into lower case atcg). The lower case "soft-masking" affects BLAST alignment but not Augustus predictions (Augustus ignores it). MAKER also removes the hard-masking when it runs Augustus with the hints file. This is done because we've constrained Augustus to a smaller padded evidence cluster at the locus, and Augustus can no longer see the whole assembly.

If you want to explore how masking affects the models, you can set unmask=0. Then Augustus will run 3 times (one extra run on the unmasked assembly). You can then look at contigs in a browser to see how the masked vs unmasked models compare to each other.

--Carson

On Oct 2, 2018, at 10:39 PM, Xabier Vázquez-Campos wrote:

Yeah, tomato should be rather well annotated.

I would double check how good was the tomato genome at the time of the creation of the gene model. Also, creating a new Augustus model based on the first prediction run might improve things

You have tomato on repbase. To be sure you are not missing anything, I would still run the advanced repeat library protocol, if it isn't computationally prohibitive.

I don't know how good is SNAP for plant genomes, so it could be worth to try on top of the Augustus predictions.

On top of this, I'd take a look into reference-based annotation tools like RATT. This would annotate all the common regions with the reference and then curate only on the regions that cannot be annotated from the reference using your Maker annotation

On Tue, 2 Oct 2018 at 16:50, Lior Glick wrote:

Hi Xabier, and thanks for your reply.
I forgot to mention it, but I used the annotated repeats derived from the ITAG annotation as repeats library, so I expect these to be quite appropriate.
I guess my question is about the way MAKER makes decisions: is the fact that some repeats (simple repeats in this case) were predicted enough to change a CDS into a UTR, despite sufficient protein evidence?
I did not train Augustus myself; rather, I used the species (tomato) profile that comes with the Augustus release. Does that make sense?
As for the haploid/diploid issue, fortunately I don't have to deal with that, since cultivated tomato varieties are repeatedly selfed, so they are (almost) completely homozygous.

From carsonhh at gmail.com Thu Oct 4 18:09:58 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 4 Oct 2018 18:09:58 -0600
Subject: [maker-devel] Help debugging a MAKER result
In-Reply-To: <597F5B29-71BF-409D-B5E2-E1D4611953C3@gmail.com>
References: <597F5B29-71BF-409D-B5E2-E1D4611953C3@gmail.com>
Message-ID:

One correction. I meant to say set unmask=1.

--Carson

> On Oct 4, 2018, at 5:52 PM, Carson Holt wrote:
>
> I'd just like to add info on how MAKER builds predictions. MAKER itself does not generate models. In your case, Augustus produces the models. Augustus will run twice: once on its own (this will be on a repeat-masked version of the assembly), and once again where MAKER provides it with a hints file as part of the command line used to run Augustus.
> The hints file is generated from the evidence alignments you provided to MAKER. The hints usually get Augustus to perform a little better than it does with training alone on a masked assembly.
>
> Under-masking or over-masking the assembly can both confound Augustus. MAKER hard-masks complex repeats in the assembly (turns them from ATCG into N's), and soft-masks simple repeats (turns ATCG into lower-case atcg). The lower-case "soft-masking" affects BLAST alignment but not Augustus predictions (Augustus ignores it). MAKER also removes the hard-masking when it runs Augustus with the hints file. This is done because we've constrained Augustus to a smaller padded evidence cluster at the locus, and Augustus can no longer see the whole assembly.
>
> If you want to explore how masking affects the models, you can set unmask=0. Then Augustus will run 3 times (one extra run on the unmasked assembly). You can then look at contigs in a browser to see how the masked vs unmasked models compare to each other.
>
> --Carson
>
>> On Oct 2, 2018, at 10:39 PM, Xabier Vázquez-Campos wrote:
>>
>> Yeah, tomato should be rather well annotated.
>>
>> I would double-check how good the tomato genome was at the time the gene model was created. Also, creating a new Augustus model based on the first prediction run might improve things.
>>
>> You have tomato in Repbase. To be sure you are not missing anything, I would still run the advanced repeat library protocol, if it isn't computationally prohibitive.
>>
>> I don't know how good SNAP is for plant genomes, so it could be worth trying on top of the Augustus predictions.
>>
>> On top of this, I'd take a look at reference-based annotation tools like RATT. This would annotate all the regions in common with the reference, and you would then curate only the regions that cannot be annotated from the reference, using your MAKER annotation.

From liorglic at mail.tau.ac.il Fri Oct 5 00:51:41 2018
From: liorglic at mail.tau.ac.il (Lior Glick)
Date: Fri, 5 Oct 2018 09:51:41 +0300
Subject: [maker-devel] Help debugging a MAKER result
In-Reply-To:
References: <597F5B29-71BF-409D-B5E2-E1D4611953C3@gmail.com>
Message-ID:

Thank you both for your helpful ideas. I'm going to give them a try and see how this affects my results. Will update when I have them.
Cheers indeed.
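
Carson's description of masking earlier in this thread (complex repeats turned into N's, simple repeats turned into lower case) may be easier to picture with a toy example. The following is only a minimal Python sketch of the two transformations, not MAKER code: the function names `hard_mask` and `soft_mask`, the toy sequence, and the 0-based half-open interval convention are all assumptions made for illustration.

```python
def hard_mask(seq, intervals):
    """Replace every base inside the given (start, end) intervals with 'N'.

    Intervals are 0-based and half-open; this convention is an assumption
    of the sketch, not something taken from MAKER.
    """
    chars = list(seq)
    for start, end in intervals:
        for i in range(start, min(end, len(chars))):
            chars[i] = "N"
    return "".join(chars)


def soft_mask(seq, intervals):
    """Lower-case every base inside the given (start, end) intervals."""
    chars = list(seq)
    for start, end in intervals:
        for i in range(start, min(end, len(chars))):
            chars[i] = chars[i].lower()
    return "".join(chars)


toy = "ATCGATCGATCG"
# Pretend positions 2-5 are a complex repeat and 8-11 a simple repeat.
print(hard_mask(toy, [(2, 6)]))   # sequence information is lost under the N's
print(soft_mask(toy, [(8, 12)]))  # sequence is kept, only the case changes
```

The point of the contrast is that a soft-masked region still carries its sequence, so a tool that ignores case sees it unchanged, whereas a hard-masked region is opaque to any predictor until the original sequence is put back.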
From carsonhh at gmail.com Fri Oct 5 14:37:34 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 5 Oct 2018 14:37:34 -0600
Subject: [maker-devel] Segfault with OpenMPI
In-Reply-To:
References:
Message-ID: <8D151E3B-353F-4FD5-94DB-95C1125A8176@gmail.com>

I tried setting this up, but there are a number of issues I ran into.

First, RepeatMasker is not being installed correctly. The configuration step should create these files (created by the ./configure script during RepeatMasker setup) -->

RepeatMasker.lib
RepeatMasker.lib.nhr
RepeatMasker.lib.nin
RepeatMasker.lib.nsq
RepeatMaskerLib.embl

But they do not exist in the share directory.

Also, MAKER needs access to the te_proteins file in .../maker/data, and because you have rearranged MAKER's structure it can't find it.

Then for the segmentation fault: I have seen this a handful of times in the past where users install their own version of perl rather than using the system perl together with their own install of OpenMPI. The issue is some series of flags either in OpenMPI or perl (I'm not sure which). But one way around it is to disable the interpreter threads option when compiling and installing perl for yourself. Most system perl installs have interpreter threads enabled, so I'm not sure why some self-installs generate this segfault and never the system perl. Interestingly, interpreter threads are turned off by default when you install perl manually, as they are "officially discouraged". You actually have to enable them during the self-install process, and conda is enabling them on the manual install to match most system perls.

Another workaround is not to use OpenMPI. Try MPICH3.

--Carson

> On Sep 25, 2018, at 6:10 AM, Anthony Bretaudeau wrote:
>
> Hi,
>
> I've worked on the Bioconda recipe for Maker (https://github.com/bioconda/bioconda-recipes/tree/master/recipes/maker/ ). It works well, except when using it in MPI mode. I get this segfault error:
>
> STATUS: Processing and indexing input FASTA files...
> [cl1n022:06306] *** Process received signal ***
> [cl1n022:06306] Signal: Segmentation fault (11)
> [cl1n022:06306] Signal code: Address not mapped (1)
> [cl1n022:06306] Failing at address: 0x514
> [cl1n022:06306] [ 0] /lib64/libpthread.so.0(+0xf6d0)[0x2b9ce51026d0]
> [cl1n022:06306] [ 1] /local/miniconda3/envs/maker-2.31.10/bin/perl(Perl_csighandler+0x1e)[0x4aad4e]
> [cl1n022:06306] [ 2] /lib64/libpthread.so.0(+0xf6d0)[0x2b9ce51026d0]
> [cl1n022:06306] [ 3] /lib64/libc.so.6(__poll+0x2d)[0x2b9ce5f5cf0d]
> [cl1n022:06306] [ 4] /local/miniconda3/envs/maker-2.31.10/perl/lib/auto/Parallel/Application/MPI/../../../../../../lib/./libopen-pal.so.40(+0x869e5)[0x2b9cf05859e5]
> [cl1n022:06306] [ 5] /local/miniconda3/envs/maker-2.31.10/perl/lib/auto/Parallel/Application/MPI/../../../../../../lib/./libopen-pal.so.40(opal_libevent2022_event_base_loop+0x242)[0x2b9cf057a73a]
> [cl1n022:06306] [ 6] /local/miniconda3/envs/maker-2.31.10/perl/lib/auto/Parallel/Application/MPI/../../../../../../lib/./libopen-pal.so.40(+0x384de)[0x2b9cf05374de]
> [cl1n022:06306] [ 7] /lib64/libpthread.so.0(+0x7e25)[0x2b9ce50fae25]
> [cl1n022:06306] [ 8] /lib64/libc.so.6(clone+0x6d)[0x2b9ce5f67bad]
> [cl1n022:06306] *** End of error message ***
> SIGTERM received
> SIGTERM received
>
> As mentioned in older posts, I've tried adding the LD_PRELOAD variable, or running mpirun with the "-mca btl ^openib" option, but it didn't help.
>
> As this happens with the Bioconda package, I guess it should be pretty reproducible on other setups.
>
> Bioconda's Maker package uses version 5.26.2 of Perl and version 3.1.2 of OpenMPI, and the OpenMPI recipe is on https://github.com/conda-forge/openmpi-feedstock/tree/master/recipe
>
> Any help would be highly appreciated!
>
> Anthony Bretaudeau

From carson.holt at genetics.utah.edu Mon Oct 8 10:34:22 2018
From: carson.holt at genetics.utah.edu (Carson Hinton Holt)
Date: Mon, 8 Oct 2018 16:34:22 +0000
Subject: [maker-devel] maker problem
In-Reply-To:
References: <189553EC-9D1F-4C2F-8672-3562C8A4A088@oregonstate.edu> <177AD833-BA97-40CD-B500-1AD4531DE41A@gmail.com>
Message-ID: <258E6D0D-6A34-42E2-91F3-F7693ED42E7C@genetics..utah.edu>

GFF3 should have the assembly fasta at the bottom. That is part of the format. Please familiarize yourself with GFF3 here --> https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md

Particularly look at the different kinds of expected features (for example, gene/mRNA/exon/CDS gene models vs match/match_part evidence alignments).

Also, you need to familiarize yourself with the MAKER documentation, and perhaps follow one of the step-by-step tutorials in the MAKER wiki (http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Main_Page). The 2014 tutorial has a video you can follow along with.
Output files are described in the documentation and the wiki. Particularly look at the necessary gff3_merge and fasta_merge scripts, described in the wiki with multiple examples. Individual contigs will have results like so -->

contig-dpp-500-500.gff
contig-dpp-500-500.maker.proteins.fasta
contig-dpp-500-500.maker.transcripts.fasta

The merge scripts will collect all the individual contig results into merged files.

Example datasets for all of the wiki tutorials are included in the .../maker/data directory as well as the .../maker/MWAS/data/ directory (you can use them to follow along with the wiki pages).

If you follow the tutorial steps for training SNAP on a new genome and you get empty training files, then the issue is the evidence training sets you gave (example from the e-mail list archive) --> https://groups.google.com/forum/#!searchin/maker-devel/maker2zff%7Csort:date/maker-devel/TculOM5oxl4/UWENIGN7EQAJ

You can also browse through the archive for more info on training SNAP and Augustus.

--Carson

On Oct 8, 2018, at 10:12 AM, Gupta, Parul wrote:

Hi Carson,

As per your suggestion, I turned on est2genome=1 and protein2genome=1, but similar results are generated. The gff of each scaffold has fasta (transcripts) sequence at the end, instead of transcripts.fasta and protein.fasta being generated separately. I don't know how to use such gffs for further processing such as training SNAP (for gene prediction). I need your suggestions.

Is there an option to provide trained data from Augustus (generated by standalone Augustus rather than by MAKER) instead of the Augustus species in maker_opts.ctl?

Thanks,
Parul

On Oct 4, 2018, at 6:43 PM, Gupta, Parul wrote:

Thank you Carson.

Sent from my iPad

On Oct 4, 2018, at 3:11 PM, Carson Holt wrote:

You must turn on at least 1 prediction method. It can be est2genome=1, protein2genome=1, or a species file to run SNAP/Augustus. The first two options are for building models to train with.

If you don't provide a prediction method, MAKER will align evidence, but you won't get any gene models.

Example: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_WGS_Assembly_and_Annotation_Winter_School_2018#Training_ab_initio_Gene_Predictors

--Carson

On Oct 1, 2018, at 1:05 PM, Gupta, Parul wrote:

Hi Carson,

I am a new user of the MAKER pipeline and wanted to get gene predictions for a new plant genome. I used the following options in the maker_opts.ctl file for the first round:

genome=masked_genome.fasta
est=transcripts.fasta (from the same species for which the genome fasta is provided)
altest=transcripts.fasta (from an alternative organism)
protein=proteins.fasta

Output files are only gff (no fasta); however, the gff for each scaffold has fasta sequences at the bottom. I wonder, is that the correct output I am getting?

In order to train SNAP, I used gff3_merge to concatenate all gffs from datastore_index.log to get all.gff (which also has fasta sequences). Then all.gff was used for maker2zff, and it generated zero-size files (genome.ann and genome.dna). I am wondering whether I made a mistake or did not provide all input files.

For repeat masking I used RepeatMasker separately from the MAKER pipeline. My datastore_index.log file shows many "RETRY" and "FAILED" scaffolds.

FYI, I subscribed to the "maker-devel" Google group but the "new topic" button is greyed out.

Your suggestions? Thanks in advance.

Parul
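
Carson's point from Oct 4 (at least one prediction method must be switched on, otherwise MAKER only aligns evidence) can be checked mechanically before a run. The following is a minimal Python sketch under stated assumptions, not part of MAKER: `parse_ctl` and `enabled_predictors` are hypothetical helper names, the parser only understands simple `key=value #comment` lines, and the option names `snaphmm` and `augustus_species` are my understanding of how maker_opts.ctl names the SNAP and Augustus settings, so verify them against your own control file.

```python
def parse_ctl(text):
    """Parse simple 'key=value #comment' lines into a dict (sketch only)."""
    opts = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        opts[key.strip()] = value.strip()
    return opts


def enabled_predictors(opts):
    """List the prediction-related options that are switched on."""
    on = []
    # Evidence-based model building is enabled with a literal 1.
    for key in ("est2genome", "protein2genome"):
        if opts.get(key) == "1":
            on.append(key)
    # Ab initio predictors are enabled by giving them a file or species
    # name (option names assumed, see the note above).
    for key in ("snaphmm", "augustus_species"):
        if opts.get(key):
            on.append(key)
    return on


first_round = """
genome=masked_genome.fasta
est=transcripts.fasta
protein=proteins.fasta
est2genome=0
protein2genome=0
snaphmm=
augustus_species=
"""

second_round = first_round.replace("est2genome=0", "est2genome=1 #build models from ESTs")

print(enabled_predictors(parse_ctl(first_round)))   # nothing enabled: evidence only
print(enabled_predictors(parse_ctl(second_round)))  # est2genome enabled
```

An empty list for the first-round settings corresponds to the situation described in the thread: evidence alignments in the GFF3, but no gene models to merge or to train SNAP with.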
URL: From Parul.Gupta at oregonstate.edu Mon Oct 8 11:31:04 2018 From: Parul.Gupta at oregonstate.edu (Gupta, Parul) Date: Mon, 8 Oct 2018 17:31:04 +0000 Subject: [maker-devel] maker problem In-Reply-To: <258E6D0D-6A34-42E2-91F3-F7693ED42E7C@genetics..utah.edu> References: <189553EC-9D1F-4C2F-8672-3562C8A4A088@oregonstate.edu> <177AD833-BA97-40CD-B500-1AD4531DE41A@gmail.com> <258E6D0D-6A34-42E2-91F3-F7693ED42E7C@genetics..utah.edu> Message-ID: <90D768D2-F911-4BA3-A8C5-1DAE79566114@oregonstate.edu> Alright, I had gone through all those tutorials. But my question is - why maker generating only gff as an output ? there is neither transcripts.fasta nor proteins.fasta in output directory. So I can only use gff3_merge but not fasta_merge because there is no fasta files. This happened to all scaffolds. Below is the example of my datastore_index.log file for that scaffold : ScJhAqd_1;HRSCAF=2 Sh_masked_rd2_datastore/18/62/ScJhAqd_1%3BHRSCAF=2/ STARTED ScJhAqd_1;HRSCAF=2 Sh_masked_rd2_datastore/18/62/ScJhAqd_1%3BHRSCAF=2/ FINISHED Output directory of that scaffold looks like: [Linux at waterman ScJhAqd_1%3BHRSCAF=2]$ ll total 160 drwxr-xr-x 3 guptapa pi 3 Oct 5 15:51 ../ -rw-r--r-- 1 guptapa pi 27740 Oct 5 15:51 run.log -rw-r--r-- 1 guptapa pi 34268 Oct 5 15:51 ScJhAqd_1%3BHRSCAF=2.gff drwxr-xr-x 2 guptapa pi 75 Oct 5 15:51 theVoid.ScJhAqd_1%3BHRSCAF=2/ drwxr-xr-x 3 guptapa pi 5 Oct 5 15:51 ./ gff looks like: Linux at waterman ScJhAqd_1%3BHRSCAF=2]$ head ScJhAqd_1%3BHRSCAF=2.gff ##gff-version 3 ScJhAqd_1%3BHRSCAF%3D2 . contig 1 2578 . . . ID=ScJhAqd_1%3BHRSCAF%3D2;Name=ScJhAqd_1%3BHRSCAF%3D2; ScJhAqd_1%3BHRSCAF%3D2 blastx protein_match 1782 2024 175 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hit:0;Name=Mlong585_29391-RA; ScJhAqd_1%3BHRSCAF%3D2 blastx match_part 1782 2024 175 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hsp:0;Parent=ScJhAqd_1%3BHRSCAF%3D2:hit:0;Name=Mlong585_29391-RA;Target=Mlong585_29391-RA 132 212;Gap=M81; ScJhAqd_1%3BHRSCAF%3D2 blastx protein_match 1785 2578 175 - . 
ID=ScJhAqd_1%3BHRSCAF%3D2:hit:1;Name=Mlong585_37101-RA; ScJhAqd_1%3BHRSCAF%3D2 blastx match_part 2477 2578 112 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hsp:1;Parent=ScJhAqd_1%3BHRSCAF%3D2:hit:1;Name=Mlong585_37101-RA;Target=Mlong585_37101-RA 28 61;Gap=M34; ScJhAqd_1%3BHRSCAF%3D2 blastx match_part 1785 2042 175 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hsp:2;Parent=ScJhAqd_1%3BHRSCAF%3D2:hit:1;Name=Mlong585_37101-RA;Target=Mlong585_37101-RA 154 239;Gap=M86; ScJhAqd_1%3BHRSCAF%3D2 blastx protein_match 1806 2578 128 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hit:2;Name=Mlong585_11451-RA; ScJhAqd_1%3BHRSCAF%3D2 blastx match_part 2471 2578 132 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hsp:3;Parent=ScJhAqd_1%3BHRSCAF%3D2:hit:2;Name=Mlong585_11451-RA;Target=Mlong585_11451-RA 117 152;Gap=M36; ScJhAqd_1%3BHRSCAF%3D2 blastx match_part 2299 2379 89 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hsp:4;Parent=ScJhAqd_1%3BHRSCAF%3D2:hit:2;Name=Mlong585_11451-RA;Target=Mlong585_11451-RA 153 179;Gap=M27; Regards, Parul On Oct 8, 2018, at 11:34 AM, Carson Hinton Holt > wrote: GFF3 should have the assembly fasta at the bottom. That is part of the format. Please familiarize yourself with GFF3 here ?> https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md Particularly look at the different kinds of expected features (example gene/mRNA/exon/CDS gene models vs match/match_part evidence alignments). Also you need to familiarize yourself with the MAKER documentation, and perhaps follow one of the step by step tutorials in the MAKER wiki (http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Main_Page). The 2014 tutorial has a video you can follow along with. Output files are described in the documentation and the wiki. Particularly look at the necessary gff3_merge and fasta_merge scripts described in the wiki with multiple examples. 
Individual contigs will have results like so ?> contig-dpp-500-500.gff contig-dpp-500-500.maker.proteins.fasta contig-dpp-500-500.maker.transcripts.fasta The merge scripts will collect all the individual contig results of into merged files. Example datasets for all of the wiki tutorials are included in the ?/maker/data directory as well as the .../maker/MWAS/data/ directory (you can use them to follow along with the wiki pages). If you follow the tutorial steps from training snap on a new genome and you get empty training files, then the issue is the evidence training sets you gave (example from the e-mail list archive) ?> https://groups.google.com/forum/#!searchin/maker-devel/maker2zff%7Csort:date/maker-devel/TculOM5oxl4/UWENIGN7EQAJ You can also browse through the archive for more info on training SNAP and Augustus. ?Carson On Oct 8, 2018, at 10:12 AM, Gupta, Parul > wrote: Hi Carson, As per your suggestion, I turned on the est2genome=1 and protein2genome=1 but similar result are generated. gff of each scaffold has fasta (transcripts) sequence at the end instead of generating transcripts.fasta and protein.fasta separately. I don?t know how to use such gffs for further processing as training SNAP (for gene prediction). Need you suggestion. Is there option to provided trained data from Augustus (generated from Augustus standalone rather from maker) instead of Augustus species in maker_opts.ctl ? Thanks, Parul On Oct 4, 2018, at 6:43 PM, Gupta, Parul > wrote: Thank you Carson. Sent from my iPad On Oct 4, 2018, at 3:11 PM, Carson Holt > wrote: You must turn on at least 1 prediction method. It can est2genome-1, protein2genome=1, or a species file to run SNAP/Augustus. The first two option are for building models to train with. If you don?t provide a prediction method, MAKER will align evidence, but you won?t get any gene models. 
Example: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_WGS_Assembly_and_Annotation_Winter_School_2018#Training_ab_initio_Gene_Predictors

--Carson

On Oct 1, 2018, at 1:05 PM, Gupta, Parul wrote:

Hi Carson,

I am a new user of the MAKER pipeline and want to get gene predictions for a new plant genome. I used the following options in the maker_opts.ctl file for the first round:

genome=masked_genome.fasta
est=transcripts.fasta (from same species for which genome fasta is provided)
altest=transcripts.fasta (from alternative organism)
protein=proteins.fasta

Output files are only gff (no fasta); however, the gff for each scaffold has fasta sequences at the bottom. I wonder, is that the correct output? In order to train SNAP, I used gff3_merge to concatenate all gffs from datastore_index.log to get all.gff (which also has fasta sequences). Then all.gff was used for maker2zff, and it generated zero-size files (genome.ann and genome.dna). I am wondering whether I made a mistake or did not provide all input files. For repeat masking I used RepeatMasker separately from the MAKER pipeline. My datastore_index.log file shows many "RETRY" and "FAILED" scaffolds.

FYI, I subscribed to the "maker-devel" Google group but the "new topic" button is greyed out.

Your suggestions? Thanks in advance.

Parul

From carson.holt at genetics.utah.edu Mon Oct 8 11:45:31 2018
From: carson.holt at genetics.utah.edu (Carson Hinton Holt)
Date: Mon, 8 Oct 2018 17:45:31 +0000
Subject: [maker-devel] maker problem
In-Reply-To: <90D768D2-F911-4BA3-A8C5-1DAE79566114@oregonstate.edu>
References: <189553EC-9D1F-4C2F-8672-3562C8A4A088@oregonstate.edu> <177AD833-BA97-40CD-B500-1AD4531DE41A@gmail.com> <258E6D0D-6A34-42E2-91F3-F7693ED42E7C@genetics..utah.edu> <90D768D2-F911-4BA3-A8C5-1DAE79566114@oregonstate.edu>
Message-ID:

Look at the GFF3, particularly gene/mRNA/exon/CDS vs match/match_part features (GFF3 spec).
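One way to act on this advice is to tally the feature types in column 3 of the GFF3. The sketch below is illustrative only and rests on stated assumptions: the file is tab-delimited (in this archive the tabs appear flattened to spaces), feature lines have nine columns, and an embedded sequence section starts at a ##FASTA line. The names feature_counts and has_gene_models, and the sample records, are hypothetical.

```python
from collections import Counter

# Feature types that indicate gene models, as opposed to evidence alignments.
MODEL_TYPES = {"gene", "mRNA", "exon", "CDS"}


def feature_counts(gff_lines):
    """Count column-3 feature types, stopping at the ##FASTA section."""
    counts = Counter()
    for line in gff_lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break                      # the rest is embedded sequence
        if not line or line.startswith("#"):
            continue                   # skip blank lines and directives
        cols = line.split("\t")
        if len(cols) >= 9:
            counts[cols[2]] += 1
    return counts


def has_gene_models(counts):
    """True if any gene/mRNA/exon/CDS feature was counted."""
    return any(counts[t] > 0 for t in MODEL_TYPES)


# Made-up records in the shape of the evidence-only output shown in this thread.
sample = [
    "##gff-version 3\n",
    "ctg1\t.\tcontig\t1\t2578\t.\t.\t.\tID=ctg1\n",
    "ctg1\tblastx\tprotein_match\t1782\t2024\t175\t-\t.\tID=ctg1:hit:0\n",
    "ctg1\tblastx\tmatch_part\t1782\t2024\t175\t-\t.\tID=ctg1:hsp:0;Parent=ctg1:hit:0\n",
    "##FASTA\n",
    ">ctg1\n",
    "ACGT\n",
]

counts = feature_counts(sample)
print(dict(counts))
print("gene models present:", has_gene_models(counts))
```

For a merged file, an open file handle can be passed directly, for example feature_counts(open("all.gff")). A file that yields only contig, protein_match and match_part counts holds evidence alignments but no gene models.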
Does your GFF3 contain gene/mRNA/exon/CDS entries? If not, then your GFF3 has no models (it's empty even if it does contain match/match_part entries). This means either 1. no predictor was set during the run (i.e. est2genome=1 or protein2genome=1 not set) or 2. evidence alignments or assembly are so poor that no models can be made.

Look at the results in a browser. Compare what you see on one of your contigs to what you get when running an example from the tutorials. Perhaps you provided unassembled mRNA-seq data (MAKER does not process raw mRNA-seq; it must be assembled first). Perhaps you did not provide a broad protein dataset (UniProt/Swiss-Prot is usually a good one to use, for example). Or perhaps your assembly is too fragmented and has too many runs of NNNNNN to generate matching ORFs against evidence alignments (look at results in a browser).

--Carson

On Oct 8, 2018, at 11:31 AM, Gupta, Parul wrote:

Alright, I had gone through all those tutorials. But my question is: why is MAKER generating only gff as output? There is neither transcripts.fasta nor proteins.fasta in the output directory. So I can only use gff3_merge but not fasta_merge, because there are no fasta files. This happened to all scaffolds. Below is an example from my datastore_index.log file for that scaffold:

ScJhAqd_1;HRSCAF=2 Sh_masked_rd2_datastore/18/62/ScJhAqd_1%3BHRSCAF=2/ STARTED
ScJhAqd_1;HRSCAF=2 Sh_masked_rd2_datastore/18/62/ScJhAqd_1%3BHRSCAF=2/ FINISHED

Output directory of that scaffold looks like:

[Linux at waterman ScJhAqd_1%3BHRSCAF=2]$ ll
total 160
drwxr-xr-x 3 guptapa pi 3 Oct 5 15:51 ../
-rw-r--r-- 1 guptapa pi 27740 Oct 5 15:51 run.log
-rw-r--r-- 1 guptapa pi 34268 Oct 5 15:51 ScJhAqd_1%3BHRSCAF=2.gff
drwxr-xr-x 2 guptapa pi 75 Oct 5 15:51 theVoid.ScJhAqd_1%3BHRSCAF=2/
drwxr-xr-x 3 guptapa pi 5 Oct 5 15:51 ./

gff looks like:

[Linux at waterman ScJhAqd_1%3BHRSCAF=2]$ head ScJhAqd_1%3BHRSCAF=2.gff
##gff-version 3
ScJhAqd_1%3BHRSCAF%3D2 . contig 1 2578 . . . ID=ScJhAqd_1%3BHRSCAF%3D2;Name=ScJhAqd_1%3BHRSCAF%3D2;
ScJhAqd_1%3BHRSCAF%3D2 blastx protein_match 1782 2024 175 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hit:0;Name=Mlong585_29391-RA;
ScJhAqd_1%3BHRSCAF%3D2 blastx match_part 1782 2024 175 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hsp:0;Parent=ScJhAqd_1%3BHRSCAF%3D2:hit:0;Name=Mlong585_29391-RA;Target=Mlong585_29391-RA 132 212;Gap=M81;
ScJhAqd_1%3BHRSCAF%3D2 blastx protein_match 1785 2578 175 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hit:1;Name=Mlong585_37101-RA;
ScJhAqd_1%3BHRSCAF%3D2 blastx match_part 2477 2578 112 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hsp:1;Parent=ScJhAqd_1%3BHRSCAF%3D2:hit:1;Name=Mlong585_37101-RA;Target=Mlong585_37101-RA 28 61;Gap=M34;
ScJhAqd_1%3BHRSCAF%3D2 blastx match_part 1785 2042 175 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hsp:2;Parent=ScJhAqd_1%3BHRSCAF%3D2:hit:1;Name=Mlong585_37101-RA;Target=Mlong585_37101-RA 154 239;Gap=M86;
ScJhAqd_1%3BHRSCAF%3D2 blastx protein_match 1806 2578 128 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hit:2;Name=Mlong585_11451-RA;
ScJhAqd_1%3BHRSCAF%3D2 blastx match_part 2471 2578 132 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hsp:3;Parent=ScJhAqd_1%3BHRSCAF%3D2:hit:2;Name=Mlong585_11451-RA;Target=Mlong585_11451-RA 117 152;Gap=M36;
ScJhAqd_1%3BHRSCAF%3D2 blastx match_part 2299 2379 89 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hsp:4;Parent=ScJhAqd_1%3BHRSCAF%3D2:hit:2;Name=Mlong585_11451-RA;Target=Mlong585_11451-RA 153 179;Gap=M27;

Regards,
Parul

From carson.holt at genetics.utah.edu Mon Oct 8 12:08:49 2018
From: carson.holt at genetics.utah.edu (Carson Hinton Holt)
Date: Mon, 8 Oct 2018 18:08:49 +0000
Subject: [maker-devel] maker problem
In-Reply-To:
References: <189553EC-9D1F-4C2F-8672-3562C8A4A088@oregonstate.edu> <177AD833-BA97-40CD-B500-1AD4531DE41A@gmail.com> <258E6D0D-6A34-42E2-91F3-F7693ED42E7C@genetics..utah.edu> <90D768D2-F911-4BA3-A8C5-1DAE79566114@oregonstate.edu>
Message-ID:

Also run BUSCO on your assembly.
It will give you an estimate of how complete/incomplete your genome assembly is. Also make sure you are running on a genome assembly and not a transcriptome assembly (MAKER does not annotate transcriptomes).

--Carson

From Parul.Gupta at oregonstate.edu Mon Oct 8 13:12:27 2018
From: Parul.Gupta at oregonstate.edu (Gupta, Parul)
Date: Mon, 8 Oct 2018 19:12:27 +0000
Subject: [maker-devel] maker problem
In-Reply-To:
References: <189553EC-9D1F-4C2F-8672-3562C8A4A088@oregonstate.edu> <177AD833-BA97-40CD-B500-1AD4531DE41A@gmail.com> <258E6D0D-6A34-42E2-91F3-F7693ED42E7C@genetics..utah.edu> <90D768D2-F911-4BA3-A8C5-1DAE79566114@oregonstate.edu>
Message-ID:

ok, let me explain my case.

Genome- eukaryote
We had run BUSCO and there is no problem in genome assembly. I used RepeatMasker (separately from the MAKER pipeline) for masking the repeats using a custom-generated library (de novo repeats and a repeat library from other species as well). The masked genome was used as input in maker_opts.ctl.

Transcripts-
We have RNA-Seq data assembled using velvet/oases from the same species as the sequenced genome. I globally aligned transcripts over the assembled genome using GMAP, which gave ~99% mapping. The GFF3 generated from GMAP was also checked in a genome browser. Those transcripts were used as est input in maker_opts.ctl. These assembled transcripts may have redundancy.

Proteins-
I used protein (fasta seq) sequences downloaded from UniProt for 5 closely related species and one from an in-house sequenced genome (already published). Protein sequences from all 6 organisms are concatenated in one file and used as protein evidence in maker_opts.ctl.

altest=transcripts.fasta (from in-house sequenced genome (already published))
est2genome=1
protein2genome=1

Sorry for not explaining my case initially.
What other files can I use as est evidence? Can I use Augustus-generated hints for gene prediction along with the above options? Your thoughts?

Parul

From carsonhh at gmail.com Mon Oct 8 14:11:26 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 8 Oct 2018 14:11:26 -0600
Subject: [maker-devel] maker problem
In-Reply-To:
References: <189553EC-9D1F-4C2F-8672-3562C8A4A088@oregonstate.edu> <177AD833-BA97-40CD-B500-1AD4531DE41A@gmail.com> <258E6D0D-6A34-42E2-91F3-F7693ED42E7C@genetics..utah.edu> <90D768D2-F911-4BA3-A8C5-1DAE79566114@oregonstate.edu>
Message-ID: <41ABA575-D58A-4FE0-83CD-9312617AA635@gmail.com>

> We had run BUSCO and there is no problem in genome assembly. I used RepeatMasker (separately from the MAKER pipeline) for masking the repeats using a custom-generated library (de novo repeats and a repeat library from other species as well). The masked genome was used as input in maker_opts.ctl.

Let MAKER run masking if possible. Also, BUSCO can be used to train Augustus, which can then become the gene predictor in MAKER.

> Transcripts-
> We have RNA-Seq data assembled using velvet/oases from the same species as the sequenced genome. I globally aligned transcripts over the assembled genome using GMAP, which gave ~99% mapping. The GFF3 generated from GMAP was also checked in a genome browser. Those transcripts were used as est input in maker_opts.ctl. These assembled transcripts may have redundancy.

est2genome doesn't work with est_gff. You must provide fasta of assembled transcripts. You can revert back to the GFF3 if you want after training.

> Proteins-
> I used protein (fasta seq) sequences downloaded from UniProt for 5 closely related species and one from an in-house sequenced genome (already published).
> Protein sequences from all 6 organisms are concatenated in one file and used as protein evidence in maker_opts.ctl.

Look at the contigs in a browser. Find a contig with protein2genome results in the GFF3 (i.e. the source column is marked protein2genome in the GFF3), and look at it specifically. If you don't find any, then the issue is either your pre-masking or the evidence proteins you gave. I'd recommend using UniProt/Swiss-Prot, which contains a broad set of curated and conserved proteins.

> altest=transcripts.fasta (from in-house sequenced genome (already published))

These will be ignored until you have a trained HMM (this type of alignment can only be used as hints to the trained predictor).

--Carson

From liorglic at mail.tau.ac.il Wed Oct 17 08:27:06 2018
From: liorglic at mail.tau.ac.il (Lior Glick)
Date: Wed, 17 Oct 2018 17:27:06 +0300
Subject: [maker-devel] Problem compiling MAKER with Intel MPI
Message-ID:

Hello,

I am trying to compile MAKER with Intel MPI. We are using a cluster based on Intel x86_64 architecture and using lmod for environment variables. All required dependencies have already been installed, and the initial 'perl Build.PL' passes without issues (see attached).

When running './Build install', it always fails to find 'sys/types.h' and exits (see additional attachment). The Build command probably searches for the '/usr/include/sys/types.h' file, but no matter which variable (INCLUDE, PERL5LIB, etc.) I update with the required path (either '/usr/include' or '/usr/include/sys'), it keeps failing.

I would appreciate your input. Thanks a lot!

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Build.PL.out
Type: application/octet-stream
Size: 2033 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Build_install.out
Type: application/octet-stream
Size: 6313 bytes
Desc: not available
URL:

From anthony.bretaudeau at inria.fr Thu Oct 18 07:52:03 2018
From: anthony.bretaudeau at inria.fr (Anthony Bretaudeau)
Date: Thu, 18 Oct 2018 15:52:03 +0200
Subject: [maker-devel] Segfault with OpenMPI
In-Reply-To: <8D151E3B-353F-4FD5-94DB-95C1125A8176@gmail.com>
References: <8D151E3B-353F-4FD5-94DB-95C1125A8176@gmail.com>
Message-ID:

An HTML attachment was scrubbed...
URL:

From Parul.Gupta at oregonstate.edu Mon Oct 8 14:40:06 2018
From: Parul.Gupta at oregonstate.edu (Gupta, Parul)
Date: Mon, 8 Oct 2018 20:40:06 +0000
Subject: [maker-devel] maker problem
In-Reply-To: <41ABA575-D58A-4FE0-83CD-9312617AA635@gmail.com>
References: <189553EC-9D1F-4C2F-8672-3562C8A4A088@oregonstate.edu> <177AD833-BA97-40CD-B500-1AD4531DE41A@gmail.com> <258E6D0D-6A34-42E2-91F3-F7693ED42E7C@genetics..utah.edu> <90D768D2-F911-4BA3-A8C5-1DAE79566114@oregonstate.edu> <41ABA575-D58A-4FE0-83CD-9312617AA635@gmail.com>
Message-ID: <20878280-1B0C-4CC5-BD92-20FB57A44662@oregonstate.edu>

I used Augustus to generate a training set (separately from MAKER) based on transcripts (fasta), so how can I use that Augustus-generated trained data (hints in gff3 format) in MAKER for gene prediction? I can see only the Augustus species option in maker_opts.ctl. Which option do I need to turn on in opts.ctl to supply the Augustus-generated hints file? I have augustus.gff as predicted hints.

> est2genome doesn't work with est_gff. You must provide fasta of assembled transcripts. You can revert back to the GFF3 if you want after training.

I used est_fasta, not est_gff.

> Find a contig with protein2genome results in the GFF3

Yes, I can see protein2genome results in the gff3:

ScJhAqd_2184%3BHRSCAF%3D3164 protein2genome protein_match 31566 32621 1426 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446673;Name=Mlong585_07911-RA;
ScJhAqd_2184%3BHRSCAF%3D3164 protein2genome match_part 31566 31775 1426 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1532540;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446673;Name=Mlong585_07911-RA;Target=Mlong585_07911-RA 82 154;Gap=M14 I3 M56;
ScJhAqd_2184%3BHRSCAF%3D3164 protein2genome match_part 31872 32621 1426 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1532541;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446673;Name=Mlong585_07911-RA;Target=Mlong585_07911-RA 155 409;Gap=M126 I5 M124;
ScJhAqd_2184%3BHRSCAF%3D3164 protein2genome protein_match 33816 35829 1394 - . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446674;Name=Mlong585_12901-RA;
ScJhAqd_2184%3BHRSCAF%3D3164 protein2genome match_part 34916 35829 1394 - . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1532542;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446674;Name=Mlong585_12901-RA;Target=Mlong585_12901-RA 41 343;Gap=M27 D1 M276 F2;
ScJhAqd_2184%3BHRSCAF%3D3164 protein2genome match_part 33816 34182 1394 - . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1532543;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446674;Name=Mlong585_12901-RA;Target=Mlong585_12901-RA 344 466;Gap=R2 M123;
ScJhAqd_2184%3BHRSCAF%3D3164 protein2genome protein_match 49636 51466 1091 - . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446675;Name=Mlong585_07901-RA;
ScJhAqd_2184%3BHRSCAF%3D3164 protein2genome match_part 51354 51466 1091 - . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1532544;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446675;Name=Mlong585_07901-RA;Target=Mlong585_07901-RA 1 36;Gap=M20 D1 M16 F2;

and est2genome in the gff3 as well:

ScJhAqd_2184%3BHRSCAF%3D3164 est2genome expressed_sequence_match 48887305 48890708 16239 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547163;Name=Sh_Salba_v2_61181;
ScJhAqd_2184%3BHRSCAF%3D3164 est2genome match_part 48887305 48889881 16239 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1871792;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547163;Name=Sh_Salba_v2_61181;Target=Sh_Salba_v2_61181 1 2590 +;Gap=M285 D1 M288 I10 M5 I4 M1998;
ScJhAqd_2184%3BHRSCAF%3D3164 est2genome match_part 48889982 48890708 16239 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1871793;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547163;Name=Sh_Salba_v2_61181;Target=Sh_Salba_v2_61181 2591 3317 +;Gap=M727;
ScJhAqd_2184%3BHRSCAF%3D3164 est2genome expressed_sequence_match 48887305 48890708 16412 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547164;Name=Sh_Salba_v2_61182;
ScJhAqd_2184%3BHRSCAF%3D3164 est2genome match_part 48887305 48889881 16412 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1871794;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547164;Name=Sh_Salba_v2_61182;Target=Sh_Salba_v2_61182 1 2590 +;Gap=M285 D1 M288 I10 M5 I4 M1998;
ScJhAqd_2184%3BHRSCAF%3D3164 est2genome match_part 48889949 48890708 16412 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1871795;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547164;Name=Sh_Salba_v2_61182;Target=Sh_Salba_v2_61182 2591 3350 +;Gap=M760;
ScJhAqd_2184%3BHRSCAF%3D3164 est2genome expressed_sequence_match 48895479 48899036 9582 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547165;Name=Sh_Salba_v2_108280;

Thanks,
Parul

On Oct 8, 2018, at 3:11 PM, Carson Holt wrote:

We had run BUSCO and there is no problem in genome assembly. I used RepeatMasker (separately from the MAKER pipeline) for masking the repeats using a custom-generated library (de novo repeats and a repeat library from other species as well). The masked genome was used as input in maker_opts.ctl.

Let MAKER run masking if possible. Also, BUSCO can be used to train Augustus, which can then become the gene predictor in MAKER.

Transcripts-
We have RNA-Seq data assembled using velvet/oases from the same species as the sequenced genome. I globally aligned transcripts over the assembled genome using GMAP, which gave ~99% mapping. The GFF3 generated from GMAP was also checked in a genome browser. Those transcripts were used as est input in maker_opts.ctl. These assembled transcripts may have redundancy.

est2genome doesn't work with est_gff. You must provide fasta of assembled transcripts. You can revert back to the GFF3 if you want after training.
Proteins- I used protein (fasta seq) sequences downloaded from uniprot for 5 closely related species and one from in-house sequenced genome (already published). Protein sequences from all 6 organisms are concatenated in one file and used as protein evidence in maker_opts.ctl.

Look at the contigs in a browser. Find a contig with protein2genome results in the GFF3 (i.e. the column is marked protein2genome in the GFF3), and look at it specifically. If you don't find any, then the issue is either your pre-masking or the evidence proteins you gave. I'd recommend using UniProt/Swiss-Prot, which contains a broad set of curated and conserved proteins.

atleast=transcripts.fasta (from in-house sequenced genome (already published))

These will be ignored until you have a trained HMM (this type of alignment can only be used as hints to the trained predictor).

--Carson

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
From peachandolives at gmail.com Fri Oct 12 02:23:07 2018
From: peachandolives at gmail.com (Linnie Linnie)
Date: Fri, 12 Oct 2018 10:23:07 +0200
Subject: [maker-devel] maker-level google group
Message-ID: 

Dear maker team,

I hope this email finds you well. I am a member of the maker-devel google group, but, somehow, I cannot post questions. Is there anything I can do on my end to fix this?

Also, I was wondering where I can download maker3 (I cannot seem to find it online). I have been using maker2, but I wanted to use EVM, and I have read that maker3 implements it.

Thank you so much for your help,
Linnie

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
From yli at utexas.edu Tue Oct 16 22:49:13 2018
From: yli at utexas.edu (Yiyuan Li)
Date: Tue, 16 Oct 2018 23:49:13 -0500
Subject: [maker-devel] Speed up maker annotation on long scaffolds
Message-ID: <4361720F-0F1B-43DA-8931-218CCCD71AF4@utexas.edu>

Dear Maker support,

I have a quick question about annotating chromosome-level scaffolds.
I have a new genome assembly from Hi-C data. The top 4 scaffolds are chromosome-level, ~100-170 Mbp long. I tried to use Maker MPI but it runs slowly. Each scaffold has been running for weeks. I was wondering if you have any suggestions on how to make the annotation process faster?

Thank you!

YY

From peachandolives at gmail.com Thu Oct 18 02:29:57 2018
From: peachandolives at gmail.com (Linnie Linnie)
Date: Thu, 18 Oct 2018 10:29:57 +0200
Subject: [maker-devel] maker3
Message-ID: 

Dear maker team,

I am trying to run maker and use its output for EVM. From the EVM website, I gather that I need to provide it with .gff files. Maker2 does output one .gff, but I was wondering how to produce .gff files for the proteins and EST data.

Alternatively, I have read that maker3 implements EVM. I would be happy to try this option, but I don't know where I can download maker3 from.

I would appreciate any help. Thank you very much!

Linnie

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
From carsonhh at gmail.com Fri Oct 19 11:02:22 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 19 Oct 2018 11:02:22 -0600
Subject: [maker-devel] maker problem
In-Reply-To: <20878280-1B0C-4CC5-BD92-20FB57A44662@oregonstate.edu>
References: <189553EC-9D1F-4C2F-8672-3562C8A4A088@oregonstate.edu> <177AD833-BA97-40CD-B500-1AD4531DE41A@gmail.com> <258E6D0D-6A34-42E2-91F3-F7693ED42E7C@genetics..utah.edu> <90D768D2-F911-4BA3-A8C5-1DAE79566114@oregonstate.edu> <41ABA575-D58A-4FE0-83CD-9312617AA635@gmail.com> <20878280-1B0C-4CC5-BD92-20FB57A44662@oregonstate.edu>
Message-ID: <3F78E884-11AF-4291-A8FC-D81F6F55B47D@gmail.com>

Once Augustus is trained it will have a new species directory under .../augustus/config/species/ for the organism you just trained. Or if you trained Augustus elsewhere (website, BUSCO, etc.) you have to copy the species data there.
Then you just supply the species name and Augustus automatically finds it (see Augustus documentation on training).

For est2genome=1 and protein2genome=1, MAKER takes the alignments from exonerate protein2genome and est2genome and, if they are mostly open reading frame, just turns them directly into gene/mRNA/exon/CDS models. If there are none of those in the resulting GFF3 but there are est2genome and protein2genome alignments, then all of them have a broken ORF. That means there are serious issues with your assembly, or with the est fasta or protein fasta file. For a protein fasta, I recommend using uniprot/swissprot because it is manually curated and contains a broad dataset. But if you cannot get gene models from uniprot/swissprot protein2genome alignments, then your assembly has issues (either too fragmented, lots of errors inducing random stop codons, or lots of N's interspersed in the sequence).

--Carson

> On Oct 8, 2018, at 2:40 PM, Gupta, Parul wrote:
>
> I used Augustus to generate training set (separately from maker) based on transcripts (fasta) so how I can use that Augustus generated trained data (hints in gff3 format) in maker for gene prediction? I can see only Augustus species option there in maker_opts.ctl. Which option I need to turn on in opts.ctl to put Augustus generated hints file? I have augustus.gff as predicted hints.
>
>> est2genome doesn't work with est_gff. You must provide fasta of assembled transcripts. You can revert back to the GFF3 if you want after training.
>
> I used est_fasta not the est_gff.
>
>> Find a contig with protein2genome results in the GFF3
>
> yes I can see protein2genome results in gff3:
>
> ScJhAqd_2184%3BHRSCAF%3D3164 protein2genome protein_match 31566 32621 1426 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446673;Name=Mlong585_07911-RA;
> ScJhAqd_2184%3BHRSCAF%3D3164 protein2genome match_part 31566 31775 1426 + .
> ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1532540;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446673;Name=Mlong585_07911-RA;Target=Mlong585_07911-RA 82 154;Gap=M14 I3 M56; > ScJhAqd_2184%3BHRSCAF%3D3164 > protein2genome > match_part 31872 > 32621 1426 > + . > ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1532541;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446673;Name=Mlong585_07911-RA;Target=Mlong585_07911-RA 155 409;Gap=M126 I5 M124; > ScJhAqd_2184%3BHRSCAF%3D3164 > protein2genome > protein_match 33816 35829 > 1394 - > . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446674;Name=Mlong585_12901-RA; > ScJhAqd_2184%3BHRSCAF%3D3164 > protein2genome > match_part 34916 > 35829 1394 > - . > ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1532542;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446674;Name=Mlong585_12901-RA;Target=Mlong585_12901-RA 41 343;Gap=M27 D1 M276 F2; > ScJhAqd_2184%3BHRSCAF%3D3164 > protein2genome > match_part 33816 > 34182 1394 > - . > ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1532543;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446674;Name=Mlong585_12901-RA;Target=Mlong585_12901-RA 344 466;Gap=R2 M123; > ScJhAqd_2184%3BHRSCAF%3D3164 > protein2genome > protein_match 49636 51466 > 1091 - > . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446675;Name=Mlong585_07901-RA; > ScJhAqd_2184%3BHRSCAF%3D3164 > protein2genome > match_part 51354 > 51466 1091 > - . > ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1532544;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446675;Name=Mlong585_07901-RA;Target=Mlong585_07901-RA 1 36;Gap=M20 D1 M16 F2; > > and est2genome in gff3 as well: > > ScJhAqd_2184%3BHRSCAF%3D3164 > est2genome > expressed_sequence_match > 48887305 48890708 > 16239 + > . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547163;Name=Sh_Salba_v2_61181; > ScJhAqd_2184%3BHRSCAF%3D3164 > est2genome > match_part 48887305 > 48889881 16239 > + . 
> ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1871792;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547163;Name=Sh_Salba_v2_61181;Target=Sh_Salba_v2_61181 1 2590 +;Gap=M285 D1 M288 I10 M5 I4 M1998; > ScJhAqd_2184%3BHRSCAF%3D3164 > est2genome > match_part 48889982 > 48890708 16239 > + . > ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1871793;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547163;Name=Sh_Salba_v2_61181;Target=Sh_Salba_v2_61181 2591 3317 +;Gap=M727; > ScJhAqd_2184%3BHRSCAF%3D3164 > est2genome > expressed_sequence_match > 48887305 48890708 > 16412 + > . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547164;Name=Sh_Salba_v2_61182; > ScJhAqd_2184%3BHRSCAF%3D3164 > est2genome > match_part 48887305 > 48889881 16412 > + . > ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1871794;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547164;Name=Sh_Salba_v2_61182;Target=Sh_Salba_v2_61182 1 2590 +;Gap=M285 D1 M288 I10 M5 I4 M1998; > ScJhAqd_2184%3BHRSCAF%3D3164 > est2genome > match_part 48889949 > 48890708 16412 > + . > ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1871795;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547164;Name=Sh_Salba_v2_61182;Target=Sh_Salba_v2_61182 2591 3350 +;Gap=M760; > ScJhAqd_2184%3BHRSCAF%3D3164 > est2genome > expressed_sequence_match > 48895479 48899036 > 9582 + > . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547165;Name=Sh_Salba_v2_108280; > > Thanks, > Parul > >> On Oct 8, 2018, at 3:11 PM, Carson Holt > wrote: >> >> >>> We had run BUSCO and there is no problem in genome assembly. I used RepeatMasker (separately from maker pipeline) for masking the repeats using custom generated library (denovo repeats and repeat library from other species as well). The masked genome was used as input in maker_opts.ctl. >> >> Let MAKER run masking if possible. Also BUSCO can be used to train Augustus which can then become the gene predictor in MAKER. >> >> >>> Transcripts- >>> We have RNA-Seq data assembled using velvet /oases from the same species as for genome sequenced. 
I globally aligned transcripts over assembled genome using GMAP, which gave ~99% mapping. Gff3 generated from GMAP was also checked on genome browser. Those transcripts were used as est input in maker_opts.ctl. These assembled transcripts may have redundancy.
>>
>> est2genome doesn't work with est_gff. You must provide fasta of assembled transcripts. You can revert back to the GFF3 if you want after training.
>>
>>
>>> Proteins-
>>> I used protein (fasta seq) sequences downloaded from uniprot for 5 closely related species and one from in-house sequenced genome (already published). Protein sequences from all 6 organisms are concatenated in one file and used as protein evidence in maker_opts.ctl.
>>
>> Look at the contigs in a browser. Find a contig with protein2genome results in the GFF3 (i.e. the column is marked protein2genome in the GFF3), and look at it specifically. If you don't find any, then the issue is either your pre-masking or the evidence proteins you gave. I'd recommend using UniProt/Swiss-Prot, which contains a broad set of curated and conserved proteins.
>>
>>
>>> atleast=transcripts.fasta (from in-house sequenced genome (already published))
>>
>> These will be ignored until you have a trained HMM (this type of alignment can only be used as hints to the trained predictor).
>>
>> --Carson
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
From carsonhh at gmail.com Fri Oct 19 11:09:30 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 19 Oct 2018 11:09:30 -0600
Subject: [maker-devel] Speed up maker annotation on long scaffolds
In-Reply-To: <4361720F-0F1B-43DA-8931-218CCCD71AF4@utexas.edu>
References: <4361720F-0F1B-43DA-8931-218CCCD71AF4@utexas.edu>
Message-ID: <28BAD1D1-77BA-4F50-A54F-7E402589E76F@gmail.com>

You might not have MPI set up correctly. MPI spread across 10 machines (20 cores each) can annotate an entire maize chromosome in ~20 minutes.

A few tests.
#this command should print all the hosts you are running MPI on and how many cores on each host. If you don't see multiple hosts you are not spreading across machines.
mpiexec hostname | sort | uniq -c

#this will let you know if maker is running MPI correctly (should print help message only once)
mpiexec maker -h

--Carson

> On Oct 16, 2018, at 10:49 PM, Yiyuan Li wrote:
>
> Dear Maker support,
> I have a quick question about annotating chromosome-level scaffolds. I have a new genome assembly from Hi-C data. The top 4 scaffolds are chromosome-level, which are ~100-170M bp long. I tried to use Maker MPI but it runs slow. Each scaffold has been running for weeks. I was wondering if you may have any suggestions on how to make the annotation process faster?
>
> Thank you!
>
> YY
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From carsonhh at gmail.com Fri Oct 19 11:22:12 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 19 Oct 2018 11:22:12 -0600
Subject: [maker-devel] Segfault with OpenMPI
In-Reply-To: 
References: <8D151E3B-353F-4FD5-94DB-95C1125A8176@gmail.com>
Message-ID: <78C1AB95-8D23-4D71-939B-5B68666BE5B7@gmail.com>

RepeatMasker does some data prep during installation (creating new files in the process), and that does not happen with the bioconda RepeatMasker recipe, so it's broken. To fix it, look at the homebrew recipe for RepeatMasker. It does a good job, and it also preconfigures RepeatMasker for the free Dfam database rather than RepBase light --> https://github.com/brewsci/homebrew-bio/blob/master/Formula/repeatmasker.rb

te_proteins is not a RepeatMasker file. It's a RepeatRunner file which has been integrated into MAKER. MAKER just needs to be able to find it. It will look in the .../maker/data/ directory by default and put the location in te_protein= by default.
--Carson

> On Oct 18, 2018, at 7:52 AM, Anthony Bretaudeau wrote:
>
> Hi,
>
> I think I finally found a solution for this segfault. In short: run "export THREADS_DAEMON_MODEL=1" before running maker.
>
> After looking at the debug log, I noticed that the segfault happened the first time the perl system() function was called (usually to launch a "mv" command).
>
> This, together with the backtrace, shows that it has something to do with signal handling when running a child process from threads.
>
> After a lot of trial and error modifying the code, I found this page talking about this env var: https://metacpan.org/pod/forks#Co-existance-with-fork-aware-modules-and-environments
> It seems to be enough to avoid the segfault. I have no idea if it could have any downside, but maker seems to give the same results as in non-mpi mode.
>
>
>
> Concerning RepeatMasker not being installed correctly, it seems to be intended, as written in the RepeatMasker conda recipe: https://github.com/bioconda/bioconda-recipes/blob/master/recipes/repeatmasker/build.sh#L16
> I use the REPEATMASKER_LIB_DIR env var so it's not really a problem for me, and the galaxy tool is doing the same (https://github.com/galaxyproject/tools-iuc/blob/master/tools/maker/maker.xml#L11 ).
>
> I'm not a RepeatMasker expert, so I don't know if providing the old database would make more sense...
>
> I guess it's the same question for te_proteins.
>
>
>
> Cheers
>
> Anthony
>
>
>
>
>
> On 05/10/2018 at 22:37, Carson Holt wrote:
>> I tried setting this up but there are a number of issues I run into.
>>
>> First RepeatMasker is not being installed correctly. The configuration step should create these files (created by ./configure script during RepeatMasker setup) -->
>> RepeatMasker.lib
>> RepeatMasker.lib.nhr
>> RepeatMasker.lib.nin
>> RepeatMasker.lib.nsq
>> RepeatMaskerLib.embl
>>
>> But they do not exist in the share directory.
>>
>> Also MAKER needs access to the te_proteins file in .../maker/data, and because you have rearranged maker's structure it can't find it.
>>
>>
>> Then for the Segmentation fault, I have seen this a handful of times in the past where users install their own version of perl rather than using the system perl together with their own install of OpenMPI. The issue is some series of flags either in OpenMPI or perl (I'm not sure which). But one way around it is to disable the interpreter threads option when compiling and installing perl for yourself. Most system perl installs have interpreter threads enabled, so I'm not sure why some self-installs generate this segfault and never the system perl. Interestingly, interpreter threads are turned off by default when you install perl manually as they are "officially discouraged". You actually have to enable them during the self-install process, and conda is enabling them on the manual install to match most system perls.
>>
>> Another workaround is to not use OpenMPI. Try MPICH3.
>>
>>
>> --Carson
>>
>>
>>
>>
>>
>>> On Sep 25, 2018, at 6:10 AM, Anthony Bretaudeau wrote:
>>>
>>> Hi,
>>>
>>> I've worked on the Bioconda recipe for Maker (https://github.com/bioconda/bioconda-recipes/tree/master/recipes/maker/ ). It works well, except when using it in MPI mode. I get this segfault error:
>>>
>>> STATUS: Processing and indexing input FASTA files...
>>> [cl1n022:06306] *** Process received signal *** >>> [cl1n022:06306] Signal: Segmentation fault (11) >>> [cl1n022:06306] Signal code: Address not mapped (1) >>> [cl1n022:06306] Failing at address: 0x514 >>> [cl1n022:06306] [ 0] /lib64/libpthread.so.0(+0xf6d0)[0x2b9ce51026d0] >>> [cl1n022:06306] [ 1] /local/miniconda3/envs/maker-2.31.10/bin/perl(Perl_csighandler+0x1e)[0x4aad4e] >>> [cl1n022:06306] [ 2] /lib64/libpthread.so.0(+0xf6d0)[0x2b9ce51026d0] >>> [cl1n022:06306] [ 3] /lib64/libc.so.6(__poll+0x2d)[0x2b9ce5f5cf0d] >>> [cl1n022:06306] [ 4] /local/miniconda3/envs/maker-2.31.10/perl/lib/auto/Parallel/Application/MPI/../../../../../../lib/./libopen-pal.so.40(+0x869e5)[0x2b9cf05859e5] >>> [cl1n022:06306] [ 5] /local/miniconda3/envs/maker-2.31.10/perl/lib/auto/Parallel/Application/MPI/../../../../../../lib/./libopen-pal.so.40(opal_libevent2022_event_base_loop+0x242)[0x2b9cf057a73a] >>> [cl1n022:06306] [ 6] /local/miniconda3/envs/maker-2.31.10/perl/lib/auto/Parallel/Application/MPI/../../../../../../lib/./libopen-pal.so.40(+0x384de)[0x2b9cf05374de] >>> [cl1n022:06306] [ 7] /lib64/libpthread.so.0(+0x7e25)[0x2b9ce50fae25] >>> [cl1n022:06306] [ 8] /lib64/libc.so.6(clone+0x6d)[0x2b9ce5f67bad] >>> [cl1n022:06306] *** End of error message *** >>> SIGTERM received >>> SIGTERM received >>> >>> >>> As mentioned in older posts, I've tried adding the LD_PRELOAD variable, or running mpirun with the "-mca btl ^openib" option, but it didn't help. >>> >>> As this happens with the Bioconda package, I guess it should be pretty reproducible on other setups. >>> >>> Bioconda's Maker package uses version 5.26.2 of Perl and version 3.1.2 of OpenMPI, and the OpenMPI recipe is on https://github.com/conda-forge/openmpi-feedstock/tree/master/recipe >>> Any help would be highly appreciated! 
>>>
>>> Anthony Bretaudeau
>>>
>>> _______________________________________________
>>> maker-devel mailing list
>>> maker-devel at box290.bluehost.com
>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
From carsonhh at gmail.com Fri Oct 19 11:25:40 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 19 Oct 2018 11:25:40 -0600
Subject: [maker-devel] maker3
In-Reply-To: 
References: 
Message-ID: <1D30ACCC-1DC4-451E-8553-8AB8ADA269A2@gmail.com>

The maker 3 beta is one of the links you get when you register to download maker. It will be the link directly under the stable release link --> http://yandell.topaz.genetics.utah.edu/cgi-bin/maker_license.cgi

Also you can use grep to pull out specific lines of a gff3 file. Example:

grep -P "\tprotein2genome\t" all.gff > protein2genome.gff

That command will grab all the protein2genome features out of a file.

--Carson

> On Oct 18, 2018, at 2:29 AM, Linnie Linnie wrote:
>
> Dear maker team,
>
> I am trying to run maker and use its input for EVM. From the EVM website, I gather that I need to provide it with .gff files. Maker2 does output one .gff, but I was wondering how to produce .gff files for the proteins and ETS data.
>
> Alternatively, I have read that maker3 implements EVM. I would be happy to try this option, but I don't know where can I download maker3 from.
>
> I would appreciate any help. Thank you very much!
>
> Linnie
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From jacques.dainat at nbis.se Tue Oct 23 07:56:09 2018 From: jacques.dainat at nbis.se (Jacques Dainat) Date: Tue, 23 Oct 2018 15:56:09 +0200 Subject: [maker-devel] CIGAR string explanation Message-ID: <8783564C-A8FA-419A-A651-EE53C1563A7F@nbis.se> Hello, Here an example of the cigar string output from exonerate (exactly the same command as launched by MAKER) cigar: P46461.1 3 740 . genome 460484 439594 - 2580 M 84 I 1 D 56 M 154 I 3 M 54 D 1554 M 145 D 3346 M 137 D 120 M 160 D 197 M 182 D 145 M 165 D 415 M 170 D 5037 M 321 D 124 M 158 D 116 M 183 D 1819 M 157 D 5776 M 115 vulgar: P46461.1 3 740 . genome 460484 439594 - 2580 M 28 84 G 1 0 S 0 2 5 0 2 I 0 50 3 0 2 S 1 1 M 51 153 G 3 0 M 18 54 S 0 2 5 0 2 I 0 1548 3 0 2 S 1 1 M 48 144 S 0 1 5 0 2 I 0 3341 3 0 2 S 1 2 M 45 135 S 0 2 5 0 2 I 0 114 3 0 2 S 1 1 M 53 159 S 0 1 5 0 2 I 0 192 3 0 2 S 1 2 M 60 180 5 0 2$ -- completed exonerate analysis and here the result we get in the protein2genome.gff output from MAKER @000426F|arrow|arrow protein2genome protein_match 439595 460484 2580 - . ID=@000426F|arrow|arrow:hit:153696:3.10.0.4;Name=P46461.1;target_length=745;aligned_coverage=98.93;aligned_identity=72.6 @000426F|arrow|arrow protein2genome match_part 460399 460484 2580 - . ID=@000426F|arrow|arrow:hsp:233933:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 4 32;Gap=F2 I1 M28 @000426F|arrow|arrow protein2genome match_part 460135 460344 2580 - . ID=@000426F|arrow|arrow:hsp:233934:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 33 105;Gap=F2 M18 I3 M52 R2 @000426F|arrow|arrow protein2genome match_part 458437 458582 2580 - . ID=@000426F|arrow|arrow:hsp:233935:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 106 154;Gap=F1 M49 R2 @000426F|arrow|arrow protein2genome match_part 454953 455091 2580 - . 
ID=@000426F|arrow|arrow:hsp:233936:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 155 200;Gap=F2 M46 R1 @000426F|arrow|arrow protein2genome match_part 454674 454834 2580 - . ID=@000426F|arrow|arrow:hsp:233937:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 201 254;Gap=F1 M54 R2 @000426F|arrow|arrow protein2genome match_part 454296 454477 2580 - . ID=@000426F|arrow|arrow:hsp:233938:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 255 315;Gap=M61 R1 @000426F|arrow|arrow protein2genome match_part 453985 454150 2580 - . ID=@000426F|arrow|arrow:hsp:233939:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 316 370;Gap=F1 M55 @000426F|arrow|arrow protein2genome match_part 453401 453570 2580 - . ID=@000426F|arrow|arrow:hsp:233940:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 371 427;Gap=M57 R1 @000426F|arrow|arrow protein2genome match_part 448042 448363 2580 - . ID=@000426F|arrow|arrow:hsp:233941:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 428 534;Gap=F1 M107 @000426F|arrow|arrow protein2genome match_part 447761 447918 2580 - . ID=@000426F|arrow|arrow:hsp:233942:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 535 587;Gap=M53 R1 @000426F|arrow|arrow protein2genome match_part 447460 447644 2580 - . ID=@000426F|arrow|arrow:hsp:233943:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 588 648;Gap=F2 M61 @000426F|arrow|arrow protein2genome match_part 445484 445642 2580 - . ID=@000426F|arrow|arrow:hsp:233944:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 649 701;Gap=F2 M53 R2 @000426F|arrow|arrow protein2genome match_part 439595 439709 2580 - . ID=@000426F|arrow|arrow:hsp:233945:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 702 740;Gap=M39 R2 MAKER apparently process the CIGAR string and save it into the Gap attribute. 
The value looks like a CIGAR string but it is different. Here are the different letters we can find (M, D, I, R, F). I guess M=match, D=deletion and I=insertion, but I don't get the meaning of the R and F.
Could you explain their meanings?

Best regards,

/Jacques
-------------------------------------------------
Jacques Dainat, Ph.D.
NBIS (National Bioinformatics Infrastructure Sweden)
Genome Annotation Service
http://nbis.se/about/staff/jacques-dainat
http://nbis.se

Contact
Address: Uppsala University, Biomedicinska Centrum
Department of Medical Biochemistry Microbiology, Genomics
Husargatan 3, box 582
S-75123 Uppsala Sweden
Phone: +46 18 471 46 25

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
From carsonhh at gmail.com Tue Oct 23 09:55:51 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 23 Oct 2018 09:55:51 -0600
Subject: [maker-devel] CIGAR string explanation
In-Reply-To: <8783564C-A8FA-419A-A651-EE53C1563A7F@nbis.se>
References: <8783564C-A8FA-419A-A651-EE53C1563A7F@nbis.se>
Message-ID: <9B7CB8C1-2272-4E2A-A435-73642920B623@gmail.com>

Once upon a time the link in the official GFF3 specification to the cigar string documentation actually worked and it would bring you to a nice page that explained everything. It described how the F and R were to be used on protein space alignments (F is a forward frame shift and R is a reverse frame shift in the alignment). M1 in protein space is actually an amino acid match (matches 3 bp in nucleotide space); this was previously clear in the now broken link. At the same time I1 is an amino acid insertion (3 bp in nucleotide space), and D1 is an amino acid deletion (3 bp in nucleotide space). F and R therefore allow for single bp movement either to the left or right within amino acid space.
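
A small sketch of how the Gap values in this thread can be sanity-checked. It assumes the GFF3 Gap convention for protein-to-nucleotide alignments: M, I and D are counted in amino acids, F and R in nucleotides; M and D consume genome sequence, M and I consume protein sequence. The function name gap_spans is hypothetical and is not part of MAKER or Exonerate.

```python
import re

def gap_spans(gap):
    """Return (genome_bp, protein_aa) covered by a protein-space Gap string.

    Assumed convention (GFF3 Gap attribute, protein vs. nucleotide):
      M n : n amino acids matched      -> 3*n bp on the genome, n residues
      I n : n residues with no genome  -> 0 bp, n residues
      D n : n codons with no residue   -> 3*n bp, 0 residues
      F n : forward frame shift        -> +n bp on the genome
      R n : reverse frame shift        -> -n bp on the genome
    """
    genome_bp = 0
    protein_aa = 0
    for op, count in re.findall(r"([MIDFR])(\d+)", gap):
        n = int(count)
        if op == "M":
            genome_bp += 3 * n
            protein_aa += n
        elif op == "I":
            protein_aa += n
        elif op == "D":
            genome_bp += 3 * n
        elif op == "F":
            genome_bp += n
        elif op == "R":
            genome_bp -= n
    return genome_bp, protein_aa

# First match_part from the protein2genome output quoted in this thread
print(gap_spans("F2 I1 M28"))
```

For that first match_part this gives 86 bp and 29 residues, which agrees with genome 460399-460484 and Target 4-32 in the GFF3 row. The same check holds for the other rows I tried (e.g. F2 M18 I3 M52 R2 against 460135-460344 and Target 33-105), but treat it as a consistency check rather than a description of MAKER's internal code.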
These frame shifts sometimes show up in Exonerate output as a slightly shifted codon (the codons look stacked), but they also occur when an amino acid is split across a splice site (first part of a codon on one exon and the second part on the next exon). The raw exonerate cigar you show below doesn't have this because it's only half the cigar and it's in nucleotide space. The value shown in Gap= has to be in the same space as the Target= feature, which in this case is a protein. So we build the protein cigar string from the vulgar string according to the now broken documentation on Gap attributes. You have 28 amino acid matches, 1 insertion, and then an amino acid split across the intron (1 bp of the codon on one side and 2 bp on the other side), and it's flipped because the alignment happens on the opposite strand.

--Carson

> On Oct 23, 2018, at 7:56 AM, Jacques Dainat wrote:
>
> Hello,
>
> Here an example of the cigar string output from exonerate (exactly the same command as launched by MAKER)
>
> cigar: P46461.1 3 740 . genome 460484 439594 - 2580 M 84 I 1 D 56 M 154 I 3 M 54 D 1554 M 145 D 3346 M 137 D 120 M 160 D 197 M 182 D 145 M 165 D 415 M 170 D 5037 M 321 D 124 M 158 D 116 M 183 D 1819 M 157 D 5776 M 115
> vulgar: P46461.1 3 740 . genome 460484 439594 - 2580 M 28 84 G 1 0 S 0 2 5 0 2 I 0 50 3 0 2 S 1 1 M 51 153 G 3 0 M 18 54 S 0 2 5 0 2 I 0 1548 3 0 2 S 1 1 M 48 144 S 0 1 5 0 2 I 0 3341 3 0 2 S 1 2 M 45 135 S 0 2 5 0 2 I 0 114 3 0 2 S 1 1 M 53 159 S 0 1 5 0 2 I 0 192 3 0 2 S 1 2 M 60 180 5 0 2$
> -- completed exonerate analysis
>
>
> and here the result we get in the protein2genome.gff output from MAKER
>
> @000426F|arrow|arrow protein2genome protein_match 439595 460484 2580 - . ID=@000426F|arrow|arrow:hit:153696:3.10.0.4;Name=P46461.1;target_length=745;aligned_coverage=98.93;aligned_identity=72.6
> @000426F|arrow|arrow protein2genome match_part 460399 460484 2580 - .
ID=@000426F|arrow|arrow:hsp:233933:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 4 32;Gap=F2 I1 M28 > @000426F|arrow|arrow protein2genome match_part 460135 460344 2580 - . ID=@000426F|arrow|arrow:hsp:233934:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 33 105;Gap=F2 M18 I3 M52 R2 > @000426F|arrow|arrow protein2genome match_part 458437 458582 2580 - . ID=@000426F|arrow|arrow:hsp:233935:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 106 154;Gap=F1 M49 R2 > @000426F|arrow|arrow protein2genome match_part 454953 455091 2580 - . ID=@000426F|arrow|arrow:hsp:233936:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 155 200;Gap=F2 M46 R1 > @000426F|arrow|arrow protein2genome match_part 454674 454834 2580 - . ID=@000426F|arrow|arrow:hsp:233937:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 201 254;Gap=F1 M54 R2 > @000426F|arrow|arrow protein2genome match_part 454296 454477 2580 - . ID=@000426F|arrow|arrow:hsp:233938:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 255 315;Gap=M61 R1 > @000426F|arrow|arrow protein2genome match_part 453985 454150 2580 - . ID=@000426F|arrow|arrow:hsp:233939:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 316 370;Gap=F1 M55 > @000426F|arrow|arrow protein2genome match_part 453401 453570 2580 - . ID=@000426F|arrow|arrow:hsp:233940:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 371 427;Gap=M57 R1 > @000426F|arrow|arrow protein2genome match_part 448042 448363 2580 - . ID=@000426F|arrow|arrow:hsp:233941:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 428 534;Gap=F1 M107 > @000426F|arrow|arrow protein2genome match_part 447761 447918 2580 - . 
ID=@000426F|arrow|arrow:hsp:233942:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 535 587;Gap=M53 R1 > @000426F|arrow|arrow protein2genome match_part 447460 447644 2580 - . ID=@000426F|arrow|arrow:hsp:233943:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 588 648;Gap=F2 M61 > @000426F|arrow|arrow protein2genome match_part 445484 445642 2580 - . ID=@000426F|arrow|arrow:hsp:233944:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 649 701;Gap=F2 M53 R2 > @000426F|arrow|arrow protein2genome match_part 439595 439709 2580 - . ID=@000426F|arrow|arrow:hsp:233945:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 702 740;Gap=M39 R2 > > MAKER apparently processes the CIGAR string and saves it into the Gap attribute. The value looks like a CIGAR string but it is different. Here are the different letters we can find (M, D, I, R, F). I guess M=match, D=deletion and I=insertion, but I don't get the meaning of R and F. > Could you explain their meanings? > > Best regards, > > /Jacques > ------------------------------------------------- > Jacques Dainat, Ph.D. > NBIS (National Bioinformatics Infrastructure Sweden) > Genome Annotation Service > http://nbis.se/about/staff/jacques-dainat > http://nbis.se > > - Contact - > Address: Uppsala University, Biomedicinska Centrum > Department of Medical Biochemistry Microbiology, Genomics > Husargatan 3, box 582 > S-75123 Uppsala Sweden > Phone: +46 18 471 46 25 > > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed...
URL:

From peachandolives at gmail.com Wed Oct 24 03:28:52 2018
From: peachandolives at gmail.com (Linnie Linnie)
Date: Wed, 24 Oct 2018 05:28:52 -0400
Subject: [maker-devel] EVM control file and est2genome
Message-ID:

Hi,

I am trying to run MAKER together with EVM. I want to annotate a genome for which there is no evidence data, which is why I am using ESTs and protein data from a closely related species. I am running into two unrelated issues.

The first one is the following: I set up the control files passing alt_est with a fasta file of ESTs, protein with proteins from a closely related species as well as uniprot-sprot.fa, est2genome=1 and protein2genome=1. I am getting the following error:

>ERROR: You must provide some form of EST evidence to use est2genome as a predictor.

Does this mean I can only use est2genome with ESTs from the species of interest?

The second error relates to EVM: I have set the option run_evm=1 in the file maker_opts.ctl. I have used the default parameters in the file maker_evm.ctl. I am getting the following error:

>ERROR: You have failed to provide a value for 'evm' in the control files.

Does this error relate to the maker_opts.ctl file or the maker_evm.ctl one? How could I fix it?

And lastly, a more general but fundamental question. Is my approach sensible? My plan is to run this evidence-based annotation, then perhaps train SNAP, Augustus and GeneMark, and use those output files to re-run MAKER with ab-initio parameters.

I would appreciate any input on any of these issues.

Thank you!

-------------- next part -------------- An HTML attachment was scrubbed...
URL:

From anthony.bretaudeau at inria.fr Wed Oct 24 09:07:48 2018
From: anthony.bretaudeau at inria.fr (Anthony Bretaudeau)
Date: Wed, 24 Oct 2018 17:07:48 +0200
Subject: [maker-devel] Segfault with OpenMPI
In-Reply-To: <78C1AB95-8D23-4D71-939B-5B68666BE5B7@gmail.com>
References: <8D151E3B-353F-4FD5-94DB-95C1125A8176@gmail.com> <78C1AB95-8D23-4D71-939B-5B68666BE5B7@gmail.com>
Message-ID:

An HTML attachment was scrubbed...
URL:

From carsonhh at gmail.com Wed Oct 24 09:46:30 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 24 Oct 2018 09:46:30 -0600
Subject: [maker-devel] Segfault with OpenMPI
In-Reply-To:
References: <8D151E3B-353F-4FD5-94DB-95C1125A8176@gmail.com> <78C1AB95-8D23-4D71-939B-5B68666BE5B7@gmail.com>
Message-ID: <62EBFA6C-4194-4D65-8313-F67EFCAEF47A@gmail.com>

It divides up pieces of contigs as well as individual steps. BLAST, exonerate, snap, augustus can each run on separate machines.

--Carson

> On Oct 24, 2018, at 9:07 AM, Anthony Bretaudeau wrote: > > Hi, > > I'll see if I can improve the conda recipe. > > Just one simple question: how does Maker divide the work between worker nodes in mpi mode? Is it supposed to be 1 contig per node, or are the largest contigs split into smaller chunks, each one potentially treated on different nodes? From my tests I have the feeling it is the first answer, but I'm not sure if it's normal or not. > > Anthony > > On 19/10/2018 at 19:22, Carson Holt wrote: >> RepeatMasker does some data prep during installation (creates new files in the process), and that does not happen for the bioconda RepeatMasker recipe. So it's broken. >> >> For fixing it, look at the homebrew recipe for RepeatMasker. It does a good job where they also have it preconfigure itself for the free Dfam database rather than RepBase light --> >> >> https://github.com/brewsci/homebrew-bio/blob/master/Formula/repeatmasker.rb >> >> te_proteins is not a RepeatMasker file. It's a RepeatRunner file which has been integrated into MAKER.
MAKER just needs to be able to find it. It will look in the .../maker/data/ directory by default and put the location in te_protein= by default. >> >> --Carson >> >>> On Oct 18, 2018, at 7:52 AM, Anthony Bretaudeau wrote: >>> >>> Hi, >>> >>> I think I finally found a solution for this segfault. In short: run "export THREADS_DAEMON_MODEL=1" before running maker. >>> >>> After looking at the debug log, I noticed that the segfault happened the first time the perl system() function was called (usually to launch a "mv" command). >>> >>> This, plus the backtrace, shows that it has something to do with signal handling when running child processes from threads. >>> >>> After a lot of trial and error modifying the code, I found this page talking about this env var: https://metacpan.org/pod/forks#Co-existance-with-fork-aware-modules-and-environments >>> It seems to be enough to avoid the segfault. I have no idea if it could have any downside, but maker seems to give the same results as in non-mpi mode. >>> >>> Concerning RepeatMasker not being installed correctly, it seems to be intended, as written in the RepeatMasker conda recipe: https://github.com/bioconda/bioconda-recipes/blob/master/recipes/repeatmasker/build.sh#L16 >>> I use the REPEATMASKER_LIB_DIR env var so it's not really a problem for me, and the Galaxy tool is doing the same (https://github.com/galaxyproject/tools-iuc/blob/master/tools/maker/maker.xml#L11). >>> >>> I'm not a RepeatMasker expert, so I don't know if providing the old database would make more sense... >>> >>> I guess it's the same question for te_proteins. >>> >>> Cheers >>> >>> Anthony >>> >>> On 05/10/2018 at 22:37, Carson Holt wrote: >>>> I tried setting this up but there are a number of issues I run into. >>>> >>>> First RepeatMasker is not being installed correctly.
The configuration step should create these files (created by the ./configure script during RepeatMasker setup) --> >>>> RepeatMasker.lib >>>> RepeatMasker.lib.nhr >>>> RepeatMasker.lib.nin >>>> RepeatMasker.lib.nsq >>>> RepeatMaskerLib.embl >>>> >>>> But they do not exist in the share directory. >>>> >>>> Also MAKER needs access to the te_proteins file in .../maker/data, and because you have rearranged maker's structure it can't find it. >>>> >>>> Then for the Segmentation fault, I have seen this a handful of times in the past where users install their own version of perl rather than using the system perl together with their own install of OpenMPI. The issue is some series of flags either in OpenMPI or perl (I'm not sure which). But one way around it is to disable the interpreter threads option when compiling and installing perl for yourself. Most system perl installs have interpreter threads enabled, so I'm not sure why some self-installs generate this segfault and never the system perl. Interestingly, interpreter threads are turned off by default when you install perl manually as they are "officially discouraged". You actually have to enable them during the self-install process, and conda is enabling them on the manual install to match most system perls. >>>> >>>> Another workaround is to not use OpenMPI. Try MPICH3. >>>> >>>> --Carson >>>> >>>>> On Sep 25, 2018, at 6:10 AM, Anthony Bretaudeau wrote: >>>>> >>>>> Hi, >>>>> >>>>> I've worked on the Bioconda recipe for Maker (https://github.com/bioconda/bioconda-recipes/tree/master/recipes/maker/). It works well, except when using it in MPI mode. I get this segfault error: >>>>> >>>>> STATUS: Processing and indexing input FASTA files...
>>>>> [cl1n022:06306] *** Process received signal *** >>>>> [cl1n022:06306] Signal: Segmentation fault (11) >>>>> [cl1n022:06306] Signal code: Address not mapped (1) >>>>> [cl1n022:06306] Failing at address: 0x514 >>>>> [cl1n022:06306] [ 0] /lib64/libpthread.so.0(+0xf6d0)[0x2b9ce51026d0] >>>>> [cl1n022:06306] [ 1] /local/miniconda3/envs/maker-2.31.10/bin/perl(Perl_csighandler+0x1e)[0x4aad4e] >>>>> [cl1n022:06306] [ 2] /lib64/libpthread.so.0(+0xf6d0)[0x2b9ce51026d0] >>>>> [cl1n022:06306] [ 3] /lib64/libc.so.6(__poll+0x2d)[0x2b9ce5f5cf0d] >>>>> [cl1n022:06306] [ 4] /local/miniconda3/envs/maker-2.31.10/perl/lib/auto/Parallel/Application/MPI/../../../../../../lib/./libopen-pal.so.40(+0x869e5)[0x2b9cf05859e5] >>>>> [cl1n022:06306] [ 5] /local/miniconda3/envs/maker-2.31.10/perl/lib/auto/Parallel/Application/MPI/../../../../../../lib/./libopen-pal.so.40(opal_libevent2022_event_base_loop+0x242)[0x2b9cf057a73a] >>>>> [cl1n022:06306] [ 6] /local/miniconda3/envs/maker-2.31.10/perl/lib/auto/Parallel/Application/MPI/../../../../../../lib/./libopen-pal.so.40(+0x384de)[0x2b9cf05374de] >>>>> [cl1n022:06306] [ 7] /lib64/libpthread.so.0(+0x7e25)[0x2b9ce50fae25] >>>>> [cl1n022:06306] [ 8] /lib64/libc.so.6(clone+0x6d)[0x2b9ce5f67bad] >>>>> [cl1n022:06306] *** End of error message *** >>>>> SIGTERM received >>>>> SIGTERM received >>>>> >>>>> >>>>> As mentioned in older posts, I've tried adding the LD_PRELOAD variable, or running mpirun with the "-mca btl ^openib" option, but it didn't help. >>>>> >>>>> As this happens with the Bioconda package, I guess it should be pretty reproducible on other setups. >>>>> >>>>> Bioconda's Maker package uses version 5.26.2 of Perl and version 3.1.2 of OpenMPI, and the OpenMPI recipe is on https://github.com/conda-forge/openmpi-feedstock/tree/master/recipe >>>>> Any help would be highly appreciated! 
>>>>> >>>>> Anthony Bretaudeau

From carsonhh at gmail.com Wed Oct 24 09:50:43 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 24 Oct 2018 09:50:43 -0600
Subject: [maker-devel] EVM control file and est2genome
In-Reply-To:
References:
Message-ID: <3CB0FDB0-8B7D-4CF8-B957-5935166D5305@gmail.com>

est2genome only works with the data given to est=.

For the second error, you must provide the path of the evm executable in maker_exe.ctl. It apparently was not in your PATH, so it didn't get filled out automatically.

Here is an example from the wiki of using est2genome and protein2genome to train SNAP for the next MAKER run --> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_WGS_Assembly_and_Annotation_Winter_School_2018#Training_ab_initio_Gene_Predictors

--Carson

> On Oct 24, 2018, at 3:28 AM, Linnie Linnie wrote: > > Hi, > > I am trying to run MAKER together with EVM. I want to annotate a genome for which there is no evidence data, which is why I am using ESTs and protein data from a closely related species. I am running into two unrelated issues. > > The first one is the following: > I set up the control files passing alt_est with a fasta file of ESTs, protein with proteins from a closely related species as well as uniprot-sprot.fa, est2genome=1 and protein2genome=1. I am getting the following error: > > >ERROR: You must provide some form of EST evidence to use est2genome as a predictor. > > Does this mean I can only use est2genome with ESTs from the species of interest? > > The second error relates to EVM: > I have set the option run_evm=1 in the file maker_opts.ctl. I have used the default parameters in the file maker_evm.ctl.
I am getting the following error: > > >ERROR: You have failed to provide a value for 'evm' in the control files. > > Does this error relate to the maker_opts.ctl file or the maker_evm.ctl one? How could I fix it? > > And lastly, a more general but fundamental question. Is my approach sensible? My plan is to run this evidence-based annotation, then perhaps train SNAP, Augustus and GeneMark, and use those output files to re-run MAKER with ab-initio parameters. > > I would appreciate any input on any of these issues. > > Thank you!

From jacques.dainat at nbis.se Wed Oct 24 02:41:05 2018
From: jacques.dainat at nbis.se (Jacques Dainat)
Date: Wed, 24 Oct 2018 10:41:05 +0200
Subject: [maker-devel] CIGAR string explanation
In-Reply-To: <9B7CB8C1-2272-4E2A-A435-73642920B623@gmail.com>
References: <8783564C-A8FA-419A-A651-EE53C1563A7F@nbis.se> <9B7CB8C1-2272-4E2A-A435-73642920B623@gmail.com>
Message-ID:

Thanks for your response. It's surprising that the link on the Sequence Ontology web site doesn't work anymore. I will notify them. I was surprised that I was not able to find any resource on the internet describing these values. Helped by your answer, I refined my keywords and googled again, and I finally found old resources describing them too:

from 2004, FlyBase, here: http://rice.bio.indiana.edu:7082/annot/gff3.html
from 2010, WormBase, here: http://wiki.wormbase.org/index.php/GFF3specProposal

I put a copy of the WormBase description here in case those resources also disappear. At that time, it seems it was not yet officially accepted by the SO.
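
As an illustration of the Gap semantics discussed in this thread, here is a minimal Python sketch that adds up how many genome nucleotides and protein residues a protein-space Gap string covers. The function name gap_spans is hypothetical and is not part of MAKER. The handling of M, F and R follows Carson's description; the handling of I (residues present only in the protein) and D (codons present only in the genome) is an assumption based on the GFF3 Gap convention, and only M, I, F and R could be checked against the match_part records quoted in the thread.

```python
import re


def gap_spans(gap):
    """Return (genome_nt, protein_aa) covered by a protein-space Gap string.

    Hypothetical helper for illustration only; it is not MAKER code.
    M, I and D counts are in amino acids, F and R counts are in nucleotides.
    """
    genome_nt = 0
    protein_aa = 0
    for op, count in re.findall(r"([MIDFR])(\d+)", gap):
        n = int(count)
        if op == "M":    # amino acid match: 3 nt on the genome, 1 residue
            genome_nt += 3 * n
            protein_aa += n
        elif op == "I":  # assumption: residues present only in the protein
            protein_aa += n
        elif op == "D":  # assumption: codons present only in the genome
            genome_nt += 3 * n
        elif op == "F":  # forward frameshift: n extra nt on the genome
            genome_nt += n
        elif op == "R":  # reverse frameshift: step back n nt on the genome
            genome_nt -= n
    return genome_nt, protein_aa


# First match_part in the thread: genome 460399..460484, Target 4..32
print(gap_spans("F2 I1 M28"))         # (86, 29)
# Second match_part: genome 460135..460344, Target 33..105
print(gap_spans("F2 M18 I3 M52 R2"))  # (210, 73)
```

Both results agree with the coordinates in the records: 460484 - 460399 + 1 = 86 nt and 32 - 4 + 1 = 29 residues for the first, 210 nt and 73 residues for the second.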
/Jacques > On 23 Oct 2018, at 17:55, Carson Holt wrote: > > Once upon a time the link in the official GFF3 specification to the cigar string documentation actually worked, and it would bring you to a nice page that explained everything. It described how F and R were to be used on protein space alignments (F is a forward frame shift and R is a reverse frame shift in the alignment). M1 in protein space is actually an amino acid match (matches 3 bp in nucleotide space); this was previously clear in the now broken link. At the same time I1 is an amino acid insertion (3bp in nucleotide space), and D1 is an amino acid deletion (3bp in nucleotide space). F and R therefore allow for single bp movement either to the left or right within amino acid space. Sometimes this happens in Exonerate, where it appears as a slightly shifted codon (codons look stacked), but it also happens when an amino acid is split across a splice site (the first part of a codon is on one exon and the second part on the next exon). The raw exonerate cigar you show below doesn't have this because it's only half the cigar and it's in nucleotide space; the value shown in Gap= has to be in the same space as the Target= feature, which in this case is a protein. So we build the protein cigar string from the vulgar string according to the now broken documentation on Gap attributes. You have 28 amino acid matches, 1 insertion, and then an amino acid split across the intron (1bp of the codon on one side and 2bp on the other side), and it's flipped because the alignment happens on the opposite strand. > > --Carson > > >> On Oct 23, 2018, at 7:56 AM, Jacques Dainat wrote: >> >> Hello, >> >> Here is an example of the cigar string output from exonerate (exactly the same command as launched by MAKER) >> >> cigar: P46461.1 3 740 .
genome 460484 439594 - 2580 M 84 I 1 D 56 M 154 I 3 M 54 D 1554 M 145 D 3346 M 137 D 120 M 160 D 197 M 182 D 145 M 165 D 415 M 170 D 5037 M 321 D 124 M 158 D 116 M 183 D 1819 M 157 D 5776 M 115 >> vulgar: P46461.1 3 740 . genome 460484 439594 - 2580 M 28 84 G 1 0 S 0 2 5 0 2 I 0 50 3 0 2 S 1 1 M 51 153 G 3 0 M 18 54 S 0 2 5 0 2 I 0 1548 3 0 2 S 1 1 M 48 144 S 0 1 5 0 2 I 0 3341 3 0 2 S 1 2 M 45 135 S 0 2 5 0 2 I 0 114 3 0 2 S 1 1 M 53 159 S 0 1 5 0 2 I 0 192 3 0 2 S 1 2 M 60 180 5 0 2$ >> -- completed exonerate analysis >> >> >> and here the result we get in the protein2genome.gff output from MAKER >> >> @000426F|arrow|arrow protein2genome protein_match 439595 460484 2580 - . ID=@000426F|arrow|arrow:hit:153696:3.10.0.4;Name=P46461.1;target_length=745;aligned_coverage=98.93;aligned_identity=72.6 >> @000426F|arrow|arrow protein2genome match_part 460399 460484 2580 - . ID=@000426F|arrow|arrow:hsp:233933:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 4 32;Gap=F2 I1 M28 >> @000426F|arrow|arrow protein2genome match_part 460135 460344 2580 - . ID=@000426F|arrow|arrow:hsp:233934:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 33 105;Gap=F2 M18 I3 M52 R2 >> @000426F|arrow|arrow protein2genome match_part 458437 458582 2580 - . ID=@000426F|arrow|arrow:hsp:233935:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 106 154;Gap=F1 M49 R2 >> @000426F|arrow|arrow protein2genome match_part 454953 455091 2580 - . ID=@000426F|arrow|arrow:hsp:233936:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 155 200;Gap=F2 M46 R1 >> @000426F|arrow|arrow protein2genome match_part 454674 454834 2580 - . ID=@000426F|arrow|arrow:hsp:233937:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 201 254;Gap=F1 M54 R2 >> @000426F|arrow|arrow protein2genome match_part 454296 454477 2580 - . 
ID=@000426F|arrow|arrow:hsp:233938:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 255 315;Gap=M61 R1 >> @000426F|arrow|arrow protein2genome match_part 453985 454150 2580 - . ID=@000426F|arrow|arrow:hsp:233939:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 316 370;Gap=F1 M55 >> @000426F|arrow|arrow protein2genome match_part 453401 453570 2580 - . ID=@000426F|arrow|arrow:hsp:233940:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 371 427;Gap=M57 R1 >> @000426F|arrow|arrow protein2genome match_part 448042 448363 2580 - . ID=@000426F|arrow|arrow:hsp:233941:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 428 534;Gap=F1 M107 >> @000426F|arrow|arrow protein2genome match_part 447761 447918 2580 - . ID=@000426F|arrow|arrow:hsp:233942:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 535 587;Gap=M53 R1 >> @000426F|arrow|arrow protein2genome match_part 447460 447644 2580 - . ID=@000426F|arrow|arrow:hsp:233943:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 588 648;Gap=F2 M61 >> @000426F|arrow|arrow protein2genome match_part 445484 445642 2580 - . ID=@000426F|arrow|arrow:hsp:233944:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 649 701;Gap=F2 M53 R2 >> @000426F|arrow|arrow protein2genome match_part 439595 439709 2580 - . ID=@000426F|arrow|arrow:hsp:233945:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 702 740;Gap=M39 R2 >> >> MAKER apparently processes the CIGAR string and saves it into the Gap attribute. The value looks like a CIGAR string but it is different. Here are the different letters we can find (M, D, I, R, F). I guess M=match, D=deletion and I=insertion, but I don't get the meaning of R and F. >> Could you explain their meanings? >> >> Best regards, >> >> /Jacques >> ------------------------------------------------- >> Jacques Dainat, Ph.D.
>> NBIS (National Bioinformatics Infrastructure Sweden) >> Genome Annotation Service >> http://nbis.se/about/staff/jacques-dainat >> http://nbis.se >> >> - Contact - >> Address: Uppsala University, Biomedicinska Centrum >> Department of Medical Biochemistry Microbiology, Genomics >> Husargatan 3, box 582 >> S-75123 Uppsala Sweden >> Phone: +46 18 471 46 25

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Screen Shot 2018-10-24 at 10.00.41.png
Type: image/png
Size: 281561 bytes
Desc: not available
URL:

From elyssa_garza at yahoo.com Wed Oct 24 15:27:50 2018
From: elyssa_garza at yahoo.com (Elyssa Garza)
Date: Wed, 24 Oct 2018 21:27:50 +0000 (UTC)
Subject: [maker-devel] Is gene retrieval from gff possible?
In-Reply-To: <1576161756.398305.1540414096080@mail.yahoo.com>
References: <8783564C-A8FA-419A-A651-EE53C1563A7F@nbis.se> <1576161756.398305.1540414096080@mail.yahoo.com>
Message-ID: <1888825059.421524.1540416470195@mail.yahoo.com>

Hello,

I recently annotated my plant genome and am looking at retrieving a particular set of genes from the MAKER results. I have a list of TAIR IDs that I am particularly interested in and was thinking about using the gff file to help pull out the associated transcripts. I was wondering if you could advise me on the best or easiest way of obtaining the associated TAIR accession or gene model from the gff file.

I did try looking at the genes (41,779 genes) using CLCbio, but the accessions were not easily identified or found. I also looked at the protein matches (819,805 protein matches) and was able to easily find gene model matches corresponding to my target accessions. Is it wise to do this?
Can you explain why I can't find these same protein matches in the gene file? I have some ideas on why this is happening but I am looking for support for them.

Elyssa

From pallavi.gupta at slu.edu Thu Oct 25 15:22:31 2018
From: pallavi.gupta at slu.edu (Pallavi Gupta)
Date: Thu, 25 Oct 2018 21:22:31 +0000
Subject: [maker-devel] Issue with maker
Message-ID:

Hi Team MAKER,

I am using MAKER for genome annotation in my research, but when I run it I get a weird error. I tried to find a workaround on the internet by searching through various bioinformatics forums, but I was unsuccessful. I would really appreciate it if you could help me with this. I have attached my nohup.out log. Please let me know if you need anything else.

Thanks,
Pallavi Gupta

-------------- next part --------------
A non-text attachment was scrubbed...
Name: nohup.out
Type: application/octet-stream
Size: 26365432 bytes
Desc: nohup.out
URL:

From 17na34 at queensu.ca Wed Oct 31 08:27:44 2018
From: 17na34 at queensu.ca (Nikolay Alabi)
Date: Wed, 31 Oct 2018 14:27:44 +0000
Subject: [maker-devel] MAKER not running properly after installation, help needed
Message-ID:

Hello,

I am attempting to annotate a garlic mustard genome using MAKER on a cluster at Queen's University. I have been following the tutorial on the wiki and was attempting to use the practice data to see if the program is running properly and to learn how to train the gene predictors. MAKER is now installed and works to an extent; however, in use it does not work properly and cannot read/annotate a genome.
I suspect two problems are causing this. First, any time any maker command is called, it reports that an argument in forks.pm in perl5 is not correct; after trying to fix the problem, I see that the code should be correct, but the error line still occurs. Second, every time a maker command is called, another error says that an error flow is occurring somewhere in perl. For instance, when I run maker -h, or maker -CTL, or anything to do with maker, the error lines occur. Would you advise me to reinstall perl and bioperl? Other than that, I believe everything else is properly installed, and I do not understand why the program is not running properly. I have even tried using different genome data; however, the same problem occurs: the run never finishes, then retries, and ultimately fails. Please let me know if there is another possible source of error.

Best regards,
Nikolay