From seoanezonjic at hotmail.com Tue Mar 6 03:30:24 2018
From: seoanezonjic at hotmail.com (p sz)
Date: Tue, 6 Mar 2018 09:30:24 +0000
Subject: [maker-devel] Problems with failed contigs
Message-ID:

Hi

I'm currently annotating a fish genome with MAKER 3 and the run does not finish properly. I run MAKER on a shared cluster through MPI, requesting 96 cores (so as not to block the MAKER manager or the file system) across 6 nodes of 16 cores each. The job runs for 3 or 4 days and then the output suddenly stops. The job stays alive for 3 more days, but the datastore index is not modified and the only files whose attributes change are the lock files. When the job exceeds the time limit (7 days) it is cancelled. I then inspect the datastore index and count the unique STARTED and FINISHED tags for each scaffold. The results are:

STARTED:3890
FINISHED:3378

So the analysis fails for about 500 scaffolds. When I inspect MAKER's STDERR output I see this error line repeated ~2000 times, at different places:

substr outside of string at /mnt/home/users/pab_001_uma/pedro/software/maker/bin/../lib/PhatHit_utils.pm line 850

and near this line, the following:

ERROR: Failed while annotating transcripts

My questions are: what can I do to annotate the failed scaffolds? And is the large number of failed analyses the reason the job freezes?

Thanks in advance

From vsoza at uw.edu Wed Mar 7 15:19:15 2018
From: vsoza at uw.edu (Valerie Soza)
Date: Wed, 7 Mar 2018 13:19:15 -0800
Subject: [maker-devel] how to output masked genome from MAKER
Message-ID:

Hi MAKER community

I am wondering whether it is possible to get the entire masked genome generated by MAKER as an output. I think I have found it in pieces in the query.masked.fasta files generated for each scaffold by MAKER in theVoid directories. Is there a script that anyone has used to collate these files?

I am aware of the fasta_merge script that comes bundled with MAKER, but it does not seem to collate the query.masked.fasta files. Could this Perl script be modified to do this?

Thanks for any help or insights.

-Valerie

Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/

From flopezo84 at gmail.com Fri Mar 9 10:15:39 2018
From: flopezo84 at gmail.com (Federico López)
Date: Fri, 9 Mar 2018 11:15:39 -0500
Subject: [maker-devel] Using PASA assemblies with MAKER
Message-ID:

Hello,

I was wondering what might be the recommended option for using PASA2 alignment assemblies with MAKER3:

1. PASA assemblies in FASTA format (est)
2. PASA assembly structures (est_gff)
3. ORFs from PASA assemblies (protein)

And related to this question, when I use the PASA2 assembly structures in GFF3 format, MAKER reports the error below.

"ERROR: Non-unique top level ID for..."

I suppose all the non-unique IDs need to be renamed for MAKER?

Any help is greatly appreciated.

From wangzhennan at ioz.ac.cn Tue Mar 13 22:53:44 2018
From: wangzhennan at ioz.ac.cn (wangzhennan at ioz.ac.cn)
Date: Wed, 14 Mar 2018 11:53:44 +0800 (GMT+08:00)
Subject: [maker-devel] Some transcripts have no AED?
Message-ID: <59a852e2.418b6.16222a46d59.Coremail.wangzhennan@ioz.ac.cn>

Hi,

When I used MAKER to get AED values, some transcripts had no AED value in the result. Why? I used RNA-Seq data and protein to run MAKER. Thank you.

Best wishes.

Wang
T

From d.ence at ufl.edu Wed Mar 14 06:33:01 2018
From: d.ence at ufl.edu (Ence,daniel)
Date: Wed, 14 Mar 2018 11:33:01 +0000
Subject: [maker-devel] Some transcripts have no AED?
In-Reply-To: <59a852e2.418b6.16222a46d59.Coremail.wangzhennan@ioz.ac.cn>
References: <59a852e2.418b6.16222a46d59.Coremail.wangzhennan@ioz.ac.cn>
Message-ID: <00243144-1906-485F-B2CB-977FA2BBD161@ufl.edu>

Hi, can you send a few lines of examples? Do some transcripts have AEDs?

~Daniel

From seoanezonjic at hotmail.com Wed Mar 14 08:52:12 2018
From: seoanezonjic at hotmail.com (p sz)
Date: Wed, 14 Mar 2018 13:52:12 +0000
Subject: [maker-devel] Problems with failed contigs
In-Reply-To:
References:
Message-ID:

Hi

I have tried splitting the FASTA file into small chunks and running a separate execution for each chunk. Most of the executions fail as I described previously. I inspected the results of a random chunk with 287 contigs, and these error lines are repeated 47 times:

substr outside of string at /mnt/home/users/pab_001_uma/pedro/software/maker/bin/../lib/PhatHit_utils.pm line 850.
--> rank=15, hostname=dx095
ERROR: Failed while annotating transcripts
ERROR: Chunk failed at level:1, tier_type:4
FAILED CONTIG:Sosen1_s1284

ERROR: Chunk failed at level:6, tier_type:0
FAILED CONTIG:Sosen1_s1284

Can you help me fix this problem?

Thank you in advance

Pedro Seoane

From vsoza at uw.edu Wed Mar 14 19:21:26 2018
From: vsoza at uw.edu (Valerie Soza)
Date: Wed, 14 Mar 2018 17:21:26 -0700
Subject: [maker-devel] scaffolds missing from master_datastore_index.log and all.gff files
Message-ID: <6F2917B6-12D5-41D6-BD12-CDD7FDF0F5C8@uw.edu>

Hi MAKER community

I have done several rounds of training and annotations on an assembly that consists of 12027 scaffolds using the MAKER2 pipeline. I am running multiple instances of MAKER to speed up the process.
I have noticed that the number of contigs (aka scaffolds) differs among the different rounds of annotations I have done in MAKER, ranging from 12024 to 12027 scaffolds. These counts were obtained with the SOBAcl tool to count the number of contigs from each all.gff file generated by the gff3_merge script included in MAKER.

In my latest round of annotations within MAKER, I have only obtained 12026 scaffolds using the SOBAcl tool on the all.gff file, indicating that I am missing 1 scaffold, even though there was no indication of any scaffolds as FAILED, RETRY, or SKIPPED.

To figure out what might be going on, I searched for STARTED and FINISHED scaffolds in the master_datastore_index.log and found that I had a different number of started vs. finished scaffolds, and none of these were equal to the total in the assembly of 12027.

$ grep STARTED Rwill7_master_datastore_index.log | sort | uniq | cut -f 1 | wc
12024 12024 313247

3 started scaffolds missing from this file are LG01_ordered_scaffold_2, LG01_ordered_scaffold_3, and LG07_unordered_scaffold_86.

$ grep FINISHED Rwill7_master_datastore_index.log | sort | uniq | cut -f 1 | wc
12026 12026 313295

1 finished scaffold missing from this file is LG08_unordered_scaffold_90.

I then searched for these scaffolds in the all.gff file and found that the 3 missing started scaffolds were present, but the one missing finished scaffold (LG08_unordered_scaffold_90) was not. This scaffold (LG08_unordered_scaffold_90) should be in the gff3 as it had some repeat masking done on it as indicated by the query.masked.fasta file for this scaffold in its theVoid directory.

After looking at the gff3_merge and fasta_merge scripts, it seems that only finished scaffolds are used to generate gff3 and fasta files so this explains why I am missing one scaffold (LG08_unordered_scaffold_90) for a total of only 12026 scaffolds in the all.gff file.

I am concerned that, because the started and finished scaffolds differ in the master_datastore_index.log, not all scaffolds are being output to the gff3 and fasta files generated by the MAKER scripts.

Any insights as to why I am getting different numbers of scaffolds indicated as started versus finished? And as to why all but 1 scaffold finished?

Thanks.

-Valerie

From d.ence at ufl.edu Thu Mar 15 08:15:00 2018
From: d.ence at ufl.edu (Ence,daniel)
Date: Thu, 15 Mar 2018 13:15:00 +0000
Subject: [maker-devel] Fwd: Some transcripts have no AED?
References: <8E389D32-28DE-4FE7-A455-15786C1B2EAF@mail.ufl.edu>
Message-ID: <8AE0B9E6-2496-43AE-BC79-9FDDD2A0CFD3@mail.ufl.edu>

Begin forwarded message:

From: "Ence,daniel"
Subject: Re: [maker-devel] Some transcripts have no AED?
Date: March 15, 2018 at 9:06:45 AM EDT
To: "wangzhennan at ioz.ac.cn"
Cc: "Ence,daniel"

Hi, I really can't help you without more information to answer either of your questions (number of genes or the transcripts without AEDs). If you ran MAKER with some evidence to annotate those 23000 genes, then some of those genes were probably not supported by the evidence. Did you give those 23000 genes as predictions or as models? Those are two different options in MAKER and would give different results.

~Daniel

On Mar 15, 2018, at 4:11 AM, wangzhennan at ioz.ac.cn wrote:

Hi, I am so sorry that I did not explain my question clearly. I ran MAKER with an annotation file (GFF) that has 23000 genes, but I got a result with only 21553 genes. Why are there fewer genes than in the original file? Thank you very much!

Best wishes.
Wang

From carsonhh at gmail.com Thu Mar 15 09:57:37 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 15 Mar 2018 08:57:37 -0600
Subject: [maker-devel] Problems with failed contigs
In-Reply-To:
References:
Message-ID: <2F06777F-4BED-4CC8-B4F6-0B78D3EA9405@gmail.com>

First make sure you are using the most current version (3.01.02). If you are providing a GFF3 file, there may be an issue with one or more of the features you are giving. Finally, there is a possibility this is a BLAST report issue, as there is a BLAST truncation bug that appears to be fixed in NCBI BLAST 2.7.1+.

--Carson

From carsonhh at gmail.com Thu Mar 15 10:15:09 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 15 Mar 2018 09:15:09 -0600
Subject: [maker-devel] Using PASA assemblies with MAKER
In-Reply-To:
References:
Message-ID: <8542B83E-159A-47CA-A801-364386E0E2D2@gmail.com>

MAKER requires the two-level match/match_part format. An example can be found in the GFF3 specification --> https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md

I haven't used PASA, but the file you are providing may be in GFF2 or GTF format, which is not compatible with GFF3.

--Carson

From carsonhh at gmail.com Thu Mar 15 10:20:08 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 15 Mar 2018 09:20:08 -0600
Subject: [maker-devel] Some transcripts have no AED?
In-Reply-To: <59a852e2.418b6.16222a46d59.Coremail.wangzhennan@ioz.ac.cn>
References: <59a852e2.418b6.16222a46d59.Coremail.wangzhennan@ioz.ac.cn>
Message-ID:

Do they have no AED, or an AED value of 0? A value of 0 means a perfect match to the evidence.

Also make sure you are not looking for AED in the predictions (i.e. GFF3 source column snap/augustus that are of type match/match_part). Those are rejected models and will not have quality statistics calculated for them (they are only there as alignments for reference purposes). Only models with a source column of 'maker' and a type of 'gene/mRNA/exon/CDS' represent the gene models and will have AED and QI values.

You can force MAKER to also calculate quality statistics for rejected models by altering pred_stats in the maker_opts.ctl file.

pred_stats=0 #report AED and QI statistics for all predictions as well as models

--Carson
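
The distinction between "no AED" and "an AED of 0" can be checked directly on the GFF3. The following is a minimal Python sketch, not part of MAKER: it assumes a tab-delimited GFF3 in which gene-model transcripts have source `maker` and type `mRNA`, with the score stored in the `_AED` attribute. The helper names `parse_attributes` and `transcript_aeds` are made up for illustration.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column (key=value;key=value) into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def transcript_aeds(gff_lines):
    """Yield (mRNA ID, AED) for gene-model transcripts.

    AED is a float when the _AED attribute is present (0.0 included)
    and None when the attribute is missing altogether.
    """
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # skips FASTA sequence lines and malformed rows
        source, feature_type = cols[1], cols[2]
        # Assumption: only source 'maker' + type 'mRNA' are final models.
        if source != "maker" or feature_type != "mRNA":
            continue
        attrs = parse_attributes(cols[8])
        aed = attrs.get("_AED")
        yield attrs.get("ID"), (float(aed) if aed is not None else None)


if __name__ == "__main__":
    example = [
        "contig1\tmaker\tmRNA\t20468\t21193\t.\t+\t.\tID=t1;Parent=g1;_AED=0.30\n",
        "contig1\tsnap_masked\tmatch\t20468\t21193\t42.956\t+\t.\tID=hit1;Name=p1\n",
    ]
    for mrna_id, aed in transcript_aeds(example):
        print(mrna_id, "no AED" if aed is None else aed)
```

Rows from ab initio predictors (type match/match_part) are skipped on purpose, since those are the features that normally carry no AED.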

From carsonhh at gmail.com Thu Mar 15 10:26:26 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 15 Mar 2018 09:26:26 -0600
Subject: [maker-devel] scaffolds missing from master_datastore_index.log and all.gff files
In-Reply-To: <6F2917B6-12D5-41D6-BD12-CDD7FDF0F5C8@uw.edu>
References: <6F2917B6-12D5-41D6-BD12-CDD7FDF0F5C8@uw.edu>
Message-ID:

If you are running multiple MAKER jobs at the same time, you can hit a race condition where one MAKER run keeps another from making the correct entry in the datastore log.

You can delete the log and then run 'maker -dsindex' to rebuild it with a single maker process (takes less than 5 minutes).

--Carson
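
To see which scaffolds lack a STARTED or FINISHED entry (before or after rebuilding the log), the grep counts from the thread can be turned into an explicit comparison. This is a minimal Python sketch under stated assumptions: each log line is tab-separated, the scaffold name is in the first column (as the `cut -f 1` in the thread implies), and the status word is in the last column, which is an assumption about the log layout. The function names are hypothetical.

```python
def scaffold_statuses(log_lines):
    """Map each status word (e.g. STARTED, FINISHED) to the set of scaffold names."""
    by_status = {}
    for line in log_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # blank or malformed line
        # Assumption: name in the first column, status in the last one.
        name, status = fields[0], fields[-1].strip()
        by_status.setdefault(status, set()).add(name)
    return by_status


def report_missing(expected_names, log_lines):
    """List expected scaffolds that have no STARTED or no FINISHED entry."""
    by_status = scaffold_statuses(log_lines)
    expected = set(expected_names)
    return {
        "no STARTED entry": sorted(expected - by_status.get("STARTED", set())),
        "no FINISHED entry": sorted(expected - by_status.get("FINISHED", set())),
    }


if __name__ == "__main__":
    # Toy log: the directory column is invented for the example.
    log = [
        "scaffold_1\tdatastore/scaffold_1/\tSTARTED\n",
        "scaffold_1\tdatastore/scaffold_1/\tFINISHED\n",
        "scaffold_2\tdatastore/scaffold_2/\tFINISHED\n",
    ]
    print(report_missing(["scaffold_1", "scaffold_2", "scaffold_3"], log))
```

The expected names would come from the assembly FASTA headers; a scaffold that appears as FINISHED without STARTED matches the kind of inconsistent log entry described in this thread.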

From carsonhh at gmail.com Thu Mar 15 10:31:31 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 15 Mar 2018 09:31:31 -0600
Subject: [maker-devel] how to output masked genome from MAKER
In-Reply-To:
References:
Message-ID: <15BF15D1-835F-4E8E-9B52-A958EF19BCD9@gmail.com>

You will just have to find and concatenate the files yourself. Something like -->

find assembly.maker.output -name 'query.masked.fasta' | xargs cat > assembly_masked.fasta

--Carson
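
For anyone who prefers a script to the find/xargs one-liner, here is a rough Python equivalent. It is a sketch only: the function name `merge_masked_fasta` is invented, the directory and output names are taken from the command above, and it assumes each scaffold has one `query.masked.fasta` under the output directory, as described in the thread. Unlike plain `cat`, it adds a newline after any file that does not end with one, so consecutive records cannot run together.

```python
from pathlib import Path


def merge_masked_fasta(maker_output_dir, merged_path, filename="query.masked.fasta"):
    """Concatenate every `filename` found under maker_output_dir into merged_path.

    Returns the number of files that were merged.
    """
    pieces = sorted(Path(maker_output_dir).rglob(filename))
    with open(merged_path, "wb") as out:
        for piece in pieces:
            last_byte = b"\n"
            with open(piece, "rb") as handle:
                # Copy in 1 MiB chunks so large scaffolds are not loaded whole.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    out.write(chunk)
                    last_byte = chunk[-1:]
            if last_byte != b"\n":
                out.write(b"\n")  # keep the next FASTA header on its own line
    return len(pieces)


if __name__ == "__main__":
    import sys

    if len(sys.argv) == 3:
        count = merge_masked_fasta(sys.argv[1], sys.argv[2])
        print(count, "files merged")
```

Called as `merge_masked_fasta("assembly.maker.output", "assembly_masked.fasta")`, the return value can be compared with the number of scaffolds in the assembly as a quick sanity check.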

From vsoza at uw.edu Thu Mar 15 13:18:46 2018
From: vsoza at uw.edu (Valerie Soza)
Date: Thu, 15 Mar 2018 11:18:46 -0700
Subject: [maker-devel] scaffolds missing from master_datastore_index.log and all.gff files
In-Reply-To:
References: <6F2917B6-12D5-41D6-BD12-CDD7FDF0F5C8@uw.edu>
Message-ID: <53B85802-8AB9-4DB4-AA8A-137DD72AF4D7@uw.edu>

Thanks, Carson, that worked. I am now getting all scaffolds in the assembly indicated as finished in the master_datastore_index.log and when I redo gff3_merge with this updated log, my all.gff file is complete too. Glad there is a quick fix for this in MAKER. Thank you, MAKER developers.

-Valerie

From seoanezonjic at hotmail.com Fri Mar 16 04:33:28 2018
From: seoanezonjic at hotmail.com (p sz)
Date: Fri, 16 Mar 2018 09:33:28 +0000
Subject: [maker-devel] Problems with failed contigs
In-Reply-To: <2F06777F-4BED-4CC8-B4F6-0B78D3EA9405@gmail.com>
References: , <2F06777F-4BED-4CC8-B4F6-0B78D3EA9405@gmail.com>
Message-ID:

I checked the MAKER version and it is the 3.01.02 that you suggest. I am currently running MAKER to train SNAP (first round), so I'm not using GFF files as input. My input files are all in FASTA format, set in the genome, est and protein variables. In fact, one error line says that the transcript analysis breaks the execution. I use BLAST version 2.2.30+; maybe I have to use a newer version. Which version do you suggest?

Thank you in advance

From vsoza at uw.edu Tue Mar 20 19:48:09 2018
From: vsoza at uw.edu (Valerie Soza)
Date: Tue, 20 Mar 2018 17:48:09 -0700
Subject: [maker-devel] clarification on creating a standard build
Message-ID:

Hi MAKER community

I am trying to create a standard build as indicated in the Campbell et al.
2014 papers in Plant Physiology and Current Protocols in Bioinformatics. I was following the protocol as outlined in Current Protocols in Bioinformatics, but then came across this thread in the MAKER google forum: https://groups.google.com/forum/#!searchin/maker-devel/quality_filter%7Csort:date/maker-devel/97aNJkT3bgk/mpL7V5QWAAAJ. I can't reply to this original thread, but I am now trying to follow Carson's suggestion for a standard build using this protocol instead:

"One note I'd like to make, is that doing a second round with keep_preds=1 is the wrong procedure (only do that if you really want to keep everything - i.e. in some fungi or oomycetes). Rather you should use InterProScan to evaluate the rejected models in the non-overlapping.abinit.proteins.fasta file, then grep the ones that have an IPR domain out of the GFF3 (will be match/match_part features) and then pass them to pred_gff in a separate run (just updates the format to gene/mRNA/exon/CDS with proper reading frame). You can then merge the resulting GFF3's and fasta files."

Instead of doing a second round of annotations with keep_preds=1, I am using my original annotations with keep_preds=0. I have used InterProScan on the non-overlapping.abinit.proteins.fasta. I am unclear as to what gff3 file to use to grep for genes with IPR domains from the non-overlapping.abinit.proteins.fasta file. Genes from the non-overlapping.abinit.proteins.fasta file are not in my .all.gff file created by the gff3_merge script.

What gff3 file should I be using to resurrect proteins with IPR domains from the non-overlapping.abinit.proteins.fasta? Should I be doing an annotation with keep_preds=1 as well, and resurrecting genes with IPR domains from this gff3?

Thanks.

-Valerie
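
The "grep the ones that have an IPR domain out of the GFF3" step quoted above could look roughly like the Python sketch below. It does not settle which GFF3 file to use. It only assumes that some GFF3 contains the ab initio predictions as match/match_part features, that the `Name` attribute of each match equals the sequence ID in non-overlapping.abinit.proteins.fasta, that each match precedes its match_part rows, and that the InterProScan TSV carries the InterPro accession in its 12th column (which depends on InterPro lookup being enabled in the run). The function names are invented for illustration.

```python
def ids_with_ipr(interproscan_tsv_lines):
    """Return protein IDs that have at least one hit with an InterPro (IPR) accession."""
    keep = set()
    for line in interproscan_tsv_lines:
        cols = line.rstrip("\n").split("\t")
        # Assumption: column 12 (index 11) holds the InterPro accession, if any.
        if len(cols) >= 12 and cols[11].startswith("IPR"):
            keep.add(cols[0])
    return keep


def select_predictions(gff_lines, wanted_names):
    """Yield match rows whose Name is wanted, followed by their match_part rows."""
    kept_ids = set()
    for line in gff_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        if cols[2] == "match" and attrs.get("Name") in wanted_names:
            kept_ids.add(attrs.get("ID"))
            yield line
        elif cols[2] == "match_part" and attrs.get("Parent") in kept_ids:
            # Assumption: the parent match row appeared earlier in the file.
            yield line


if __name__ == "__main__":
    tsv = ["p1\tmd5\t241\tPfam\tPF00001\tdesc\t1\t100\t1e-5\tT\t01-01-2018\tIPR000001\tdesc\n"]
    gff = [
        "contig1\tsnap_masked\tmatch\t1\t726\t42.9\t+\t.\tID=h1;Name=p1\n",
        "contig1\tsnap_masked\tmatch_part\t1\t726\t42.9\t+\t.\tID=h1:1;Parent=h1\n",
    ]
    for row in select_predictions(gff, ids_with_ipr(tsv)):
        print(row, end="")
```

The selected rows would then be written to a file and supplied through pred_gff, as the quoted procedure describes.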
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/

From urmi208 at gmail.com Wed Mar 21 04:05:42 2018
From: urmi208 at gmail.com (Urmi)
Date: Wed, 21 Mar 2018 09:05:42 +0000
Subject: [maker-devel] Gene loss in subsequent round of maker for fungal genome annotation
Message-ID:

Hello maker community,

I am trying to run maker 3.01.02-beta on a fungal genome. I am using available EST and protein sequences from a different strain of the same species, supplied through the "est" and "protein" parameters in the maker_opts.ctl file. Here is the protocol I am using:

1. Run maker with repeat masking, providing transcript and protein sequences from related species (run A)
2. Create SNAP model with CEGMA
3. Train Augustus with BUSCO
4. Run maker (run B) with the new SNAP model (from step 2) and the Augustus species, with est2genome=0 and protein2genome=0, providing GFF files from run A (altest_gff=runA_cdna2genome.gff, protein_gff=runA_protein2genome.gff3)
5. Create SNAP model from run B
6. Train Augustus with transcripts from run B and BUSCO
7. Run maker (run C) with the new SNAP model (from step 5) and the Augustus species, with est2genome=0 and protein2genome=0, providing GFF files from run A (altest_gff=runA_cdna2genome.gff, protein_gff=runA_protein2genome.gff3), and keep_preds=1

As a result of this, I get the following gene numbers:

- run A: 12796 total genes, of which 12771 have AED < 0.5
- run B: 10713 total genes, of which 10701 have AED < 0.5
- run C: 12651 total genes, of which 12582 have AED < 0.5

Looking at the gff files in detail, I observed that some gene models in run A are lost in run B and regained in run C. I don't understand why there is gene loss in run B. Here is an example:

RunA:

contig1 maker gene 20468 21193 . + . ID=maker-contig1-exonerate_protein2genome-gene-0.34;Name=maker-contig1-exonerate_protein2genome-gene-0.34
contig1 maker mRNA 20468 21193 100 + . ID=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1;Parent=maker-contig1-exonerate_protein2genome-gene-0.34;Name=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1;_AED=0.30;_eAED=0.30;_QI=0|-1|0|1|-1|0|1|0|241
contig1 maker exon 20468 21193 . + . ID=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1:1;Parent=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1
contig1 maker CDS 20468 21193 . + 0 ID=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1:cds;Parent=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1
contig1 blastn expressed_sequence_match 20468 21193 726 + . ID=contig1:hit:983:3.2.0.0;Name=jgi|test_1|140804|est target_length=726
contig1 blastn match_part 20468 21193 726 + . ID=contig1:hsp:998:3.2.0.0;Parent=contig1:hit:983:3.2.0.0;Target=jgi|test_1|140804|est 1 726 +;Gap=M726
contig1 est2genome expressed_sequence_match 20468 21193 3630 + . ID=contig1:hit:1022:3.2.0.0;Name=jgi|test_1|140804|est;target_length=726;aligned_coverage=100;aligned_identity=100
contig1 est2genome match_part 20468 21193 3630 + . ID=contig1:hsp:1110:3.2.0.0;Parent=contig1:hit:1022:3.2.0.0;Target=jgi|test_1|140804|est 1 726 +;Gap=M726

RunB:

contig1 est_gff:est2genome expressed_sequence_match 20468 21193 3630 + . ID=contig1:hit:1051:3.12.0.0;Name=jgi|test_1|140804|est;target_length=726;aligned_coverage=100;aligned_identity=100;aligned_coverage=100;aligned_identity=100;score=3630;target_length=726
contig1 est_gff:est2genome match_part 20468 21193 3630 + . ID=contig1:hsp:1166:3.12.0.0;Parent=contig1:hit:1051:3.12.0.0;Target=jgi|test_1|140804|est 1 726 +;Gap=M726

RunC:

contig1 maker gene 20468 21193 . + . ID=snap_masked-contig1-processed-gene-0.5;Name=snap_masked-contig1-processed-gene-0.5
contig1 maker mRNA 20468 21193 . + . ID=snap_masked-contig1-processed-gene-0.5-mRNA-1;Parent=snap_masked-contig1-processed-gene-0.5;Name=snap_masked-contig1-processed-gene-0.5-mRNA-1;_AED=0.30;_eAED=0.30;_QI=0|-1|0|1|-1|1|1|0|241;_merge_warning=1
contig1 maker exon 20468 21193 . + . ID=snap_masked-contig1-processed-gene-0.5-mRNA-1:1;Parent=snap_masked-contig1-processed-gene-0.5-mRNA-1
contig1 maker CDS 20468 21193 . + 0 ID=snap_masked-contig1-processed-gene-0.5-mRNA-1:cds;Parent=snap_masked-contig1-processed-gene-0.5-mRNA-1
contig1 snap_masked match 20468 21193 42.956 + . ID=contig1:hit:5240:4.5.0.0;Name=snap_masked-contig1-abinit-gene-0.5-mRNA-1;target_length=4075195
contig1 snap_masked match_part 20468 21193 42.956 + . ID=contig1:hsp:12911:4.5.0.0;Parent=contig1:hit:5240:4.5.0.0;Target=snap_masked-contig1-abinit-gene-0.5-mRNA-1 1 726 +;Gap=M726
contig1 est_gff:est2genome expressed_sequence_match 20468 21193 3630 + . ID=contig1:hit:1051:3.12.0.0;Name=jgi|test_1|140804|est;target_length=726;aligned_coverage=100;aligned_identity=100;aligned_coverage=100;aligned_identity=100;score=3630;target_length=726
contig1 est_gff:est2genome match_part 20468 21193 3630 + . ID=contig1:hsp:1166:3.12.0.0;Parent=contig1:hit:1051:3.12.0.0;Target=jgi|test_1|140804|est 1 726 +;Gap=M726

Could anyone please shed some light on this?

Many thanks in advance.

Urmi

From urmi208 at gmail.com Wed Mar 21 04:24:32 2018
From: urmi208 at gmail.com (Urmi)
Date: Wed, 21 Mar 2018 09:24:32 +0000
Subject: [maker-devel] Gene loss in subsequent round of maker for fungal genome annotation
In-Reply-To:
References:
Message-ID:

Further to this, I ran InterProScan on the results of all three runs, and 100% of the genes in each of them have protein domains.
I am confused about which one I should consider the best annotation. I am sorry for so many questions, but I am very new to maker.

Thanks again for any help you could provide.

--
"The only way of finding the limits of the possible is by going beyond them into the impossible." - Arthur C. Clarke
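The per-run figures in the message above (total genes versus genes with AED < 0.5) can be recomputed from a merged MAKER GFF3 roughly as follows. This is a minimal sketch and not part of MAKER: the function name `count_genes_by_aed` is hypothetical, it assumes that `maker` gene and mRNA records look like the examples quoted in the message (an `_AED=` attribute on each mRNA and a `Parent=` pointing at the gene), and it treats a gene as passing when at least one of its mRNAs has an AED below the cutoff.

```python
import re
from typing import Iterable, Optional, Tuple


def _attr(attrs: str, key: str) -> Optional[str]:
    """Return the value of one GFF3 attribute (e.g. ID, Parent, _AED), or None."""
    m = re.search(r"(?:^|;)" + re.escape(key) + r"=([^;]*)", attrs)
    return m.group(1) if m else None


def count_genes_by_aed(gff_lines: Iterable[str], cutoff: float = 0.5) -> Tuple[int, int]:
    """Return (total maker genes, genes with at least one mRNA whose _AED < cutoff)."""
    genes = set()
    passing = set()
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        # GFF3 has nine columns; the ninth (attributes) may itself contain spaces.
        fields = line.rstrip("\n").split(None, 8)
        if len(fields) < 9 or fields[1] != "maker":
            continue  # skips evidence features and any embedded FASTA section
        feature_type, attrs = fields[2], fields[8]
        if feature_type == "gene":
            gene_id = _attr(attrs, "ID")
            if gene_id:
                genes.add(gene_id)
        elif feature_type == "mRNA":
            parent = _attr(attrs, "Parent")
            aed = _attr(attrs, "_AED")
            if parent and aed is not None and float(aed) < cutoff:
                passing.add(parent)
    return len(genes), len(passing)
```

For instance, `count_genes_by_aed(open("runB.all.gff"))` would return the two numbers for one run, where `runB.all.gff` is a placeholder for the merged GFF3 of that run.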
From carsonhh at gmail.com Fri Mar 23 12:20:22 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 23 Mar 2018 11:20:22 -0600
Subject: [maker-devel] clarification on creating a standard build
In-Reply-To:
References:
Message-ID: <4D9CD374-D142-4387-A4D7-1FB7A8A74F48@gmail.com>

You will get the ID from the InterProScan report. Then you can take that ID and grep for it in the MAKER gff3.

All ab initio predictions have match/match_part features created for them in the gff3. You can then take those non-gene match/match_part features and provide them to pred_gff.

You then have two alternate ways to get those models into your dataset.

1. Do a second run with only that pred_gff (i.e. turn off all other MAKER options and blank out all evidence including repeat masking options) and set keep_preds=1.

That will simply take the pred_gff match/match_part values and turn them into nicely formatted gene/mRNA/exon/CDS features together with associated fasta files. Those can then simply be merged into your current result using GFF3 merge.

2. Provide maker_gff (set pred_pass=0 and all other pass options to 1), provide pred_gff, and set keep_preds=1.

This is the same as the previous run option, but MAKER will do the merging for you. But it will take longer since it will use the maker_gff to rebuild all models and evidence in memory and rescore everything.

--Carson

From carsonhh at gmail.com Fri Mar 23 12:28:50 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 23 Mar 2018 11:28:50 -0600
Subject: [maker-devel] Gene loss in subsequent round of maker for fungal genome annotation
In-Reply-To:
References:
Message-ID: <7BA23B5D-01FE-4766-B7B8-756E0385453D@gmail.com>

Run A --> no gene prediction, just cut and paste of transcript/protein alignments to generate rough models.
Run B --> Gene predictions based on training using only a highly conserved subset of genes (you will have low sensitivity).
Run C --> Gene predictions based on training using a broader gene set. Higher sensitivity but potentially lower specificity (sensitivity gains should outweigh any specificity loss).

Finally, make sure you look at models in a browser to see how well evidence and models overlap. If gene fusion is an issue (falsely merged mRNA-seq assembly results will generate hints that can cause gene predictors to fuse gene models), try deFusion --> https://wjidea.github.io/defusion/installation.html

--Carson

From urmi208 at gmail.com Mon Mar 26 02:28:21 2018
From: urmi208 at gmail.com (Urmi)
Date: Mon, 26 Mar 2018 08:28:21 +0100
Subject: [maker-devel] Gene loss in subsequent round of maker for fungal genome annotation
In-Reply-To: <7BA23B5D-01FE-4766-B7B8-756E0385453D@gmail.com>
References: <7BA23B5D-01FE-4766-B7B8-756E0385453D@gmail.com>
Message-ID:

That's great! Thanks for the tips Carson.

Urmi
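The selection step Carson describes in the "clarification on creating a standard build" thread (take the IDs from the InterProScan report, pull the matching match/match_part features out of the MAKER GFF3, and hand them to pred_gff) could be scripted along these lines. This is only a sketch under stated assumptions, not a MAKER utility: all three function names are hypothetical, the first TSV column is assumed to be the protein ID, and the `processed-gene` to `abinit-gene` rewrite reflects the naming difference Valerie reports in her follow-up in this thread, which is not confirmed here.

```python
import re
from typing import Iterable, Iterator, Optional, Set


def ids_from_interproscan_tsv(tsv_lines: Iterable[str], require_ipr: bool = False) -> Set[str]:
    """Collect protein IDs (first column) from InterProScan TSV rows.

    With require_ipr=True, keep only rows that carry an InterPro accession
    such as IPR001234 somewhere on the line.
    """
    ids = set()
    for line in tsv_lines:
        if not line.strip():
            continue
        if require_ipr and not re.search(r"\bIPR\d{6}\b", line):
            continue
        ids.add(line.split(None, 1)[0])
    return ids


def to_abinit_name(protein_id: str) -> str:
    """Assumption: FASTA/TSV IDs say 'processed-gene' where the GFF3 says 'abinit-gene'."""
    return protein_id.replace("-processed-gene-", "-abinit-gene-")


def _attr(attrs: str, key: str) -> Optional[str]:
    """Return the value of one GFF3 attribute, or None if it is absent."""
    m = re.search(r"(?:^|;)" + re.escape(key) + r"=([^;]*)", attrs)
    return m.group(1) if m else None


def select_pred_features(gff_lines: Iterable[str], wanted_names: Set[str]) -> Iterator[str]:
    """Yield match/match_part lines whose Name or Target is in wanted_names."""
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split(None, 8)  # ninth column may contain spaces
        if len(fields) < 9:
            continue
        feature_type, attrs = fields[2], fields[8]
        if feature_type == "match":
            name = _attr(attrs, "Name")
        elif feature_type == "match_part":
            target = _attr(attrs, "Target")  # e.g. "some-mRNA-1 1 726 +"
            name = target.split()[0] if target else None
        else:
            continue
        if name in wanted_names:
            yield line
```

The yielded lines would then be written to a file and supplied through pred_gff, as in either of the two options Carson lists. It is worth checking a few IDs by hand before relying on the rewrite in `to_abinit_name`.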
URL: From vsoza at uw.edu Mon Mar 26 13:49:24 2018 From: vsoza at uw.edu (Valerie Soza) Date: Mon, 26 Mar 2018 11:49:24 -0700 Subject: [maker-devel] clarification on creating a standard build In-Reply-To: <4D9CD374-D142-4387-A4D7-1FB7A8A74F48@gmail.com> References: <4D9CD374-D142-4387-A4D7-1FB7A8A74F48@gmail.com> Message-ID: <57B30565-1603-4723-AF74-FEB54F735899@uw.edu> Hi Carson Thanks for the clarification on steps, but I am having issues with the first step of grepping the ID from the InterProScan .tsv file in my .gff file. I am getting no results of these IDs from the .gff file. I think the IDs used in the maker.non_overlapping_ab_initio.proteins.fasta are different from what is actually used in the .gff file. Please see my commands/results below. I created the .gff file by this command: gff3_merge -d Rwill7_master_datastore_index.log I created the .fasta files by this command: fasta_merge -d Rwill7_master_datastore_index.log I ran InterProScan with this command: interproscan.sh -appl PfamA -iprlookup -goterms -f tsv -i Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta When I try grepping the IDs in the Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv from Rwill7.all.gff, I get nothing. See an example for one ID from the .tsv file below: $ more Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1 d146190e642a740520c9 7a782a74fe32 356 Pfam PF13365 Trypsin-like peptidase domain 77 2281.4E-17 T 20-03-2018 $ grep snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1 Rwill7.all.gff #no results There is no "processed-gene" with this ID in the Rwill7.all.gff file: $ grep snap_masked-LG12_ordered_scaffold_85-processed-gene Rwill7.all.gff LG12_ordered_scaffold_85 maker gene 63727 67053 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8;Name=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8 LG12_ordered_scaffold_85 maker mRNA 63727 67053 . + . 
ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8;Name=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1;_AED=0.81;_eAED=0.81;_QI=0|0|0|0|1|1|6|0|200 LG12_ordered_scaffold_85 maker exon 63727 63768 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42245;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 LG12_ordered_scaffold_85 maker exon 64269 64340 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42246;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 LG12_ordered_scaffold_85 maker exon 64896 65000 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42247;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 LG12_ordered_scaffold_85 maker exon 65268 65327 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42248;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 LG12_ordered_scaffold_85 maker exon 65716 65915 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42249;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 LG12_ordered_scaffold_85 maker exon 66930 67053 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42250;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 LG12_ordered_scaffold_85 maker CDS 63727 63768 . + 0 ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 LG12_ordered_scaffold_85 maker CDS 64269 64340 . + 0 ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 LG12_ordered_scaffold_85 maker CDS 64896 65000 . 
+ 0 ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 LG12_ordered_scaffold_85 maker CDS 65268 65327 . + 0 ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 LG12_ordered_scaffold_85 maker CDS 65716 65915 . + 0 ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 LG12_ordered_scaffold_85 maker CDS 66930 67053 . + 1 ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 However, there are Names and Targets called "abinit-gene?, not "processed-gene?, that appear to have this gene number in the .gff file: $ grep snap_masked-LG12_ordered_scaffold_85 Rwill7.all.gff #some results from command above that might be snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1? LG12_ordered_scaffold_85 snap_masked match 101798 108141 35.366 + ID=LG12_ordered_scaffold_85:hit:469677:4.5.0.0;Name=snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1 LG12_ordered_scaffold_85 snap_masked match_part 101798 102633 35.236 ID=LG12_ordered_scaffold_85:hsp:1369290:4.5.0.0;Parent=LG12_ordered_scaffold_85:hit:469677:4.5.0.0;Target=snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1 1 836 +;Gap=M836 LG12_ordered_scaffold_85 snap_masked match_part 107907 108141 0.130 ID=LG12_ordered_scaffold_85:hsp:1369291:4.5.0.0;Parent=LG12_ordered_scaffold_85:hit:469677:4.5.0.0;Target=snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1 837 1071 +;Gap=M235 So I looked at the maker.non_overlapping_ab_initio.proteins.fasta file to see if this ?abinit-gene? 
from the .gff file was present: $ grep snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1 Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta #no results using the ?abinit-gene? Name from the .gff file versus using the ID from the .tsv file to grep against the maker.non_overlapping_ab_initio.proteins.fasta file: $ grep snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1 Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta >snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1 protein AED:1.00 eAED:1.00 QI:0|0|0|0|1|1|2|0|356 I think the issue is that the Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta is calling these ?abinit-genes? in the .gff file as ?processed-genes? in the .fasta file, which is then propagated into the .tsv file. Is my interpretation correct? If so, all I would have to do is grep the .gff file replacing ?processed? with ?abinit?, correct? Thanks for your help. -Valerie > On Mar 23, 2018, at 10:20 AM, Carson Holt wrote: > > You will get the ID from the IntrepProscan report. Then you can take that ID and grep for it in the MAKER gff3. > > All ab initio predictions have match/match_part features created for them the gff3. You can then take those non-gene match/match_part features and provide them to pred_gff. > > You then have two alternate ways to get those models into your dataset. > > 1. Do a second run with only that pred_gff (i.e. turn off all other MAKER options and blank out all evidence including repeat masking options) and set keep_preds=1. > > That will simply take the pred_gff match/match_part values and turn them into a nicely formatted gene/mRNA/exon/CDS features together with associated fasta files. Those can then simply be merged into your current result using GFF3 merge. > > 2. Provide maker_gff (set pred_pass=0 and all other pass options to 1), provide pred_gff, and set keep_preds=1. > > This is the same as the previous run option, but MAKER will do the merging for you. 
But it will take longer since it will use the maker_gff to rebuild all models and evidence in memory and rescore everything. > > ?Carson > > > >> On Mar 20, 2018, at 6:48 PM, Valerie Soza wrote: >> >> Hi MAKER community >> >> I am trying to create a standard build as indicated in the Campbell et al. 2014 papers in Plant Physiology and Current Protocols in Bioinformatics. I was following the protocol as outlined in Current Protocols in Bioinformatics, but then came across this thread in the MAKER google forum: https://groups.google.com/forum/#!searchin/maker-devel/quality_filter%7Csort:date/maker-devel/97aNJkT3bgk/mpL7V5QWAAAJ. >> >> I can?t reply to this original thread, but I am trying to follow Carson?s suggestion for a standard build using this protocol instead now: >> "One note I?d like to make, is that doing a second round with keep_preds=1 is the wrong procedure (only do that if you really want to keep everything - i.e. in some fungi or oomycetes). Rather you should use InterProScan to evaluate the rejected models in the non-overlapping.abinit.proteins.fasta file, then grep the ones that have an IPR domain out of the GFF3 (will be match/match_part features) and then pass them to pred_gff in a separate run (just updates the format to gene/mRNA/exon/CDSwith proper reading frame). You can then merge the resulting GFF3's and fasta files.? >> >> Instead of doing a second round of annotations with keep_preds=1, I am using my original annotations with keep_preds=0. I have used InterProScan on the non-overlapping.abinit.proteins.fasta. I am unclear as to what gff3 file to use to grep for genes with IPR domains from the non-overlapping.abinit.proteins.fasta file. Genes from the non-overlapping.abinit.proteins.fasta file are not in my .all.gff file created by the gff3_merge script. >> >> What gff3 file should I be using to resurrect proteins with IPR domains from the non-overlapping.abinit.proteins.fasta? 
Should I be doing an annotation with keep_preds=1 as well, and resurrecting genes with IPR domains from this gff3?
>>
>> Thanks.
>>
>> -Valerie
>>
>> Valerie Soza, Ph.D.
>> c/o Hall Lab
>> Department of Biology
>> University of Washington
>> Johnson Hall 202A
>> Box 351800
>> Seattle, WA 98195-1800
>> 206-543-6740
>> http://staff.washington.edu/vsoza/
>>
>>
>> _______________________________________________
>> maker-devel mailing list
>> maker-devel at box290.bluehost.com
>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>

Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/

From vsoza at uw.edu Tue Mar 27 11:50:38 2018
From: vsoza at uw.edu (Valerie Soza)
Date: Tue, 27 Mar 2018 09:50:38 -0700
Subject: [maker-devel] how to output masked genome from MAKER
In-Reply-To: <15BF15D1-835F-4E8E-9B52-A958EF19BCD9@gmail.com>
References: <15BF15D1-835F-4E8E-9B52-A958EF19BCD9@gmail.com>
Message-ID: <7264C880-3806-443D-9CC2-E70D4366CD8A@uw.edu>

Hi Carson

Thanks, that is simple and it worked. I did the following to sort and concatenate the query.masked.fasta files into one fasta:

$ find Rwill7.maker.output -name 'query.masked.fasta' | sort -t "/" -k 5 | xargs cat > Rwill7.maker.assembly_masked.sorted.fasta

-Valerie

> On Mar 15, 2018, at 8:31 AM, Carson Holt wrote:
>
> You will just have to find and concatenate the files yourself.
>
> Something like --> find assembly.maker.output -name 'query.masked.fasta' | xargs cat > assembly_masked.fasta
>
> --Carson
>
>
>> On Mar 7, 2018, at 2:19 PM, Valerie Soza wrote:
>>
>> Hi MAKER community
>>
>> I am wondering whether it is possible to get the entire masked genome generated by MAKER as an output. I think I have found it in pieces in the query.masked.fasta files generated for each scaffold by MAKER in theVoid directories. Is there a script that anyone has used to collate these files?
>>
>> I am aware of the fast_merge script that comes bundled with MAKER, but it does not seem to collate the query.masked.fasta files. Could this perl script be modified to do this action?
>>
>> Thanks for any help or insights.
>>
>> -Valerie
>>
>> Valerie Soza, Ph.D.
>> c/o Hall Lab
>> Department of Biology
>> University of Washington
>> Johnson Hall 202A
>> Box 351800
>> Seattle, WA 98195-1800
>> 206-543-6740
>> http://staff.washington.edu/vsoza/
>>
>>
>> _______________________________________________
>> maker-devel mailing list
>> maker-devel at box290.bluehost.com
>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>

Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/

From vsoza at uw.edu Thu Mar 29 13:42:28 2018
From: vsoza at uw.edu (Valerie Soza)
Date: Thu, 29 Mar 2018 11:42:28 -0700
Subject: [maker-devel] clarification on creating a standard build
In-Reply-To: <57B30565-1603-4723-AF74-FEB54F735899@uw.edu>
References: <4D9CD374-D142-4387-A4D7-1FB7A8A74F48@gmail.com> <57B30565-1603-4723-AF74-FEB54F735899@uw.edu>
Message-ID: <624D4D7C-2BB3-4A1D-9088-116807492D2E@uw.edu>

Hi MAKER community,

I was having issues grepping IDs from the InterProScan .tsv file against my all.gff file because the abinit genes in the all.gff file are called processed genes in the .all.maker.non_overlapping_ab_initio.proteins.fasta file, which gets propagated into the .tsv file. I solved this issue by replacing "processed" with "abinit" in the .tsv file and then grepping the all.gff file with these IDs to create pred_gff.

sed s/\-processed\-/\-abinit\-/g Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv > Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed

Then I extracted only the IDs from the .tsv file to grep against the all.gff file.

cut -f 1 Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed > Rwill7.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs

I was having non-unique top level ID errors when I tried to run maker with pred_gff, and I realized that IDs were duplicated in the .tsv file. So I went back to my list of IDs and extracted only the unique IDs to grep.

sort Rwill7.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs | uniq > Rwill7.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs

Then I redid my grepping and maker ran beautifully. I tried both the options recommended by Carson, but then wound up going with option #2 cause I am lazy and now I have a standard build :)

-Valerie

> On Mar 26, 2018, at 11:49 AM, Valerie Soza wrote:
>
> Hi Carson
>
> Thanks for the clarification on steps, but I am having issues with the first step of grepping the ID from the InterProScan .tsv file in my .gff file. I am getting no results of these IDs from the .gff file. I think the IDs used in the maker.non_overlapping_ab_initio.proteins.fasta are different from what is actually used in the .gff file. Please see my commands/results below.
>
> I created the .gff file by this command:
> gff3_merge -d Rwill7_master_datastore_index.log
>
> I created the .fasta files by this command:
> fasta_merge -d Rwill7_master_datastore_index.log
>
> I ran InterProScan with this command:
> interproscan.sh -appl PfamA -iprlookup -goterms -f tsv -i Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta
>
> When I try grepping the IDs in the Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv from Rwill7.all.gff, I get nothing.
See an example for one ID from the .tsv file below: > > $ more Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv > > snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1 d146190e642a740520c9 > 7a782a74fe32 356 Pfam PF13365 Trypsin-like peptidase domain 77 2281.4E-17 T 20-03-2018 > > $ grep snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1 Rwill7.all.gff > #no results > > There is no "processed-gene" with this ID in the Rwill7.all.gff file: > > $ grep snap_masked-LG12_ordered_scaffold_85-processed-gene Rwill7.all.gff > > LG12_ordered_scaffold_85 maker gene 63727 67053 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8;Name=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8 > LG12_ordered_scaffold_85 maker mRNA 63727 67053 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8;Name=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1;_AED=0.81;_eAED=0.81;_QI=0|0|0|0|1|1|6|0|200 > LG12_ordered_scaffold_85 maker exon 63727 63768 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42245;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 > LG12_ordered_scaffold_85 maker exon 64269 64340 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42246;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 > LG12_ordered_scaffold_85 maker exon 64896 65000 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42247;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 > LG12_ordered_scaffold_85 maker exon 65268 65327 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42248;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 > LG12_ordered_scaffold_85 maker exon 65716 65915 . + . 
ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42249;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 > LG12_ordered_scaffold_85 maker exon 66930 67053 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42250;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 > LG12_ordered_scaffold_85 maker CDS 63727 63768 . + 0 ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 > LG12_ordered_scaffold_85 maker CDS 64269 64340 . + 0 ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 > LG12_ordered_scaffold_85 maker CDS 64896 65000 . + 0 ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 > LG12_ordered_scaffold_85 maker CDS 65268 65327 . + 0 ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 > LG12_ordered_scaffold_85 maker CDS 65716 65915 . + 0 ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 > LG12_ordered_scaffold_85 maker CDS 66930 67053 . + 1 ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 > > However, there are Names and Targets called "abinit-gene?, not "processed-gene?, that appear to have this gene number in the .gff file: > > $ grep snap_masked-LG12_ordered_scaffold_85 Rwill7.all.gff > > #some results from command above that might be snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1? 
> > LG12_ordered_scaffold_85 snap_masked match 101798 108141 35.366 + ID=LG12_ordered_scaffold_85:hit:469677:4.5.0.0;Name=snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1 > LG12_ordered_scaffold_85 snap_masked match_part 101798 102633 35.236 ID=LG12_ordered_scaffold_85:hsp:1369290:4.5.0.0;Parent=LG12_ordered_scaffold_85:hit:469677:4.5.0.0;Target=snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1 1 836 +;Gap=M836 > LG12_ordered_scaffold_85 snap_masked match_part 107907 108141 0.130 ID=LG12_ordered_scaffold_85:hsp:1369291:4.5.0.0;Parent=LG12_ordered_scaffold_85:hit:469677:4.5.0.0;Target=snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1 837 1071 +;Gap=M235 > > So I looked at the maker.non_overlapping_ab_initio.proteins.fasta file to see if this ?abinit-gene? from the .gff file was present: > > $ grep snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1 Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta > #no results using the ?abinit-gene? Name from the .gff file > > versus using the ID from the .tsv file to grep against the maker.non_overlapping_ab_initio.proteins.fasta file: > > $ grep snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1 Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta >> snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1 protein AED:1.00 eAED:1.00 QI:0|0|0|0|1|1|2|0|356 > > I think the issue is that the Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta is calling these ?abinit-genes? in the .gff file as ?processed-genes? in the .fasta file, which is then propagated into the .tsv file. Is my interpretation correct? > > If so, all I would have to do is grep the .gff file replacing ?processed? with ?abinit?, correct? > > Thanks for your help. > > -Valerie > >> On Mar 23, 2018, at 10:20 AM, Carson Holt wrote: >> >> You will get the ID from the IntrepProscan report. Then you can take that ID and grep for it in the MAKER gff3. 
>> >> All ab initio predictions have match/match_part features created for them the gff3. You can then take those non-gene match/match_part features and provide them to pred_gff. >> >> You then have two alternate ways to get those models into your dataset. >> >> 1. Do a second run with only that pred_gff (i.e. turn off all other MAKER options and blank out all evidence including repeat masking options) and set keep_preds=1. >> >> That will simply take the pred_gff match/match_part values and turn them into a nicely formatted gene/mRNA/exon/CDS features together with associated fasta files. Those can then simply be merged into your current result using GFF3 merge. >> >> 2. Provide maker_gff (set pred_pass=0 and all other pass options to 1), provide pred_gff, and set keep_preds=1. >> >> This is the same as the previous run option, but MAKER will do the merging for you. But it will take longer since it will use the maker_gff to rebuild all models and evidence in memory and rescore everything. >> >> ?Carson >> >> >> >>> On Mar 20, 2018, at 6:48 PM, Valerie Soza wrote: >>> >>> Hi MAKER community >>> >>> I am trying to create a standard build as indicated in the Campbell et al. 2014 papers in Plant Physiology and Current Protocols in Bioinformatics. I was following the protocol as outlined in Current Protocols in Bioinformatics, but then came across this thread in the MAKER google forum: https://groups.google.com/forum/#!searchin/maker-devel/quality_filter%7Csort:date/maker-devel/97aNJkT3bgk/mpL7V5QWAAAJ. >>> >>> I can?t reply to this original thread, but I am trying to follow Carson?s suggestion for a standard build using this protocol instead now: >>> "One note I?d like to make, is that doing a second round with keep_preds=1 is the wrong procedure (only do that if you really want to keep everything - i.e. in some fungi or oomycetes). 
Rather you should use InterProScan to evaluate the rejected models in the non-overlapping.abinit.proteins.fasta file, then grep the ones that have an IPR domain out of the GFF3 (will be match/match_part features) and then pass them to pred_gff in a separate run (just updates the format to gene/mRNA/exon/CDSwith proper reading frame). You can then merge the resulting GFF3's and fasta files.? >>> >>> Instead of doing a second round of annotations with keep_preds=1, I am using my original annotations with keep_preds=0. I have used InterProScan on the non-overlapping.abinit.proteins.fasta. I am unclear as to what gff3 file to use to grep for genes with IPR domains from the non-overlapping.abinit.proteins.fasta file. Genes from the non-overlapping.abinit.proteins.fasta file are not in my .all.gff file created by the gff3_merge script. >>> >>> What gff3 file should I be using to resurrect proteins with IPR domains from the non-overlapping.abinit.proteins.fasta? Should I be doing an annotation with keep_preds=1 as well, and resurrecting genes with IPR domains from this gff3? >>> >>> Thanks. >>> >>> -Valerie >>> >>> Valerie Soza, Ph.D. >>> c/o Hall Lab >>> Department of Biology >>> University of Washington >>> Johnson Hall 202A >>> Box 351800 >>> Seattle, WA 98195-1800 >>> 206-543-6740 >>> http://staff.washington.edu/vsoza/ >>> >>> >>> _______________________________________________ >>> maker-devel mailing list >>> maker-devel at box290.bluehost.com >>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org >> > > Valerie Soza, Ph.D. > c/o Hall Lab > Department of Biology > University of Washington > Johnson Hall 202A > Box 351800 > Seattle, WA 98195-1800 > 206-543-6740 > http://staff.washington.edu/vsoza/ > Valerie Soza, Ph.D. 
c/o Hall Lab Department of Biology University of Washington Johnson Hall 202A Box 351800 Seattle, WA 98195-1800 206-543-6740 http://staff.washington.edu/vsoza/
From d.ence at ufl.edu Wed Mar 14 05:33:01 2018 From: d.ence at ufl.edu (Ence,daniel) Date: Wed, 14 Mar 2018 11:33:01 +0000 Subject: [maker-devel] Some transcripts have no AED? In-Reply-To: <59a852e2.418b6.16222a46d59.Coremail.wangzhennan@ioz.ac.cn> References: <59a852e2.418b6.16222a46d59.Coremail.wangzhennan@ioz.ac.cn> Message-ID: <00243144-1906-485F-B2CB-977FA2BBD161@ufl.edu> Hi, can you send a few lines of examples? Do some transcripts do have AEDs? ~Daniel Sent from my iPhone On Mar 13, 2018, at 23:54, "wangzhennan at ioz.ac.cn" > wrote: Hi, When I used maker to get the AED value, some transcripts have no AED values in the result. Why? I used the RNA-Seq data and protein to run maker.Thank you. Best wishes. Wang T _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.com https://urldefense.proofpoint.com/v2/url?u=http-3A__box290.bluehost.com_mailman_listinfo_maker-2Ddevel-5Fyandell-2Dlab.org&d=DwICAg&c=pZJPUDQ3SB9JplYbifm4nt2lEVG5pWx2KikqINpWlZM&r=12jzlNvGVD0AlPJ4E7cTlw1Dvu6n9cb4kMCobJ28XPs&m=iQmWCVSvETCb_7VoClUCgmdyk2786LmVRNPvJUlyfbU&s=R9lycx5lwMU4QaJuAZevpAdW8qF801qKJl98hBHq4IQ&e= -------------- next part -------------- An HTML attachment was scrubbed... From seoanezonjic at hotmail.com Wed Mar 14 07:52:12 2018 From: seoanezonjic at hotmail.com (p sz) Date: Wed, 14 Mar 2018 13:52:12 +0000 Subject: [maker-devel] Problems with failed contigs In-Reply-To: References: Message-ID: Hi I have tried to split the fasta file in small chunks and perform a execution for each chunk. Most of executions fail as I have described previously. I have inspect the results of a random chunk with 287 contigs and there are these error lines repeated 47 times: substr outside of string at /mnt/home/users/pab_001_uma/pedro/software/maker/bin/../lib/PhatHit_utils.pm line 850.
--> rank=15, hostname=dx095
ERROR: Failed while annotating transcripts
ERROR: Chunk failed at level:1, tier_type:4
FAILED CONTIG:Sosen1_s1284
ERROR: Chunk failed at level:6, tier_type:0
FAILED CONTIG:Sosen1_s1284

Can you help me to fix this problem?

Thank you in advance
Pedro Seoane

________________________________
From: p sz
Sent: Tuesday, March 6, 2018 9:30
To: maker-devel at yandell-lab.org
Subject: Problems with failed contigs

Hi
Currently i'm annotating a fish genome using the Maker 3 version and the execution don't finish properly. I execute Maker in a shared cluster trough MPI, asking for 96 cores (in order to not block the maker manager nor the file system) in 6 nodes of 16 cores each one. The work run 3 or 4 days and the output suddenly stops. The work remains 3 days more, but the index datastore is not modified and the only files that their attributes change are the lock files. When the job exceed the working time (7 days) it is cancelled. Then, I inspect the datastore index and count the uniq started and finished tags for each scaffold. The results are:
STARTED:3890
FINISHED:3378
So, there are about 500 scaffolds in which the analysis fails. When I inspect the STDERR maker output I see this error line repeated ~2000 times, at differents sites:
substr outside of string at /mnt/home/users/pab_001_uma/pedro/software/maker/bin/../lib/PhatHit_utils.pm line 850
and near this line, the following:
ERROR: Failed while annotating transcripts
My questions are, what can I do to annotate the failed scaffolds? and in addition, is the great amount of failed analysis the reason of the job freeze?
Thanks in advance
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From vsoza at uw.edu Wed Mar 14 18:21:26 2018 From: vsoza at uw.edu (Valerie Soza) Date: Wed, 14 Mar 2018 17:21:26 -0700 Subject: [maker-devel] scaffolds missing from master_datastore_index.log and all.gff files Message-ID: <6F2917B6-12D5-41D6-BD12-CDD7FDF0F5C8@uw.edu> Hi MAKER community I have done several rounds of training and annotations on an assembly that consists of 12027 scaffolds using the MAKER2 pipeline. I am running multiple instances of MAKER to speed up the process. I have noticed that the number of contigs (aka scaffolds) differs among the different rounds of annotations I have done in MAKER, ranging from 12024 to 12027 scaffolds. These counts were obtained with the SOBAcl tool to count the number of contigs from each all.gff file generated by the gff3_merge script included in MAKER. In my latest round of annotations within MAKER, I have only obtained 12026 scaffolds using the SOBAcl tool on the all.gff file, indicating that I am missing 1 scaffold, even though there was no indication of any scaffolds as FAILED, RETRY, or SKIPPED. To figure out what might be going on, I searched for STARTED and FINISHED scaffolds in the master_datastore_index.log and found that I had a different number of started vs. finished scaffolds, and none of these were equal to the total in the assembly of 12027. $ grep STARTED Rwill7_master_datastore_index.log | sort | uniq | cut -f 1 | wc 12024 12024 313247 3 started scaffolds missing from this file are LG01_ordered_scaffold_2, LG01_ordered_scaffold_3, and LG07_unordered_scaffold_86. $ grep FINISHED Rwill7_master_datastore_index.log | sort | uniq | cut -f 1 | wc 12026 12026 313295 1 finished scaffold missing from this file is LG08_unordered_scaffold_90. I then searched for these scaffolds in the all.gff file and found that the 3 missing started scaffolds were present, but the one missing finished scaffold (LG08_unordered_scaffold_90) was not. 
This scaffold (LG08_unordered_scaffold_90) should be in the gff3 as it had some repeat masking done on it as indicated by the query.masked.fasta file for this scaffold in its theVoid directory. After looking at the gff3_merge and fasta_merge scripts, it seems that only finished scaffolds are used to generate gff3 and fasta files so this explains why I am missing one scaffold (LG08_unordered_scaffold_90) for a total of only 12026 scaffolds in the all.gff file. I am concerned that because the started and finished scaffolds are different in the master_datastore_index.log, that not all scaffolds are being output to the gff3 and fasta files generated by the MAKER scripts. Any insights as to why I am getting a different numbers of scaffolds indicated as started versus finished? and as to why all but 1 scaffold finished? Thanks. -Valerie Valerie Soza, Ph.D. c/o Hall Lab Department of Biology University of Washington Johnson Hall 202A Box 351800 Seattle, WA 98195-1800 206-543-6740 http://staff.washington.edu/vsoza/ From d.ence at ufl.edu Thu Mar 15 07:15:00 2018 From: d.ence at ufl.edu (Ence,daniel) Date: Thu, 15 Mar 2018 13:15:00 +0000 Subject: [maker-devel] Fwd: Some transcripts have no AED? References: <8E389D32-28DE-4FE7-A455-15786C1B2EAF@mail.ufl.edu> Message-ID: <8AE0B9E6-2496-43AE-BC79-9FDDD2A0CFD3@mail.ufl.edu> Begin forwarded message: From: "Ence,daniel" > Subject: Re: [maker-devel] Some transcripts have no AED? Date: March 15, 2018 at 9:06:45 AM EDT To: "wangzhennan at ioz.ac.cn" > Cc: "Ence,daniel" > Hi, I really can?t help you without more information to answer either of your questions (number of genes or the transcripts without AEDs). If you ran maker with some evidence to annotation those 23000 genes, then some of those genes were probably not supported by the evidence. Did you give those 23000 genes as predictions or as models? Those are two different options in maker and would give different results. 
~Daniel

On Mar 15, 2018, at 4:11 AM, wangzhennan at ioz.ac.cn wrote:

Hi,
I am so sorry that I can not expound my question for you. I run maker with a annotation file(gff) which has 23000 genes, but I got a result with only 21553 genes. Why there were fewer genes than original file? Thank you very much!
Best wishes.
Wang

-----Original Messages-----
From: "Ence,daniel"
Sent Time: 2018-03-14 19:33:01 (Wednesday)
To: "wangzhennan at ioz.ac.cn"
Cc: "maker-devel at yandell-lab.org"
Subject: Re: [maker-devel] Some transcripts have no AED?

Hi, can you send a few lines of examples? Do some transcripts do have AEDs?

~Daniel

Sent from my iPhone

On Mar 13, 2018, at 23:54, "wangzhennan at ioz.ac.cn" wrote:

Hi,
When I used maker to get the AED value, some transcripts have no AED values in the result. Why? I used the RNA-Seq data and protein to run maker.Thank you.
Best wishes.
Wang
T
_______________________________________________
maker-devel mailing list
maker-devel at box290.bluehost.com
https://urldefense.proofpoint.com/v2/url?u=http-3A__box290.bluehost.com_mailman_listinfo_maker-2Ddevel-5Fyandell-2Dlab.org&d=DwICAg&c=pZJPUDQ3SB9JplYbifm4nt2lEVG5pWx2KikqINpWlZM&r=12jzlNvGVD0AlPJ4E7cTlw1Dvu6n9cb4kMCobJ28XPs&m=iQmWCVSvETCb_7VoClUCgmdyk2786LmVRNPvJUlyfbU&s=R9lycx5lwMU4QaJuAZevpAdW8qF801qKJl98hBHq4IQ&e=
-------------- next part --------------
An HTML attachment was scrubbed...

From carsonhh at gmail.com Thu Mar 15 08:57:37 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 15 Mar 2018 08:57:37 -0600
Subject: [maker-devel] Problems with failed contigs
In-Reply-To:
References:
Message-ID: <2F06777F-4BED-4CC8-B4F6-0B78D3EA9405@gmail.com>

First make sure you are using the most current version (3.01.02). If you are providing a GFF3 file, it may be an issue with one or more of the features you are giving. Finally, there is a possibility this is a BLAST report issue, as there is a BLAST truncation bug that appears to be fixed in NCBI BLAST 2.7.1+.

--Carson

> On Mar 6, 2018, at 2:30 AM, p sz wrote:
>
> Hi
> Currently i'm annotating a fish genome using the Maker 3 version and the execution don't finish properly. I execute Maker in a shared cluster trough MPI, asking for 96 cores (in order to not block the maker manager nor the file system) in 6 nodes of 16 cores each one. The work run 3 or 4 days and the output suddenly stops. The work remains 3 days more, but the index datastore is not modified and the only files that their attributes change are the lock files. When the job exceed the working time (7 days) it is cancelled. Then, I inspect the datastore index and count the uniq started and finished tags for each scaffold. The results are:
> STARTED:3890
> FINISHED:3378
> So, there are about 500 scaffolds in which the analysis fails. When I inspect the STDERR maker output I see this error line repeated ~2000 times, at differents sites:
> substr outside of string at /mnt/home/users/pab_001_uma/pedro/software/maker/bin/../lib/PhatHit_utils.pm line 850
> and near this line, the following:
> ERROR: Failed while annotating transcripts
> My questions are, what can I do to annotate the failed scaffolds? and in addition, is the great amount of failed analysis the reason of the job freeze?
> Thanks in advance
>
>
>
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
-------------- next part --------------
An HTML attachment was scrubbed...
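
The started-versus-finished comparison described in the quoted message above can be sketched as a small shell function. This is only a sketch under stated assumptions: it assumes the datastore index is tab-separated with the scaffold ID in the first column (as the `cut -f 1` usage elsewhere in this thread suggests) and the status tag in the last column. The function name `unfinished_scaffolds` and the example file name are hypothetical and are not part of MAKER.

```shell
# List scaffolds that have a STARTED entry but no FINISHED entry.
# Assumption: tab-separated index, scaffold ID in column 1,
# status tag (STARTED, FINISHED, ...) in the last column.
unfinished_scaffolds() {
    awk -F'\t' '
        $NF == "STARTED"  { started[$1] = 1 }
        $NF == "FINISHED" { finished[$1] = 1 }
        END {
            for (id in started)
                if (!(id in finished)) print id
        }
    ' "$1" | sort
}

# Example call; the file name is a placeholder.
# unfinished_scaffolds genome_master_datastore_index.log > unfinished.txt
```

The resulting list of IDs could then be used to pull just those sequences out of the assembly for a separate rerun, rather than repeating the whole job.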

From carsonhh at gmail.com Thu Mar 15 09:15:09 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 15 Mar 2018 09:15:09 -0600
Subject: [maker-devel] Using PASA assemblies with MAKER
In-Reply-To:
References:
Message-ID: <8542B83E-159A-47CA-A801-364386E0E2D2@gmail.com>

MAKER requires the two-level match/match_part format, an example of which can be found in the GFF3 specification --> https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md

I haven't used PASA, but the file you are providing may be in GFF2 or GTF format, which is not compatible with GFF3.

--Carson

> On Mar 9, 2018, at 9:15 AM, Federico López wrote:
>
> Hello,
>
> I was wondering what might be the recommended option for using PASA2 alignment assemblies with MAKER3:
>
> 1. PASA assemblies in FASTA format (est)
> 2. PASA assembly structures (est_gff)
> 3. ORFs from PASA assemblies (protein)
>
> And related to this question, when I use the PASA2 assembly structures in GFF3 format, MAKER reports the error below.
>
> "ERROR: Non-unique top level ID for..."
>
> I suppose all the non-unique IDs need to be renamed for MAKER?
>
> Any help is greatly appreciated.
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
-------------- next part --------------
An HTML attachment was scrubbed...

From carsonhh at gmail.com Thu Mar 15 09:20:08 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 15 Mar 2018 09:20:08 -0600
Subject: [maker-devel] Some transcripts have no AED?
In-Reply-To: <59a852e2.418b6.16222a46d59.Coremail.wangzhennan@ioz.ac.cn>
References: <59a852e2.418b6.16222a46d59.Coremail.wangzhennan@ioz.ac.cn>
Message-ID:

Do they have no AED, or an AED value of 0? A value of 0 means a perfect match to the evidence. Also make sure you are not looking for AED in the predictions (i.e. GFF3 source column snap/augustus that are of type match/match_part).
Those are rejected models and will not have quality statistics calculated for them (they are only there as alignments for reference purposes). Only models with the source column of 'maker' and a type of 'gene/mRNA/exon/CDS' represent the gene models and will have AED and QI values.

You can force MAKER to also calculate quality statistics for rejected models by altering pred_stats in the maker_opts.ctl file. The line is shown here with its default value of 0; set it to 1 to turn the statistics on.

pred_stats=0 #report AED and QI statistics for all predictions as well as models

--Carson

> On Mar 13, 2018, at 9:53 PM, wangzhennan at ioz.ac.cn wrote:
>
> Hi,
>
> When I used maker to get the AED value, some transcripts have no AED values in the result. Why? I used the RNA-Seq data and protein to run maker. Thank you.
>
> Best wishes.
>
> Wang T

From carsonhh at gmail.com Thu Mar 15 09:26:26 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 15 Mar 2018 09:26:26 -0600
Subject: [maker-devel] scaffolds missing from master_datastore_index.log and all.gff files
In-Reply-To: <6F2917B6-12D5-41D6-BD12-CDD7FDF0F5C8@uw.edu>
References: <6F2917B6-12D5-41D6-BD12-CDD7FDF0F5C8@uw.edu>
Message-ID:

If running multiple jobs of MAKER at the same time, you can hit a race condition where one MAKER run keeps another from making the correct entry into the datastore log.

You can delete the log and then run 'maker -dsindex' to rebuild it with a single maker process (takes less than 5 minutes).

--Carson

> On Mar 14, 2018, at 6:21 PM, Valerie Soza wrote:
>
> Hi MAKER community
>
> I have done several rounds of training and annotations on an assembly that consists of 12027 scaffolds using the MAKER2 pipeline. I am running multiple instances of MAKER to speed up the process.
I have noticed that the number of contigs (aka scaffolds) differs among the different rounds of annotations I have done in MAKER, ranging from 12024 to 12027 scaffolds. These counts were obtained with the SOBAcl tool to count the number of contigs from each all.gff file generated by the gff3_merge script included in MAKER. > > In my latest round of annotations within MAKER, I have only obtained 12026 scaffolds using the SOBAcl tool on the all.gff file, indicating that I am missing 1 scaffold, even though there was no indication of any scaffolds as FAILED, RETRY, or SKIPPED. > > To figure out what might be going on, I searched for STARTED and FINISHED scaffolds in the master_datastore_index.log and found that I had a different number of started vs. finished scaffolds, and none of these were equal to the total in the assembly of 12027. > > $ grep STARTED Rwill7_master_datastore_index.log | sort | uniq | cut -f 1 | wc > 12024 12024 313247 > > 3 started scaffolds missing from this file are LG01_ordered_scaffold_2, LG01_ordered_scaffold_3, and LG07_unordered_scaffold_86. > > $ grep FINISHED Rwill7_master_datastore_index.log | sort | uniq | cut -f 1 | wc > 12026 12026 313295 > > 1 finished scaffold missing from this file is LG08_unordered_scaffold_90. > > I then searched for these scaffolds in the all.gff file and found that the 3 missing started scaffolds were present, but the one missing finished scaffold (LG08_unordered_scaffold_90) was not. This scaffold (LG08_unordered_scaffold_90) should be in the gff3 as it had some repeat masking done on it as indicated by the query.masked.fasta file for this scaffold in its theVoid directory. > > After looking at the gff3_merge and fasta_merge scripts, it seems that only finished scaffolds are used to generate gff3 and fasta files so this explains why I am missing one scaffold (LG08_unordered_scaffold_90) for a total of only 12026 scaffolds in the all.gff file. 
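The STARTED/FINISHED comparison described in the quoted message can also be done in one pass, listing the scaffold names directly instead of only counting them. The following is a minimal sketch, not part of MAKER: it assumes the master_datastore_index.log is tab-separated with the sequence name in the first column and the status tag in the last column, and the function names are hypothetical.

```python
from collections import defaultdict


def statuses_by_sequence(log_lines):
    """Collect the set of status tags logged for each sequence name."""
    statuses = defaultdict(set)
    for line in log_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        # Assumption: first column is the sequence name, last is the status.
        statuses[fields[0]].add(fields[-1])
    return statuses


def compare_started_finished(log_lines):
    """Return (started but never finished, finished but never started)."""
    statuses = statuses_by_sequence(log_lines)
    started = {name for name, tags in statuses.items() if "STARTED" in tags}
    finished = {name for name, tags in statuses.items() if "FINISHED" in tags}
    return sorted(started - finished), sorted(finished - started)


def report(log_path):
    """Print both difference lists for a datastore index log on disk."""
    with open(log_path) as handle:
        unfinished, unstarted = compare_started_finished(handle)
    print("STARTED without FINISHED:", ", ".join(unfinished) or "none")
    print("FINISHED without STARTED:", ", ".join(unstarted) or "none")
```

Calling report("Rwill7_master_datastore_index.log") on the log named in this thread would print the scaffolds on each side of the mismatch. Comparing either set against the sequence names in the assembly FASTA would additionally show scaffolds that never made it into the log at all.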
>
> I am concerned that because the started and finished scaffolds are different in the master_datastore_index.log, not all scaffolds are being output to the gff3 and fasta files generated by the MAKER scripts.
>
> Any insights as to why I am getting different numbers of scaffolds indicated as started versus finished, and as to why all but 1 scaffold finished?
>
> Thanks.
>
> -Valerie
>
> Valerie Soza, Ph.D.
> c/o Hall Lab
> Department of Biology
> University of Washington
> Johnson Hall 202A
> Box 351800
> Seattle, WA 98195-1800
> 206-543-6740
> http://staff.washington.edu/vsoza/

From carsonhh at gmail.com Thu Mar 15 09:31:31 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 15 Mar 2018 09:31:31 -0600
Subject: [maker-devel] how to output masked genome from MAKER
In-Reply-To:
References:
Message-ID: <15BF15D1-835F-4E8E-9B52-A958EF19BCD9@gmail.com>

You will just have to find and concatenate the files yourself. Something like:

find assembly.maker.output -name 'query.masked.fasta' | xargs cat > assembly_masked.fasta

--Carson

> On Mar 7, 2018, at 2:19 PM, Valerie Soza wrote:
>
> Hi MAKER community
>
> I am wondering whether it is possible to get the entire masked genome generated by MAKER as an output. I think I have found it in pieces in the query.masked.fasta files generated for each scaffold by MAKER in theVoid directories. Is there a script that anyone has used to collate these files?
>
> I am aware of the fasta_merge script that comes bundled with MAKER, but it does not seem to collate the query.masked.fasta files. Could this perl script be modified to do this action?
>
> Thanks for any help or insights.
>
> -Valerie
>
> Valerie Soza, Ph.D.
> c/o Hall Lab
> Department of Biology
> University of Washington
> Johnson Hall 202A
> Box 351800
> Seattle, WA 98195-1800
> 206-543-6740
> http://staff.washington.edu/vsoza/

From vsoza at uw.edu Thu Mar 15 12:18:46 2018
From: vsoza at uw.edu (Valerie Soza)
Date: Thu, 15 Mar 2018 11:18:46 -0700
Subject: [maker-devel] scaffolds missing from master_datastore_index.log and all.gff files
In-Reply-To:
References: <6F2917B6-12D5-41D6-BD12-CDD7FDF0F5C8@uw.edu>
Message-ID: <53B85802-8AB9-4DB4-AA8A-137DD72AF4D7@uw.edu>

Thanks, Carson, that worked. I am now getting all scaffolds in the assembly indicated as finished in the master_datastore_index.log, and when I redo gff3_merge with this updated log, my all.gff file is complete too. Glad there is a quick fix for this in MAKER. Thank you, MAKER developers.

-Valerie

Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/

From seoanezonjic at hotmail.com Fri Mar 16 03:33:28 2018
From: seoanezonjic at hotmail.com (p sz)
Date: Fri, 16 Mar 2018 09:33:28 +0000
Subject: [maker-devel] Problems with failed contigs
In-Reply-To: <2F06777F-4BED-4CC8-B4F6-0B78D3EA9405@gmail.com>
References: , <2F06777F-4BED-4CC8-B4F6-0B78D3EA9405@gmail.com>
Message-ID:

I checked the maker version and it is the 3.01.02 that you suggest. I am currently running maker to train SNAP (first round), so I'm not using gff files as input. My input files are all in fasta format, set in the genome, est and protein variables. In fact, one error line says that the transcript analysis breaks the execution. I use BLAST version 2.2.30+; maybe I have to use a newer version. Which version do you suggest?

Thank you in advance

________________________________
From: Carson Holt
Sent: Thursday, March 15, 2018 2:57:37 PM
To: p sz
Cc: maker-devel at yandell-lab.org
Subject: Re: [maker-devel] Problems with failed contigs

First make sure you are using the most current version (3.01.02). If you are providing a GFF3 file, it may be an issue with one or more of the features you are giving.
Finally, there is a possibility this is a BLAST report issue, as there is a BLAST truncation bug that appears to be fixed in NCBI BLAST 2.7.1+

--Carson

On Mar 6, 2018, at 2:30 AM, p sz wrote:

Hi

Currently I'm annotating a fish genome using Maker version 3, and the execution doesn't finish properly. I run Maker on a shared cluster through MPI, asking for 96 cores (in order not to block the maker manager or the file system) on 6 nodes of 16 cores each. The job runs for 3 or 4 days and then the output suddenly stops. The job stays alive for 3 more days, but the datastore index is not modified and the only files whose attributes change are the lock files. When the job exceeds the time limit (7 days) it is cancelled. Then I inspect the datastore index and count the unique started and finished tags for each scaffold. The results are:

STARTED:3890
FINISHED:3378

So there are about 500 scaffolds for which the analysis fails. When I inspect the STDERR output of maker I see this error line repeated ~2000 times, at different places:

substr outside of string at /mnt/home/users/pab_001_uma/pedro/software/maker/bin/../lib/PhatHit_utils.pm line 850

and near this line, the following:

ERROR: Failed while annotating transcripts

My questions are: what can I do to annotate the failed scaffolds? And, in addition, is the large number of failed analyses the reason the job freezes?

Thanks in advance

From vsoza at uw.edu Tue Mar 20 18:48:09 2018
From: vsoza at uw.edu (Valerie Soza)
Date: Tue, 20 Mar 2018 17:48:09 -0700
Subject: [maker-devel] clarification on creating a standard build
Message-ID:

Hi MAKER community

I am trying to create a standard build as indicated in the Campbell et al.
2014 papers in Plant Physiology and Current Protocols in Bioinformatics. I was following the protocol as outlined in Current Protocols in Bioinformatics, but then came across this thread in the MAKER google forum: https://groups.google.com/forum/#!searchin/maker-devel/quality_filter%7Csort:date/maker-devel/97aNJkT3bgk/mpL7V5QWAAAJ.

I can't reply to this original thread, but I am trying to follow Carson's suggestion for a standard build using this protocol instead now:
"One note I'd like to make, is that doing a second round with keep_preds=1 is the wrong procedure (only do that if you really want to keep everything - i.e. in some fungi or oomycetes). Rather you should use InterProScan to evaluate the rejected models in the non-overlapping.abinit.proteins.fasta file, then grep the ones that have an IPR domain out of the GFF3 (will be match/match_part features) and then pass them to pred_gff in a separate run (just updates the format to gene/mRNA/exon/CDS with proper reading frame). You can then merge the resulting GFF3's and fasta files."

Instead of doing a second round of annotations with keep_preds=1, I am using my original annotations with keep_preds=0. I have used InterProScan on the non-overlapping.abinit.proteins.fasta. I am unclear as to what gff3 file to use to grep for genes with IPR domains from the non-overlapping.abinit.proteins.fasta file. Genes from the non-overlapping.abinit.proteins.fasta file are not in my .all.gff file created by the gff3_merge script.

What gff3 file should I be using to resurrect proteins with IPR domains from the non-overlapping.abinit.proteins.fasta? Should I be doing an annotation with keep_preds=1 as well, and resurrecting genes with IPR domains from this gff3?

Thanks.

-Valerie

Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/

From urmi208 at gmail.com Wed Mar 21 03:05:42 2018
From: urmi208 at gmail.com (Urmi)
Date: Wed, 21 Mar 2018 09:05:42 +0000
Subject: [maker-devel] Gene loss in subsequent round of maker for fungal genome annotation
Message-ID:

Hello maker community,

I am trying to run maker 3.01.02-beta on a fungal genome. I am using available EST and protein sequences from a different strain of the same species using parameters "est" and "protein" in the maker_opts.ctl file. Here is the protocol I am using:

1. Run maker with repeat masking and providing transcript and protein sequences from related species (Run A)
2. Create SNAP model with CEGMA
3. Train Augustus with BUSCO
4. Run (run B) with the new SNAP (done at step 2) and augustus species with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (altest_gff=runA_cdna2genome.gff, protein_gff=runA_protein2genome.gff3)
5. Create SNAP model from run B.
6. Train Augustus with transcripts from run B and BUSCO
7. Run (run C) with the new SNAP (done at step 5) and augustus species with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (altest_gff=runA_cdna2genome.gff, protein_gff=runA_protein2genome.gff3), keep_preds=1

As a result of this, I get the following gene numbers:

- run A: 12796 total genes out of which 12771 have AED < 0.5
- run B: 10713 total genes out of which 10701 have AED < 0.5
- run C: 12651 total genes out of which 12582 have AED < 0.5

Looking at the gff files in detail, I observed that there are some gene models in run A which are lost in run B and gained again in run C. I don't understand why there is gene loss for run B. Here is an example:

RunA:
contig1 maker gene 20468 21193 . + . ID=maker-contig1-exonerate_protein2genome-gene-0.34;Name=maker-contig1-exonerate_protein2genome-gene-0.34
contig1 maker mRNA 20468 21193 100 + . ID=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1;Parent=maker-contig1-exonerate_protein2genome-gene-0.34;Name=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1;_AED=0.30;_eAED=0.30;_QI=0|-1|0|1|-1|0|1|0|241
contig1 maker exon 20468 21193 . + . ID=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1:1;Parent=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1
contig1 maker CDS 20468 21193 . + 0 ID=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1:cds;Parent=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1
contig1 blastn expressed_sequence_match 20468 21193 726 + . ID=contig1:hit:983:3.2.0.0;Name=jgi|test_1|140804|est target_length=726
contig1 blastn match_part 20468 21193 726 + . ID=contig1:hsp:998:3.2.0.0;Parent=contig1:hit:983:3.2.0.0;Target=jgi|test_1|140804|est 1 726 +;Gap=M726
contig1 est2genome expressed_sequence_match 20468 21193 3630 + . ID=contig1:hit:1022:3.2.0.0;Name=jgi|test_1|140804|est;target_length=726;aligned_coverage=100;aligned_identity=100
contig1 est2genome match_part 20468 21193 3630 + . ID=contig1:hsp:1110:3.2.0.0;Parent=contig1:hit:1022:3.2.0.0;Target=jgi|test_1|140804|est 1 726 +;Gap=M726

RunB:
contig1 est_gff:est2genome expressed_sequence_match 20468 21193 3630 + . ID=contig1:hit:1051:3.12.0.0;Name=jgi|test_1|140804|est;target_length=726;aligned_coverage=100;aligned_identity=100;aligned_coverage=100;aligned_identity=100;score=3630;target_length=726
contig1 est_gff:est2genome match_part 20468 21193 3630 + . ID=contig1:hsp:1166:3.12.0.0;Parent=contig1:hit:1051:3.12.0.0;Target=jgi|test_1|140804|est 1 726 +;Gap=M726

RunC:
contig1 maker gene 20468 21193 . + . ID=snap_masked-contig1-processed-gene-0.5;Name=snap_masked-contig1-processed-gene-0.5
contig1 maker mRNA 20468 21193 . + . ID=snap_masked-contig1-processed-gene-0.5-mRNA-1;Parent=snap_masked-contig1-processed-gene-0.5;Name=snap_masked-contig1-processed-gene-0.5-mRNA-1;_AED=0.30;_eAED=0.30;_QI=0|-1|0|1|-1|1|1|0|241;_merge_warning=1
contig1 maker exon 20468 21193 . + . ID=snap_masked-contig1-processed-gene-0.5-mRNA-1:1;Parent=snap_masked-contig1-processed-gene-0.5-mRNA-1
contig1 maker CDS 20468 21193 . + 0 ID=snap_masked-contig1-processed-gene-0.5-mRNA-1:cds;Parent=snap_masked-contig1-processed-gene-0.5-mRNA-1
contig1 snap_masked match 20468 21193 42.956 + . ID=contig1:hit:5240:4.5.0.0;Name=snap_masked-contig1-abinit-gene-0.5-mRNA-1;target_length=4075195
contig1 snap_masked match_part 20468 21193 42.956 + . ID=contig1:hsp:12911:4.5.0.0;Parent=contig1:hit:5240:4.5.0.0;Target=snap_masked-contig1-abinit-gene-0.5-mRNA-1 1 726 +;Gap=M726
contig1 est_gff:est2genome expressed_sequence_match 20468 21193 3630 + . ID=contig1:hit:1051:3.12.0.0;Name=jgi|test_1|140804|est;target_length=726;aligned_coverage=100;aligned_identity=100;aligned_coverage=100;aligned_identity=100;score=3630;target_length=726
contig1 est_gff:est2genome match_part 20468 21193 3630 + . ID=contig1:hsp:1166:3.12.0.0;Parent=contig1:hit:1051:3.12.0.0;Target=jgi|test_1|140804|est 1 726 +;Gap=M726

Please could anyone shed some light on this?

Many thanks in advance.

Urmi

From urmi208 at gmail.com Wed Mar 21 03:24:32 2018
From: urmi208 at gmail.com (Urmi)
Date: Wed, 21 Mar 2018 09:24:32 +0000
Subject: [maker-devel] Gene loss in subsequent round of maker for fungal genome annotation
In-Reply-To:
References:
Message-ID:

Further to this, I did run interproscan on all three runs and 100% of the genes from all of them have protein domains found.
I am confused about which one I should consider the best annotation. I am sorry for so many questions but I am very new to maker.

Thanks again for any help you could provide.

--
"The only way of finding the limits of the possible is by going beyond them into the impossible." - Arthur C. Clarke
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From carsonhh at gmail.com Fri Mar 23 11:20:22 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 23 Mar 2018 11:20:22 -0600
Subject: [maker-devel] clarification on creating a standard build
In-Reply-To:
References:
Message-ID: <4D9CD374-D142-4387-A4D7-1FB7A8A74F48@gmail.com>

You will get the ID from the InterProScan report. Then you can take that ID and grep for it in the MAKER gff3. All ab initio predictions have match/match_part features created for them in the gff3. You can then take those non-gene match/match_part features and provide them to pred_gff. You then have two alternate ways to get those models into your dataset.

1. Do a second run with only that pred_gff (i.e. turn off all other MAKER options and blank out all evidence including repeat masking options) and set keep_preds=1. That will simply take the pred_gff match/match_part values and turn them into nicely formatted gene/mRNA/exon/CDS features together with associated fasta files. Those can then simply be merged into your current result using gff3_merge.

2. Provide maker_gff (set pred_pass=0 and all other pass options to 1), provide pred_gff, and set keep_preds=1. This is the same as the previous option, but MAKER will do the merging for you. It will take longer, though, since it will use the maker_gff to rebuild all models and evidence in memory and rescore everything.

--Carson

> On Mar 20, 2018, at 6:48 PM, Valerie Soza wrote:
>
> Hi MAKER community
>
> I am trying to create a standard build as indicated in the Campbell et al. 2014 papers in Plant Physiology and Current Protocols in Bioinformatics. I was following the protocol as outlined in Current Protocols in Bioinformatics, but then came across this thread in the MAKER google forum: https://groups.google.com/forum/#!searchin/maker-devel/quality_filter%7Csort:date/maker-devel/97aNJkT3bgk/mpL7V5QWAAAJ.
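The ID-matching step in Carson's reply (take the protein IDs that received an IPR domain, then pull the corresponding match/match_part features out of the MAKER GFF3 for use as pred_gff) could be sketched as below. This is an illustration under stated assumptions, not a MAKER utility: it assumes InterProScan was run with InterPro lookup enabled so that its TSV output carries the IPR accession in column 12, and that the protein ID in column 1 equals the Name attribute of the ab initio match feature (as in names like snap_masked-contig1-abinit-gene-0.5-mRNA-1 seen in this thread). The function names are hypothetical.

```python
def proteins_with_ipr(interproscan_tsv_lines):
    """Return protein IDs that have at least one InterPro (IPR) accession."""
    hits = set()
    for line in interproscan_tsv_lines:
        cols = line.rstrip("\n").split("\t")
        # Assumption: column 1 is the protein ID, column 12 the IPR accession.
        if len(cols) >= 12 and cols[11].startswith("IPR"):
            hits.add(cols[0])
    return hits


def parse_attributes(column9):
    """Turn a GFF3 attribute string (column 9) into a dict."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key] = value
    return attrs


def select_predictions(gff_lines, wanted_names):
    """Keep match features named in wanted_names plus their match_part children."""
    records = []
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9:
            continue  # skip comments, directives and any FASTA section
        if cols[2] in ("match", "match_part"):
            records.append((cols, parse_attributes(cols[8])))

    # IDs of the parent match features whose Name is a wanted protein ID
    keep_ids = {
        attrs["ID"]
        for cols, attrs in records
        if cols[2] == "match" and "ID" in attrs and attrs.get("Name") in wanted_names
    }

    selected = []
    for cols, attrs in records:
        is_parent = cols[2] == "match" and attrs.get("ID") in keep_ids
        is_child = cols[2] == "match_part" and attrs.get("Parent") in keep_ids
        if is_parent or is_child:
            selected.append("\t".join(cols))
    return selected
```

The selected lines, written to a file, would be the input given to pred_gff in either of the two options above. The sketch holds all match/match_part records in memory, which may be slow for a very large merged GFF3; if the Name attribute does not match the FASTA IDs in a given MAKER version, the lookup key would need adjusting.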
>
> I can't reply to this original thread, but I am trying to follow Carson's suggestion for a standard build using this protocol instead now:
> "One note I'd like to make, is that doing a second round with keep_preds=1 is the wrong procedure (only do that if you really want to keep everything - i.e. in some fungi or oomycetes). Rather you should use InterProScan to evaluate the rejected models in the non-overlapping.abinit.proteins.fasta file, then grep the ones that have an IPR domain out of the GFF3 (will be match/match_part features) and then pass them to pred_gff in a separate run (just updates the format to gene/mRNA/exon/CDS with proper reading frame). You can then merge the resulting GFF3's and fasta files."
>
> Instead of doing a second round of annotations with keep_preds=1, I am using my original annotations with keep_preds=0. I have used InterProScan on the non-overlapping.abinit.proteins.fasta. I am unclear as to what gff3 file to use to grep for genes with IPR domains from the non-overlapping.abinit.proteins.fasta file. Genes from the non-overlapping.abinit.proteins.fasta file are not in my .all.gff file created by the gff3_merge script.
>
> What gff3 file should I be using to resurrect proteins with IPR domains from the non-overlapping.abinit.proteins.fasta? Should I be doing an annotation with keep_preds=1 as well, and resurrecting genes with IPR domains from this gff3?
>
> Thanks.
>
> -Valerie
>
> Valerie Soza, Ph.D.
> c/o Hall Lab > Department of Biology > University of Washington > Johnson Hall 202A > Box 351800 > Seattle, WA 98195-1800 > 206-543-6740 > http://staff.washington.edu/vsoza/ > > > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org From carsonhh at gmail.com Fri Mar 23 11:28:50 2018 From: carsonhh at gmail.com (Carson Holt) Date: Fri, 23 Mar 2018 11:28:50 -0600 Subject: [maker-devel] Gene loss in subsequent round of maker for fungal genome annotation In-Reply-To: References: Message-ID: <7BA23B5D-01FE-4766-B7B8-756E0385453D@gmail.com> Run A ?> no gene prediction, just cut and paste of transcript/protein alignments to generate rough models. Run B ?> Gene predictions based on training using only highly conserved subset of genes (you will have low sensitivity) Run C ?> Gene predictions based on training using broader gene set. Higher sensitivity but potentially lower specificity (sensitivity gains should outweigh any specificity loss). Finally, mnake sure you look at models in a browser to see how well evidence and models overlap. If gene fusion is an issue (falsely merged mRNA-seq assembly results will generate hints that can cause gene predictors to fuse gene models), try deFusion ?> https://wjidea.github.io/defusion/installation.html ?Carson > On Mar 21, 2018, at 3:05 AM, Urmi wrote: > > Hello maker community, > > I am trying to run maker 3.01.02-beta on a fungal genome. I am using available EST and protein sequences from a different strain of the same species using parameters "est" and "protein" in the maker_opts.ctl file. 
Here is the protocol I am using: > > Run maker with repeat masking and providing transcript and protein sequences from related species (Run A) > Create SNAP model with CEGMA > Train Augustus with BUSCO > Run (run B ) with the new SNAP (done at step 2) and augustus species with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (altest_gff=runA_cdna2genome.gff, protein_gff=runA_protein2genome.gff3) > Create SNAP model from run B. > Train Augustus with transcripts from run B and BUSCO > Run (run C ) with the new SNAP (done at step 5) and augustus species with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (altest_gff=runA_cdna2genome.gff, protein_gff=runA_protein2genome.gff3), keep_preds=1 > As a result of this, I get following gene numbers: > > run A: 12796 total genes out of which 12771 have AED < 0.5 > run B:10713 total genes out of which 10701 have AED < 0.5 > run C: 12651 total genes out of which 12582 have AED < 0.5 > Looking at the gff files in detail, it is observerd that there are some gene models in run A which are lost in run B and gain in run C. I don't understand why there is gene loss for run B. Here is an example: > > RunA > > contig1 maker gene 20468 21193 . + . ID=maker-contig1-exonerate_protein2genome-gene-0.34;Name=maker-contig1-exonerate_protein2genome-gene-0.34 > contig1 maker mRNA 20468 21193 100 + . ID=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1;Parent=maker-contig1-exonerate_protein2genome-gene-0.34;Name=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1;_AED=0.30;_eAED=0.30;_QI=0|-1|0|1|-1|0|1|0|241 > contig1 maker exon 20468 21193 . + . ID=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1:1;Parent=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1 > contig1 maker CDS 20468 21193 . 
+ 0 ID=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1:cds;Parent=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1 > contig1 blastn expressed_sequence_match 20468 21193 726 + . ID=contig1:hit:983:3.2.0.0;Name=jgi|test_1|140804|est target_length=726 > contig1 blastn match_part 20468 21193 726 + . ID=contig1:hsp:998:3.2.0.0;Parent=contig1:hit:983:3.2.0.0;Target=jgi|test_1|140804|est 1 726 +;Gap=M726 > contig1 est2genome expressed_sequence_match 20468 21193 3630 + . ID=contig1:hit:1022:3.2.0.0;Name=jgi|test_1|140804|est;target_length=726;aligned_coverage=100;aligned_identity=100 > contig1 est2genome match_part 20468 21193 3630 + . ID=contig1:hsp:1110:3.2.0.0;Parent=contig1:hit:1022:3.2.0.0;Target=jgi|test_1|140804|est 1 726 +;Gap=M726 > > RunB: > contig1 est_gff:est2genome expressed_sequence_match 20468 21193 3630 + . ID=contig1:hit:1051:3.12.0.0;Name=jgi|test_1|140804|est;target_length=726;aligned_coverage=100;aligned_identity=100;aligned_coverage=100;aligned_identity=100;score=3630;target_length=726 > contig1 est_gff:est2genome match_part 20468 21193 3630 + . ID=contig1:hsp:1166:3.12.0.0;Parent=contig1:hit:1051:3.12.0.0;Target=jgi|test_1|140804|est 1 726 +;Gap=M726 > > RunC: > contig1 maker gene 20468 21193 . + . ID=snap_masked-contig1-processed-gene-0.5;Name=snap_masked-contig1-processed-gene-0.5 > contig1 maker mRNA 20468 21193 . + . ID=snap_masked-contig1-processed-gene-0.5-mRNA-1;Parent=snap_masked-contig1-processed-gene-0.5;Name=snap_masked-contig1-processed-gene-0.5-mRNA-1;_AED=0.30;_eAED=0.30;_QI=0|-1|0|1|-1|1|1|0|241;_merge_warning=1 > contig1 maker exon 20468 21193 . + . ID=snap_masked-contig1-processed-gene-0.5-mRNA-1:1;Parent=snap_masked-contig1-processed-gene-0.5-mRNA-1 > contig1 maker CDS 20468 21193 . + 0 ID=snap_masked-contig1-processed-gene-0.5-mRNA-1:cds;Parent=snap_masked-contig1-processed-gene-0.5-mRNA-1 > contig1 snap_masked match 20468 21193 42.956 + . 
ID=contig1:hit:5240:4.5.0.0;Name=snap_masked-contig1-abinit-gene-0.5-mRNA-1;target_length=4075195
> contig1 snap_masked match_part 20468 21193 42.956 + . ID=contig1:hsp:12911:4.5.0.0;Parent=contig1:hit:5240:4.5.0.0;Target=snap_masked-contig1-abinit-gene-0.5-mRNA-1 1 726 +;Gap=M726
> contig1 est_gff:est2genome expressed_sequence_match 20468 21193 3630 + . ID=contig1:hit:1051:3.12.0.0;Name=jgi|test_1|140804|est;target_length=726;aligned_coverage=100;aligned_identity=100;aligned_coverage=100;aligned_identity=100;score=3630;target_length=726
> contig1 est_gff:est2genome match_part 20468 21193 3630 + . ID=contig1:hsp:1166:3.12.0.0;Parent=contig1:hit:1051:3.12.0.0;Target=jgi|test_1|140804|est 1 726 +;Gap=M726
>
> Please could anyone shed some light on this?
>
>
> Many thanks in advance.
>
> Urmi
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From urmi208 at gmail.com Mon Mar 26 01:28:21 2018
From: urmi208 at gmail.com (Urmi)
Date: Mon, 26 Mar 2018 08:28:21 +0100
Subject: [maker-devel] Gene loss in subsequent round of maker for fungal genome annotation
In-Reply-To: <7BA23B5D-01FE-4766-B7B8-756E0385453D@gmail.com>
References: <7BA23B5D-01FE-4766-B7B8-756E0385453D@gmail.com>
Message-ID:

That's great! Thanks for the tips Carson.

Urmi

On Fri, Mar 23, 2018 at 5:28 PM, Carson Holt wrote:

> Run A -> no gene prediction, just cut and paste of transcript/protein
> alignments to generate rough models.
> Run B -> Gene predictions based on training using only highly conserved
> subset of genes (you will have low sensitivity)
> Run C -> Gene predictions based on training using broader gene set. Higher
> sensitivity but potentially lower specificity (sensitivity gains should
> outweigh any specificity loss).
>
> Finally, make sure you look at models in a browser to see how well
> evidence and models overlap. If gene fusion is an issue (falsely merged
> mRNA-seq assembly results will generate hints that can cause gene
> predictors to fuse gene models), try deFusion -> https://wjidea.github.io/defusion/installation.html
>
> -Carson
>
>
>
> On Mar 21, 2018, at 3:05 AM, Urmi wrote:
>
> Hello maker community,
>
> I am trying to run maker 3.01.02-beta on a fungal genome. I am using
> available EST and protein sequences from a different strain of the same
> species using parameters "est" and "protein" in the maker_opts.ctl file.
> Here is the protocol I am using:
>
> 1. Run maker with repeat masking and providing transcript and protein
> sequences from related species (Run A)
> 2. Create SNAP model with CEGMA
> 3. Train Augustus with BUSCO
> 4. Run (run B) with the new SNAP (done at step 2) and augustus
> species with options turned off (est2genome=0) and (protein2genome=0) data,
> provide gff file (altest_gff=runA_cdna2genome.gff, protein_gff=runA_protein2genome.gff3)
> 5. Create SNAP model from run B.
> 6. Train Augustus with transcripts from run B and BUSCO
> 7. Run (run C) with the new SNAP (done at step 5) and augustus
> species with options turned off (est2genome=0) and (protein2genome=0) data,
> provide gff file (altest_gff=runA_cdna2genome.gff, protein_gff=runA_protein2genome.gff3),
> keep_preds=1
>
> As a result of this, I get the following gene numbers:
>
> - run A: 12796 total genes out of which 12771 have AED < 0.5
> - run B: 10713 total genes out of which 10701 have AED < 0.5
> - run C: 12651 total genes out of which 12582 have AED < 0.5
>
> Looking at the gff files in detail, it is observed that there are some
> gene models in run A which are lost in run B and gained in run C. I don't
> understand why there is gene loss for run B. Here is an example:
>
> *RunA*
>
> contig1 maker gene 20468 21193 . + .
>>> ID=maker-contig1-exonerate_protein2genome-gene-0.34;Name= >>> maker-contig1-exonerate_protein2genome-gene-0.34 >> >> contig1 maker mRNA 20468 21193 100 + . >>> ID=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA- >>> 1;Parent=maker-contig1-exonerate_protein2genome-gene- >>> 0.34;Name=maker-contig1-exonerate_protein2genome-gene- >>> 0.34-mRNA-1;_AED=0.30;_eAED=0.30;_QI=0|-1|0|1|-1|0|1|0|241 >> >> contig1 maker exon 20468 21193 . + . >>> ID=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA- >>> 1:1;Parent=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1 >> >> contig1 maker CDS 20468 21193 . + 0 >>> ID=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA- >>> 1:cds;Parent=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1 >> >> contig1 blastn expressed_sequence_match 20468 21193 726 >>> + . ID=contig1:hit:983:3.2.0.0;Name=jgi|test_1|140804|est >>> target_length=726 >> >> contig1 blastn match_part 20468 21193 726 + . >>> ID=contig1:hsp:998:3.2.0.0;Parent=contig1:hit:983:3.2.0.0;Target=jgi|test_1|140804|est >>> 1 726 +;Gap=M726 >> >> contig1 est2genome expressed_sequence_match 20468 21193 >>> 3630 + . ID=contig1:hit:1022:3.2.0.0; >>> Name=jgi|test_1|140804|est;target_length=726;aligned_ >>> coverage=100;aligned_identity=100 >> >> contig1 est2genome match_part 20468 21193 3630 + >>> . ID=contig1:hsp:1110:3.2.0.0;Parent=contig1:hit:1022:3.2.0.0;Target=jgi|test_1|140804|est >>> 1 726 +;Gap=M726 >> >> > *RunB:* > >> contig1 est_gff:est2genome expressed_sequence_match 20468 >>> 21193 3630 + . ID=contig1:hit:1051:3.12.0.0; >>> Name=jgi|test_1|140804|est;target_length=726;aligned_ >>> coverage=100;aligned_identity=100;aligned_coverage=100; >>> aligned_identity=100;score=3630;target_length=726 >> >> contig1 est_gff:est2genome match_part 20468 21193 3630 >>> + . ID=contig1:hsp:1166:3.12.0.0; >>> Parent=contig1:hit:1051:3.12.0.0;Target=jgi|test_1|140804|est 1 726 >>> +;Gap=M726 >> >> > *RunC: * > >> contig1 maker gene 20468 21193 . + . 
>>> ID=snap_masked-contig1-processed-gene-0.5;Name=snap_ >>> masked-contig1-processed-gene-0.5 >> >> contig1 maker mRNA 20468 21193 . + . >>> ID=snap_masked-contig1-processed-gene-0.5-mRNA-1; >>> Parent=snap_masked-contig1-processed-gene-0.5;Name=snap_ >>> masked-contig1-processed-gene-0.5-mRNA-1;_AED=0.30;_eAED=0. >>> 30;_QI=0|-1|0|1|-1|1|1|0|241;_merge_warning=1 >> >> contig1 maker exon 20468 21193 . + . >>> ID=snap_masked-contig1-processed-gene-0.5-mRNA-1:1; >>> Parent=snap_masked-contig1-processed-gene-0.5-mRNA-1 >> >> contig1 maker CDS 20468 21193 . + 0 >>> ID=snap_masked-contig1-processed-gene-0.5-mRNA-1:cds; >>> Parent=snap_masked-contig1-processed-gene-0.5-mRNA-1 >> >> contig1 snap_masked match 20468 21193 42.956 + . >>> ID=contig1:hit:5240:4.5.0.0;Name=snap_masked-contig1- >>> abinit-gene-0.5-mRNA-1;target_length=4075195 >> >> contig1 snap_masked match_part 20468 21193 42.956 + >>> . ID=contig1:hsp:12911:4.5.0.0;Parent=contig1:hit:5240:4.5.0. >>> 0;Target=snap_masked-contig1-abinit-gene-0.5-mRNA-1 1 726 +;Gap=M726 >> >> contig1 est_gff:est2genome expressed_sequence_match 20468 >>> 21193 3630 + . ID=contig1:hit:1051:3.12.0.0; >>> Name=jgi|test_1|140804|est;target_length=726;aligned_ >>> coverage=100;aligned_identity=100;aligned_coverage=100; >>> aligned_identity=100;score=3630;target_length=726 >> >> contig1 est_gff:est2genome match_part 20468 21193 3630 >>> + . ID=contig1:hsp:1166:3.12.0.0; >>> Parent=contig1:hit:1051:3.12.0.0;Target=jgi|test_1|140804|est 1 726 >>> +;Gap=M726 >> >> > Please could anyone shed come light on this? > > > Many thanks in advance. > > Urmi > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From vsoza at uw.edu Mon Mar 26 12:49:24 2018 From: vsoza at uw.edu (Valerie Soza) Date: Mon, 26 Mar 2018 11:49:24 -0700 Subject: [maker-devel] clarification on creating a standard build In-Reply-To: <4D9CD374-D142-4387-A4D7-1FB7A8A74F48@gmail.com> References: <4D9CD374-D142-4387-A4D7-1FB7A8A74F48@gmail.com> Message-ID: <57B30565-1603-4723-AF74-FEB54F735899@uw.edu> Hi Carson Thanks for the clarification on steps, but I am having issues with the first step of grepping the ID from the InterProScan .tsv file in my .gff file. I am getting no results of these IDs from the .gff file. I think the IDs used in the maker.non_overlapping_ab_initio.proteins.fasta are different from what is actually used in the .gff file. Please see my commands/results below. I created the .gff file by this command: gff3_merge -d Rwill7_master_datastore_index.log I created the .fasta files by this command: fasta_merge -d Rwill7_master_datastore_index.log I ran InterProScan with this command: interproscan.sh -appl PfamA -iprlookup -goterms -f tsv -i Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta When I try grepping the IDs in the Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv from Rwill7.all.gff, I get nothing. See an example for one ID from the .tsv file below: $ more Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1 d146190e642a740520c9 7a782a74fe32 356 Pfam PF13365 Trypsin-like peptidase domain 77 2281.4E-17 T 20-03-2018 $ grep snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1 Rwill7.all.gff #no results There is no "processed-gene" with this ID in the Rwill7.all.gff file: $ grep snap_masked-LG12_ordered_scaffold_85-processed-gene Rwill7.all.gff LG12_ordered_scaffold_85 maker gene 63727 67053 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8;Name=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8 LG12_ordered_scaffold_85 maker mRNA 63727 67053 . + . 
ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8;Name=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1;_AED=0.81;_eAED=0.81;_QI=0|0|0|0|1|1|6|0|200 LG12_ordered_scaffold_85 maker exon 63727 63768 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42245;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 LG12_ordered_scaffold_85 maker exon 64269 64340 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42246;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 LG12_ordered_scaffold_85 maker exon 64896 65000 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42247;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 LG12_ordered_scaffold_85 maker exon 65268 65327 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42248;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 LG12_ordered_scaffold_85 maker exon 65716 65915 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42249;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 LG12_ordered_scaffold_85 maker exon 66930 67053 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42250;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 LG12_ordered_scaffold_85 maker CDS 63727 63768 . + 0 ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 LG12_ordered_scaffold_85 maker CDS 64269 64340 . + 0 ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 LG12_ordered_scaffold_85 maker CDS 64896 65000 . 
+ 0 ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
LG12_ordered_scaffold_85 maker CDS 65268 65327 . + 0 ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
LG12_ordered_scaffold_85 maker CDS 65716 65915 . + 0 ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
LG12_ordered_scaffold_85 maker CDS 66930 67053 . + 1 ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1

However, there are Names and Targets called "abinit-gene", not "processed-gene", that appear to have this gene number in the .gff file:

$ grep snap_masked-LG12_ordered_scaffold_85 Rwill7.all.gff

#some results from command above that might be snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1?

LG12_ordered_scaffold_85 snap_masked match 101798 108141 35.366 + ID=LG12_ordered_scaffold_85:hit:469677:4.5.0.0;Name=snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1
LG12_ordered_scaffold_85 snap_masked match_part 101798 102633 35.236 ID=LG12_ordered_scaffold_85:hsp:1369290:4.5.0.0;Parent=LG12_ordered_scaffold_85:hit:469677:4.5.0.0;Target=snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1 1 836 +;Gap=M836
LG12_ordered_scaffold_85 snap_masked match_part 107907 108141 0.130 ID=LG12_ordered_scaffold_85:hsp:1369291:4.5.0.0;Parent=LG12_ordered_scaffold_85:hit:469677:4.5.0.0;Target=snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1 837 1071 +;Gap=M235

So I looked at the maker.non_overlapping_ab_initio.proteins.fasta file to see if this "abinit-gene" from the .gff file was present:

$ grep snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1 Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta
#no results using the "abinit-gene" Name from the .gff file

versus using the ID from the .tsv file to grep against the maker.non_overlapping_ab_initio.proteins.fasta file:

$ grep snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1 Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta
>snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1 protein AED:1.00 eAED:1.00 QI:0|0|0|0|1|1|2|0|356

I think the issue is that the Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta is calling these "abinit-genes" in the .gff file as "processed-genes" in the .fasta file, which is then propagated into the .tsv file. Is my interpretation correct?

If so, all I would have to do is grep the .gff file replacing "processed" with "abinit", correct?

Thanks for your help.

-Valerie

> On Mar 23, 2018, at 10:20 AM, Carson Holt wrote:
>
> You will get the ID from the InterProScan report. Then you can take that ID and grep for it in the MAKER gff3.
>
> All ab initio predictions have match/match_part features created for them in the gff3. You can then take those non-gene match/match_part features and provide them to pred_gff.
>
> You then have two alternate ways to get those models into your dataset.
>
> 1. Do a second run with only that pred_gff (i.e. turn off all other MAKER options and blank out all evidence including repeat masking options) and set keep_preds=1.
>
> That will simply take the pred_gff match/match_part values and turn them into nicely formatted gene/mRNA/exon/CDS features together with associated fasta files. Those can then simply be merged into your current result using GFF3 merge.
>
> 2. Provide maker_gff (set pred_pass=0 and all other pass options to 1), provide pred_gff, and set keep_preds=1.
>
> This is the same as the previous run option, but MAKER will do the merging for you. But it will take longer since it will use the maker_gff to rebuild all models and evidence in memory and rescore everything.
>
> -Carson
>
>
>
>> On Mar 20, 2018, at 6:48 PM, Valerie Soza wrote:
>>
>> Hi MAKER community
>>
>> I am trying to create a standard build as indicated in the Campbell et al. 2014 papers in Plant Physiology and Current Protocols in Bioinformatics. I was following the protocol as outlined in Current Protocols in Bioinformatics, but then came across this thread in the MAKER google forum: https://groups.google.com/forum/#!searchin/maker-devel/quality_filter%7Csort:date/maker-devel/97aNJkT3bgk/mpL7V5QWAAAJ.
>>
>> I can't reply to this original thread, but I am trying to follow Carson's suggestion for a standard build using this protocol instead now:
>> "One note I'd like to make, is that doing a second round with keep_preds=1 is the wrong procedure (only do that if you really want to keep everything - i.e. in some fungi or oomycetes). Rather you should use InterProScan to evaluate the rejected models in the non-overlapping.abinit.proteins.fasta file, then grep the ones that have an IPR domain out of the GFF3 (will be match/match_part features) and then pass them to pred_gff in a separate run (just updates the format to gene/mRNA/exon/CDS with proper reading frame). You can then merge the resulting GFF3's and fasta files."
>>
>> Instead of doing a second round of annotations with keep_preds=1, I am using my original annotations with keep_preds=0. I have used InterProScan on the non-overlapping.abinit.proteins.fasta. I am unclear as to what gff3 file to use to grep for genes with IPR domains from the non-overlapping.abinit.proteins.fasta file. Genes from the non-overlapping.abinit.proteins.fasta file are not in my .all.gff file created by the gff3_merge script.
>>
>> What gff3 file should I be using to resurrect proteins with IPR domains from the non-overlapping.abinit.proteins.fasta? Should I be doing an annotation with keep_preds=1 as well, and resurrecting genes with IPR domains from this gff3?
>>
>> Thanks.
>>
>> -Valerie
>>
>> Valerie Soza, Ph.D.
>> c/o Hall Lab
>> Department of Biology
>> University of Washington
>> Johnson Hall 202A
>> Box 351800
>> Seattle, WA 98195-1800
>> 206-543-6740
>> http://staff.washington.edu/vsoza/
>>
>>
>> _______________________________________________
>> maker-devel mailing list
>> maker-devel at box290.bluehost.com
>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>

Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/

From vsoza at uw.edu Tue Mar 27 10:50:38 2018
From: vsoza at uw.edu (Valerie Soza)
Date: Tue, 27 Mar 2018 09:50:38 -0700
Subject: [maker-devel] how to output masked genome from MAKER
In-Reply-To: <15BF15D1-835F-4E8E-9B52-A958EF19BCD9@gmail.com>
References: <15BF15D1-835F-4E8E-9B52-A958EF19BCD9@gmail.com>
Message-ID: <7264C880-3806-443D-9CC2-E70D4366CD8A@uw.edu>

Hi Carson

Thanks, that is simple and it worked. I did the following to sort and concatenate the query.masked.fasta files into one fasta:

$ find Rwill7.maker.output -name 'query.masked.fasta' | sort -t "/" -k 5 | xargs cat > Rwill7.maker.assembly_masked.sorted.fasta

-Valerie

> On Mar 15, 2018, at 8:31 AM, Carson Holt wrote:
>
> You will just have to find and concatenate the files yourself.
>
> Something like -> find assembly.maker.output -name 'query.masked.fasta' | xargs cat > assembly_masked.fasta
>
> -Carson
>
>
>> On Mar 7, 2018, at 2:19 PM, Valerie Soza wrote:
>>
>> Hi MAKER community
>>
>> I am wondering whether it is possible to get the entire masked genome generated by MAKER as an output. I think I have found it in pieces in the query.masked.fasta files generated for each scaffold by MAKER in theVoid directories. Is there a script that anyone has used to collate these files?
>>
>> I am aware of the fasta_merge script that comes bundled with MAKER, but it does not seem to collate the query.masked.fasta files. Could this perl script be modified to do this action?
>>
>> Thanks for any help or insights.
>>
>> -Valerie
>>
>> Valerie Soza, Ph.D.
>> c/o Hall Lab
>> Department of Biology
>> University of Washington
>> Johnson Hall 202A
>> Box 351800
>> Seattle, WA 98195-1800
>> 206-543-6740
>> http://staff.washington.edu/vsoza/
>>
>>
>> _______________________________________________
>> maker-devel mailing list
>> maker-devel at box290.bluehost.com
>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>

Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/

From vsoza at uw.edu Thu Mar 29 12:42:28 2018
From: vsoza at uw.edu (Valerie Soza)
Date: Thu, 29 Mar 2018 11:42:28 -0700
Subject: [maker-devel] clarification on creating a standard build
In-Reply-To: <57B30565-1603-4723-AF74-FEB54F735899@uw.edu>
References: <4D9CD374-D142-4387-A4D7-1FB7A8A74F48@gmail.com> <57B30565-1603-4723-AF74-FEB54F735899@uw.edu>
Message-ID: <624D4D7C-2BB3-4A1D-9088-116807492D2E@uw.edu>

Hi MAKER community,

I was having issues grepping IDs from the InterProScan .tsv file against my all.gff file because the abinit genes in the all.gff file are called processed genes in the .all.maker.non_overlapping_ab_initio.proteins.fasta file, which gets propagated into the .tsv file. I solved this issue by replacing "processed" with "abinit" in the .tsv file and then grepping the all.gff file with these IDs to create pred_gff.

sed s/\-processed\-/\-abinit\-/g Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv > Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed

Then I extracted only the IDs from the .tsv file to grep against the all.gff file.

cut -f 1 Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed > Rwill7.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs

I was having non-unique top level ID errors when I tried to run maker with pred_gff, and I realized that IDs were duplicated in the .tsv file. So I went back to my list of IDs and only extracted unique IDs to grep.

sort Rwill7.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs | uniq > Rwill7.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs

Then I redid my grepping and maker ran beautifully. I tried both the options recommended by Carson, but then wound up going with option #2 cause I am lazy and now I have a standard build :)

-Valerie

> On Mar 26, 2018, at 11:49 AM, Valerie Soza wrote:
>
> Hi Carson
>
> Thanks for the clarification on steps, but I am having issues with the first step of grepping the ID from the InterProScan .tsv file in my .gff file. I am getting no results of these IDs from the .gff file. I think the IDs used in the maker.non_overlapping_ab_initio.proteins.fasta are different from what is actually used in the .gff file. Please see my commands/results below.
>
> I created the .gff file by this command:
> gff3_merge -d Rwill7_master_datastore_index.log
>
> I created the .fasta files by this command:
> fasta_merge -d Rwill7_master_datastore_index.log
>
> I ran InterProScan with this command:
> interproscan.sh -appl PfamA -iprlookup -goterms -f tsv -i Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta
>
> When I try grepping the IDs in the Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv from Rwill7.all.gff, I get nothing.
See an example for one ID from the .tsv file below: > > $ more Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv > > snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1 d146190e642a740520c9 > 7a782a74fe32 356 Pfam PF13365 Trypsin-like peptidase domain 77 2281.4E-17 T 20-03-2018 > > $ grep snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1 Rwill7.all.gff > #no results > > There is no "processed-gene" with this ID in the Rwill7.all.gff file: > > $ grep snap_masked-LG12_ordered_scaffold_85-processed-gene Rwill7.all.gff > > LG12_ordered_scaffold_85 maker gene 63727 67053 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8;Name=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8 > LG12_ordered_scaffold_85 maker mRNA 63727 67053 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8;Name=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1;_AED=0.81;_eAED=0.81;_QI=0|0|0|0|1|1|6|0|200 > LG12_ordered_scaffold_85 maker exon 63727 63768 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42245;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 > LG12_ordered_scaffold_85 maker exon 64269 64340 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42246;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 > LG12_ordered_scaffold_85 maker exon 64896 65000 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42247;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 > LG12_ordered_scaffold_85 maker exon 65268 65327 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42248;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 > LG12_ordered_scaffold_85 maker exon 65716 65915 . + . 
ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42249;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
> LG12_ordered_scaffold_85 maker exon 66930 67053 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42250;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
> LG12_ordered_scaffold_85 maker CDS 63727 63768 . + 0 ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
> LG12_ordered_scaffold_85 maker CDS 64269 64340 . + 0 ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
> LG12_ordered_scaffold_85 maker CDS 64896 65000 . + 0 ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
> LG12_ordered_scaffold_85 maker CDS 65268 65327 . + 0 ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
> LG12_ordered_scaffold_85 maker CDS 65716 65915 . + 0 ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
> LG12_ordered_scaffold_85 maker CDS 66930 67053 . + 1 ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
>
> However, there are Names and Targets called "abinit-gene", not "processed-gene", that appear to have this gene number in the .gff file:
>
> $ grep snap_masked-LG12_ordered_scaffold_85 Rwill7.all.gff
>
> #some results from command above that might be snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1?
>
> LG12_ordered_scaffold_85 snap_masked match 101798 108141 35.366 + ID=LG12_ordered_scaffold_85:hit:469677:4.5.0.0;Name=snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1
> LG12_ordered_scaffold_85 snap_masked match_part 101798 102633 35.236 ID=LG12_ordered_scaffold_85:hsp:1369290:4.5.0.0;Parent=LG12_ordered_scaffold_85:hit:469677:4.5.0.0;Target=snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1 1 836 +;Gap=M836
> LG12_ordered_scaffold_85 snap_masked match_part 107907 108141 0.130 ID=LG12_ordered_scaffold_85:hsp:1369291:4.5.0.0;Parent=LG12_ordered_scaffold_85:hit:469677:4.5.0.0;Target=snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1 837 1071 +;Gap=M235
>
> So I looked at the maker.non_overlapping_ab_initio.proteins.fasta file to see if this "abinit-gene" from the .gff file was present:
>
> $ grep snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1 Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta
> #no results using the "abinit-gene" Name from the .gff file
>
> versus using the ID from the .tsv file to grep against the maker.non_overlapping_ab_initio.proteins.fasta file:
>
> $ grep snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1 Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta
> >snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1 protein AED:1.00 eAED:1.00 QI:0|0|0|0|1|1|2|0|356
>
> I think the issue is that the Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta is calling these "abinit-genes" in the .gff file as "processed-genes" in the .fasta file, which is then propagated into the .tsv file. Is my interpretation correct?
>
> If so, all I would have to do is grep the .gff file replacing "processed" with "abinit", correct?
>
> Thanks for your help.
>
> -Valerie
>
>> On Mar 23, 2018, at 10:20 AM, Carson Holt wrote:
>>
>> You will get the ID from the InterProScan report. Then you can take that ID and grep for it in the MAKER gff3.
>>
>> All ab initio predictions have match/match_part features created for them in the gff3. You can then take those non-gene match/match_part features and provide them to pred_gff.
>>
>> You then have two alternate ways to get those models into your dataset.
>>
>> 1. Do a second run with only that pred_gff (i.e. turn off all other MAKER options and blank out all evidence including repeat masking options) and set keep_preds=1.
>>
>> That will simply take the pred_gff match/match_part values and turn them into nicely formatted gene/mRNA/exon/CDS features together with associated fasta files. Those can then simply be merged into your current result using GFF3 merge.
>>
>> 2. Provide maker_gff (set pred_pass=0 and all other pass options to 1), provide pred_gff, and set keep_preds=1.
>>
>> This is the same as the previous run option, but MAKER will do the merging for you. But it will take longer since it will use the maker_gff to rebuild all models and evidence in memory and rescore everything.
>>
>> -Carson
>>
>>
>>
>>> On Mar 20, 2018, at 6:48 PM, Valerie Soza wrote:
>>>
>>> Hi MAKER community
>>>
>>> I am trying to create a standard build as indicated in the Campbell et al. 2014 papers in Plant Physiology and Current Protocols in Bioinformatics. I was following the protocol as outlined in Current Protocols in Bioinformatics, but then came across this thread in the MAKER google forum: https://groups.google.com/forum/#!searchin/maker-devel/quality_filter%7Csort:date/maker-devel/97aNJkT3bgk/mpL7V5QWAAAJ.
>>>
>>> I can't reply to this original thread, but I am trying to follow Carson's suggestion for a standard build using this protocol instead now:
>>> "One note I'd like to make, is that doing a second round with keep_preds=1 is the wrong procedure (only do that if you really want to keep everything - i.e. in some fungi or oomycetes). Rather you should use InterProScan to evaluate the rejected models in the non-overlapping.abinit.proteins.fasta file, then grep the ones that have an IPR domain out of the GFF3 (will be match/match_part features) and then pass them to pred_gff in a separate run (just updates the format to gene/mRNA/exon/CDS with proper reading frame). You can then merge the resulting GFF3's and fasta files."
>>>
>>> Instead of doing a second round of annotations with keep_preds=1, I am using my original annotations with keep_preds=0. I have used InterProScan on the non-overlapping.abinit.proteins.fasta. I am unclear as to what gff3 file to use to grep for genes with IPR domains from the non-overlapping.abinit.proteins.fasta file. Genes from the non-overlapping.abinit.proteins.fasta file are not in my .all.gff file created by the gff3_merge script.
>>>
>>> What gff3 file should I be using to resurrect proteins with IPR domains from the non-overlapping.abinit.proteins.fasta? Should I be doing an annotation with keep_preds=1 as well, and resurrecting genes with IPR domains from this gff3?
>>>
>>> Thanks.
>>>
>>> -Valerie
>>>
>>> Valerie Soza, Ph.D.
>>> c/o Hall Lab
>>> Department of Biology
>>> University of Washington
>>> Johnson Hall 202A
>>> Box 351800
>>> Seattle, WA 98195-1800
>>> 206-543-6740
>>> http://staff.washington.edu/vsoza/
>>>
>>>
>>> _______________________________________________
>>> maker-devel mailing list
>>> maker-devel at box290.bluehost.com
>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>
>
> Valerie Soza, Ph.D.
> c/o Hall Lab
> Department of Biology
> University of Washington
> Johnson Hall 202A
> Box 351800
> Seattle, WA 98195-1800
> 206-543-6740
> http://staff.washington.edu/vsoza/
>

Valerie Soza, Ph.D.
c/o Hall Lab Department of Biology University of Washington Johnson Hall 202A Box 351800 Seattle, WA 98195-1800 206-543-6740 http://staff.washington.edu/vsoza/ From seoanezonjic at hotmail.com Tue Mar 6 02:30:24 2018 From: seoanezonjic at hotmail.com (p sz) Date: Tue, 6 Mar 2018 09:30:24 +0000 Subject: [maker-devel] Problems with failed contigs Message-ID: Hi Currently i'm annotating a fish genome using the Maker 3 version and the execution don't finish properly. I execute Maker in a shared cluster trough MPI, asking for 96 cores (in order to not block the maker manager nor the file system) in 6 nodes of 16 cores each one. The work run 3 or 4 days and the output suddenly stops. The work remains 3 days more, but the index datastore is not modified and the only files that their attributes change are the lock files. When the job exceed the working time (7 days) it is cancelled. Then, I inspect the datastore index and count the uniq started and finished tags for each scaffold. The results are: STARTED:3890 FINISHED:3378 So, there are about 500 scaffolds in which the analysis fails. When I inspect the STDERR maker output I see this error line repeated ~2000 times, at differents sites: substr outside of string at /mnt/home/users/pab_001_uma/pedro/software/maker/bin/../lib/PhatHit_utils.pm line 850 and near this line, the following: ERROR: Failed while annotating transcripts My questions are, what can I do to annotate the failed scaffolds? and in addition, is the great amount of failed analysis the reason of the job freeze? Thanks in advance -------------- next part -------------- An HTML attachment was scrubbed... URL: From vsoza at uw.edu Wed Mar 7 14:19:15 2018 From: vsoza at uw.edu (Valerie Soza) Date: Wed, 7 Mar 2018 13:19:15 -0800 Subject: [maker-devel] how to output masked genome from MAKER Message-ID: Hi MAKER community I am wondering whether it is possible to get the entire masked genome generated by MAKER as an output. 
I think I have found it in pieces in the query.masked.fasta files generated for each scaffold by MAKER in theVoid directories. Is there a script that anyone has used to collate these files? I am aware of the fast_merge script that comes bundled with MAKER, but it does not seem to collate the query.masked.fasta files. Could this perl script be modified to do this action? Thanks for any help or insights. -Valerie Valerie Soza, Ph.D. c/o Hall Lab Department of Biology University of Washington Johnson Hall 202A Box 351800 Seattle, WA 98195-1800 206-543-6740 http://staff.washington.edu/vsoza/ From flopezo84 at gmail.com Fri Mar 9 09:15:39 2018 From: flopezo84 at gmail.com (=?UTF-8?Q?Federico_L=C3=B3pez?=) Date: Fri, 9 Mar 2018 11:15:39 -0500 Subject: [maker-devel] Using PASA assemblies with MAKER Message-ID: Hello, I was wondering what might be the recommended option for using PASA2 alignment assemblies with MAKER3: 1. PASA assemblies in FASTA format (est) 2. PASA assembly structures (est_gff) 3. ORFs from PASA assemblies (protein) And related to this question, when I use the PASA2 assembly structures in GFF3 format, MAKER reports the error below. "ERROR: Non-unique top level ID for..." I suppose all the non-unique IDs need to be renamed for MAKER? Any help is greatly appreciated. -------------- next part -------------- An HTML attachment was scrubbed... URL: From wangzhennan at ioz.ac.cn Tue Mar 13 21:53:44 2018 From: wangzhennan at ioz.ac.cn (wangzhennan at ioz.ac.cn) Date: Wed, 14 Mar 2018 11:53:44 +0800 (GMT+08:00) Subject: [maker-devel] Some transcripts have no AED? Message-ID: <59a852e2.418b6.16222a46d59.Coremail.wangzhennan@ioz.ac.cn> Hi, When I used maker to get the AED value, some transcripts have no AED values in the result. Why? I used the RNA-Seq data and protein to run maker.Thank you. Best wishes. Wang T -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From d.ence at ufl.edu Wed Mar 14 05:33:01 2018 From: d.ence at ufl.edu (Ence,daniel) Date: Wed, 14 Mar 2018 11:33:01 +0000 Subject: [maker-devel] Some transcripts have no AED? In-Reply-To: <59a852e2.418b6.16222a46d59.Coremail.wangzhennan@ioz.ac.cn> References: <59a852e2.418b6.16222a46d59.Coremail.wangzhennan@ioz.ac.cn> Message-ID: <00243144-1906-485F-B2CB-977FA2BBD161@ufl.edu> Hi, can you send a few lines of examples? Do some transcripts do have AEDs? ~Daniel Sent from my iPhone On Mar 13, 2018, at 23:54, "wangzhennan at ioz.ac.cn" > wrote: Hi, When I used maker to get the AED value, some transcripts have no AED values in the result. Why? I used the RNA-Seq data and protein to run maker.Thank you. Best wishes. Wang T _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.com https://urldefense.proofpoint.com/v2/url?u=http-3A__box290.bluehost.com_mailman_listinfo_maker-2Ddevel-5Fyandell-2Dlab.org&d=DwICAg&c=pZJPUDQ3SB9JplYbifm4nt2lEVG5pWx2KikqINpWlZM&r=12jzlNvGVD0AlPJ4E7cTlw1Dvu6n9cb4kMCobJ28XPs&m=iQmWCVSvETCb_7VoClUCgmdyk2786LmVRNPvJUlyfbU&s=R9lycx5lwMU4QaJuAZevpAdW8qF801qKJl98hBHq4IQ&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From seoanezonjic at hotmail.com Wed Mar 14 07:52:12 2018 From: seoanezonjic at hotmail.com (p sz) Date: Wed, 14 Mar 2018 13:52:12 +0000 Subject: [maker-devel] Problems with failed contigs In-Reply-To: References: Message-ID: Hi I have tried to split the fasta file in small chunks and perform a execution for each chunk. Most of executions fail as I have described previously. I have inspect the results of a random chunk with 287 contigs and there are these error lines repeated 47 times: substr outside of string at /mnt/home/users/pab_001_uma/pedro/software/maker/bin/../lib/PhatHit_utils.pm line 850. 
--> rank=15, hostname=dx095
ERROR: Failed while annotating transcripts
ERROR: Chunk failed at level:1, tier_type:4
FAILED CONTIG:Sosen1_s1284
ERROR: Chunk failed at level:6, tier_type:0
FAILED CONTIG:Sosen1_s1284

Can you help me fix this problem?
Thank you in advance
Pedro Seoane

________________________________
From: p sz
Sent: Tuesday, March 6, 2018 9:30
To: maker-devel at yandell-lab.org
Subject: Problems with failed contigs

Hi
Currently i'm annotating a fish genome using the Maker 3 version and the execution don't finish properly. I execute Maker in a shared cluster trough MPI, asking for 96 cores (in order to not block the maker manager nor the file system) in 6 nodes of 16 cores each one. The work run 3 or 4 days and the output suddenly stops. The work remains 3 days more, but the index datastore is not modified and the only files that their attributes change are the lock files. When the job exceed the working time (7 days) it is cancelled. Then, I inspect the datastore index and count the uniq started and finished tags for each scaffold. The results are:
STARTED:3890
FINISHED:3378
So, there are about 500 scaffolds in which the analysis fails. When I inspect the STDERR maker output I see this error line repeated ~2000 times, at differents sites:
substr outside of string at /mnt/home/users/pab_001_uma/pedro/software/maker/bin/../lib/PhatHit_utils.pm line 850
and near this line, the following:
ERROR: Failed while annotating transcripts
My questions are, what can I do to annotate the failed scaffolds? and in addition, is the great amount of failed analysis the reason of the job freeze?
Thanks in advance
-------------- next part -------------- An HTML attachment was scrubbed...
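The STARTED/FINISHED comparison described in the message above can be sketched in Python as follows. This is only a minimal sketch and not part of MAKER: it assumes the datastore index log is tab-separated, with the scaffold name in the first column and a status word such as STARTED or FINISHED in one of the later columns. The function names `summarize_datastore_index` and `unfinished` are hypothetical.

```python
def summarize_datastore_index(lines):
    """Return (started, finished) sets of scaffold names.

    Assumption: each line is tab-separated, the first field is the
    scaffold name, and a later field holds the status word.
    """
    started, finished = set(), set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        name, rest = fields[0], fields[1:]
        if "STARTED" in rest:
            started.add(name)
        if "FINISHED" in rest:
            finished.add(name)
    return started, finished


def unfinished(lines):
    """Return the sorted scaffold names that were started but never finished."""
    started, finished = summarize_datastore_index(lines)
    return sorted(started - finished)
```

To use it on a real log, open the index file and pass the file handle, for example `unfinished(open(path))`, where `path` points at the `*_master_datastore_index.log` file of the run. Unlike a plain count, this lists the names of the scaffolds that still need attention.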
URL: From vsoza at uw.edu Wed Mar 14 18:21:26 2018 From: vsoza at uw.edu (Valerie Soza) Date: Wed, 14 Mar 2018 17:21:26 -0700 Subject: [maker-devel] scaffolds missing from master_datastore_index.log and all.gff files Message-ID: <6F2917B6-12D5-41D6-BD12-CDD7FDF0F5C8@uw.edu> Hi MAKER community I have done several rounds of training and annotations on an assembly that consists of 12027 scaffolds using the MAKER2 pipeline. I am running multiple instances of MAKER to speed up the process. I have noticed that the number of contigs (aka scaffolds) differs among the different rounds of annotations I have done in MAKER, ranging from 12024 to 12027 scaffolds. These counts were obtained with the SOBAcl tool to count the number of contigs from each all.gff file generated by the gff3_merge script included in MAKER. In my latest round of annotations within MAKER, I have only obtained 12026 scaffolds using the SOBAcl tool on the all.gff file, indicating that I am missing 1 scaffold, even though there was no indication of any scaffolds as FAILED, RETRY, or SKIPPED. To figure out what might be going on, I searched for STARTED and FINISHED scaffolds in the master_datastore_index.log and found that I had a different number of started vs. finished scaffolds, and none of these were equal to the total in the assembly of 12027. $ grep STARTED Rwill7_master_datastore_index.log | sort | uniq | cut -f 1 | wc 12024 12024 313247 3 started scaffolds missing from this file are LG01_ordered_scaffold_2, LG01_ordered_scaffold_3, and LG07_unordered_scaffold_86. $ grep FINISHED Rwill7_master_datastore_index.log | sort | uniq | cut -f 1 | wc 12026 12026 313295 1 finished scaffold missing from this file is LG08_unordered_scaffold_90. I then searched for these scaffolds in the all.gff file and found that the 3 missing started scaffolds were present, but the one missing finished scaffold (LG08_unordered_scaffold_90) was not. 
This scaffold (LG08_unordered_scaffold_90) should be in the gff3 as it had some repeat masking done on it as indicated by the query.masked.fasta file for this scaffold in its theVoid directory.

After looking at the gff3_merge and fasta_merge scripts, it seems that only finished scaffolds are used to generate gff3 and fasta files so this explains why I am missing one scaffold (LG08_unordered_scaffold_90) for a total of only 12026 scaffolds in the all.gff file.

I am concerned that because the started and finished scaffolds are different in the master_datastore_index.log, not all scaffolds are being output to the gff3 and fasta files generated by the MAKER scripts.

Any insights as to why I am getting different numbers of scaffolds indicated as started versus finished? and as to why all but 1 scaffold finished?

Thanks.

-Valerie

Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/

From d.ence at ufl.edu Thu Mar 15 07:15:00 2018
From: d.ence at ufl.edu (Ence,daniel)
Date: Thu, 15 Mar 2018 13:15:00 +0000
Subject: [maker-devel] Fwd: Some transcripts have no AED?
References: <8E389D32-28DE-4FE7-A455-15786C1B2EAF@mail.ufl.edu>
Message-ID: <8AE0B9E6-2496-43AE-BC79-9FDDD2A0CFD3@mail.ufl.edu>

Begin forwarded message:

From: "Ence,daniel"
Subject: Re: [maker-devel] Some transcripts have no AED?
Date: March 15, 2018 at 9:06:45 AM EDT
To: "wangzhennan at ioz.ac.cn"
Cc: "Ence,daniel"

Hi, I really can't help you without more information to answer either of your questions (number of genes or the transcripts without AEDs). If you ran maker with some evidence to annotate those 23000 genes, then some of those genes were probably not supported by the evidence. Did you give those 23000 genes as predictions or as models? Those are two different options in maker and would give different results.
~Daniel

On Mar 15, 2018, at 4:11 AM, wangzhennan at ioz.ac.cn wrote:

Hi,
I am sorry that I did not explain my question clearly. I ran MAKER with an annotation file (GFF) that has 23000 genes, but I got a result with only 21553 genes. Why are there fewer genes than in the original file? Thank you very much!
Best wishes.
Wang

-----Original Messages-----
From: "Ence,daniel"
Sent Time: 2018-03-14 19:33:01 (Wednesday)
To: "wangzhennan at ioz.ac.cn"
Cc: "maker-devel at yandell-lab.org"
Subject: Re: [maker-devel] Some transcripts have no AED?

Hi, can you send a few lines of examples? Do some transcripts do have AEDs?

~Daniel

Sent from my iPhone

On Mar 13, 2018, at 23:54, "wangzhennan at ioz.ac.cn" wrote:

Hi,
When I used maker to get the AED value, some transcripts have no AED values in the result. Why? I used the RNA-Seq data and protein to run maker. Thank you.
Best wishes.
Wang
T
_______________________________________________
maker-devel mailing list
maker-devel at box290.bluehost.com
https://urldefense.proofpoint.com/v2/url?u=http-3A__box290.bluehost.com_mailman_listinfo_maker-2Ddevel-5Fyandell-2Dlab.org&d=DwICAg&c=pZJPUDQ3SB9JplYbifm4nt2lEVG5pWx2KikqINpWlZM&r=12jzlNvGVD0AlPJ4E7cTlw1Dvu6n9cb4kMCobJ28XPs&m=iQmWCVSvETCb_7VoClUCgmdyk2786LmVRNPvJUlyfbU&s=R9lycx5lwMU4QaJuAZevpAdW8qF801qKJl98hBHq4IQ&e=

From carsonhh at gmail.com Thu Mar 15 08:57:37 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 15 Mar 2018 08:57:37 -0600
Subject: [maker-devel] Problems with failed contigs
In-Reply-To:
References:
Message-ID: <2F06777F-4BED-4CC8-B4F6-0B78D3EA9405@gmail.com>

First make sure you are using the most current version (3.01.02). If you are providing a GFF3 file, it may be an issue with one or more of the features you are giving.
Finally, there is a possibility this is a BLAST report issue, as there is a BLAST truncation bug that appears to be fixed in NCBI BLAST 2.7.1+

--Carson

> On Mar 6, 2018, at 2:30 AM, p sz wrote:
>
> Hi
> Currently i'm annotating a fish genome using the Maker 3 version and the execution don't finish properly. I execute Maker in a shared cluster trough MPI, asking for 96 cores (in order to not block the maker manager nor the file system) in 6 nodes of 16 cores each one. The work run 3 or 4 days and the output suddenly stops. The work remains 3 days more, but the index datastore is not modified and the only files that their attributes change are the lock files. When the job exceed the working time (7 days) it is cancelled. Then, I inspect the datastore index and count the uniq started and finished tags for each scaffold. The results are:
> STARTED:3890
> FINISHED:3378
> So, there are about 500 scaffolds in which the analysis fails. When I inspect the STDERR maker output I see this error line repeated ~2000 times, at differents sites:
> substr outside of string at /mnt/home/users/pab_001_uma/pedro/software/maker/bin/../lib/PhatHit_utils.pm line 850
> and near this line, the following:
> ERROR: Failed while annotating transcripts
> My questions are, what can I do to annotate the failed scaffolds? and in addition, is the great amount of failed analysis the reason of the job freeze?
> Thanks in advance
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
-------------- next part -------------- An HTML attachment was scrubbed...
URL:

From carsonhh at gmail.com Thu Mar 15 09:15:09 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 15 Mar 2018 09:15:09 -0600
Subject: [maker-devel] Using PASA assemblies with MAKER
In-Reply-To:
References:
Message-ID: <8542B83E-159A-47CA-A801-364386E0E2D2@gmail.com>

MAKER requires the two-level match/match_part format. An example can be found in the GFF3 specification --> https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md

I haven't used PASA, but the file you are providing may be in GFF2 or GTF format, which is not compatible with GFF3.

--Carson

> On Mar 9, 2018, at 9:15 AM, Federico López wrote:
>
> Hello,
>
> I was wondering what might be the recommended option for using PASA2 alignment assemblies with MAKER3:
>
> 1. PASA assemblies in FASTA format (est)
> 2. PASA assembly structures (est_gff)
> 3. ORFs from PASA assemblies (protein)
>
> And related to this question, when I use the PASA2 assembly structures in GFF3 format, MAKER reports the error below.
>
> "ERROR: Non-unique top level ID for..."
>
> I suppose all the non-unique IDs need to be renamed for MAKER?
>
> Any help is greatly appreciated.
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From carsonhh at gmail.com Thu Mar 15 09:20:08 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 15 Mar 2018 09:20:08 -0600
Subject: [maker-devel] Some transcripts have no AED?
In-Reply-To: <59a852e2.418b6.16222a46d59.Coremail.wangzhennan@ioz.ac.cn>
References: <59a852e2.418b6.16222a46d59.Coremail.wangzhennan@ioz.ac.cn>
Message-ID:

Do they have no AED or an AED value of 0? A value of 0 means a perfect match to the evidence.

Also make sure you are not looking for AED in the predictions (i.e. GFF3 source column snap/augustus that are of type match/match_part).
Those are rejected models and will not have quality statistics calculated for them (they are only there as alignments for reference purposes). Only models with the source column of 'maker' and a type of 'gene/mRNA/exon/CDS' represent the gene models and will have AED and QI values.

You can force MAKER to also calculate quality statistics for rejected models by setting pred_stats=1 in the maker_opts.ctl file. The default entry is:

pred_stats=0 #report AED and QI statistics for all predictions as well as models

--Carson

> On Mar 13, 2018, at 9:53 PM, wangzhennan at ioz.ac.cn wrote:
>
> Hi,
>
> When I used maker to get the AED value, some transcripts have no AED values in the result. Why? I used the RNA-Seq data and protein to run maker.Thank you.
>
> Best wishes.
>
> Wang
>
> T
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From carsonhh at gmail.com Thu Mar 15 09:26:26 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 15 Mar 2018 09:26:26 -0600
Subject: [maker-devel] scaffolds missing from master_datastore_index.log and all.gff files
In-Reply-To: <6F2917B6-12D5-41D6-BD12-CDD7FDF0F5C8@uw.edu>
References: <6F2917B6-12D5-41D6-BD12-CDD7FDF0F5C8@uw.edu>
Message-ID:

If you run multiple MAKER jobs at the same time, you can hit a race condition where one MAKER run keeps another from making the correct entry in the datastore log.

You can delete the log and then run 'maker -dsindex' to rebuild it with a single maker process (takes less than 5 minutes).

--Carson

> On Mar 14, 2018, at 6:21 PM, Valerie Soza wrote:
>
> Hi MAKER community
>
> I have done several rounds of training and annotations on an assembly that consists of 12027 scaffolds using the MAKER2 pipeline. I am running multiple instances of MAKER to speed up the process.
I have noticed that the number of contigs (aka scaffolds) differs among the different rounds of annotations I have done in MAKER, ranging from 12024 to 12027 scaffolds. These counts were obtained with the SOBAcl tool to count the number of contigs from each all.gff file generated by the gff3_merge script included in MAKER. > > In my latest round of annotations within MAKER, I have only obtained 12026 scaffolds using the SOBAcl tool on the all.gff file, indicating that I am missing 1 scaffold, even though there was no indication of any scaffolds as FAILED, RETRY, or SKIPPED. > > To figure out what might be going on, I searched for STARTED and FINISHED scaffolds in the master_datastore_index.log and found that I had a different number of started vs. finished scaffolds, and none of these were equal to the total in the assembly of 12027. > > $ grep STARTED Rwill7_master_datastore_index.log | sort | uniq | cut -f 1 | wc > 12024 12024 313247 > > 3 started scaffolds missing from this file are LG01_ordered_scaffold_2, LG01_ordered_scaffold_3, and LG07_unordered_scaffold_86. > > $ grep FINISHED Rwill7_master_datastore_index.log | sort | uniq | cut -f 1 | wc > 12026 12026 313295 > > 1 finished scaffold missing from this file is LG08_unordered_scaffold_90. > > I then searched for these scaffolds in the all.gff file and found that the 3 missing started scaffolds were present, but the one missing finished scaffold (LG08_unordered_scaffold_90) was not. This scaffold (LG08_unordered_scaffold_90) should be in the gff3 as it had some repeat masking done on it as indicated by the query.masked.fasta file for this scaffold in its theVoid directory. > > After looking at the gff3_merge and fasta_merge scripts, it seems that only finished scaffolds are used to generate gff3 and fasta files so this explains why I am missing one scaffold (LG08_unordered_scaffold_90) for a total of only 12026 scaffolds in the all.gff file. 
>
> I am concerned that because the started and finished scaffolds are different in the master_datastore_index.log, that not all scaffolds are being output to the gff3 and fasta files generated by the MAKER scripts.
>
> Any insights as to why I am getting a different numbers of scaffolds indicated as started versus finished? and as to why all but 1 scaffold finished?
>
> Thanks.
>
> -Valerie
>
> Valerie Soza, Ph.D.
> c/o Hall Lab
> Department of Biology
> University of Washington
> Johnson Hall 202A
> Box 351800
> Seattle, WA 98195-1800
> 206-543-6740
> http://staff.washington.edu/vsoza/
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From carsonhh at gmail.com Thu Mar 15 09:31:31 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 15 Mar 2018 09:31:31 -0600
Subject: [maker-devel] how to output masked genome from MAKER
In-Reply-To:
References:
Message-ID: <15BF15D1-835F-4E8E-9B52-A958EF19BCD9@gmail.com>

You will just have to find and concatenate the files yourself. Something like:

find assembly.maker.output -name 'query.masked.fasta' | xargs cat > assembly_masked.fasta

--Carson

> On Mar 7, 2018, at 2:19 PM, Valerie Soza wrote:
>
> Hi MAKER community
>
> I am wondering whether it is possible to get the entire masked genome generated by MAKER as an output. I think I have found it in pieces in the query.masked.fasta files generated for each scaffold by MAKER in theVoid directories. Is there a script that anyone has used to collate these files?
>
> I am aware of the fast_merge script that comes bundled with MAKER, but it does not seem to collate the query.masked.fasta files. Could this perl script be modified to do this action?
>
> Thanks for any help or insights.
>
> -Valerie
>
> Valerie Soza, Ph.D.
> c/o Hall Lab > Department of Biology > University of Washington > Johnson Hall 202A > Box 351800 > Seattle, WA 98195-1800 > 206-543-6740 > http://staff.washington.edu/vsoza/ > > > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org From vsoza at uw.edu Thu Mar 15 12:18:46 2018 From: vsoza at uw.edu (Valerie Soza) Date: Thu, 15 Mar 2018 11:18:46 -0700 Subject: [maker-devel] scaffolds missing from master_datastore_index.log and all.gff files In-Reply-To: References: <6F2917B6-12D5-41D6-BD12-CDD7FDF0F5C8@uw.edu> Message-ID: <53B85802-8AB9-4DB4-AA8A-137DD72AF4D7@uw.edu> Thanks, Carson, that worked. I am now getting all scaffolds in the assembly indicated as finished in the master_datastore_index.log and when I redo gff3_merge with this updated log, my all.gff file is complete too. Glad there is a quick fix for this in MAKER. Thank you, MAKER developers. -Valerie > On Mar 15, 2018, at 8:26 AM, Carson Holt wrote: > > If running multiple jobs of MAEKR at the same time, you can hot a race condition where once MAKER run keeps another from making the correct entry into the datastore log. > > You can delete the log and then run ?maker -dsindex? to rebuild it which a single maker process (takes less than 5 minutes ). > > ?Carson > > > >> On Mar 14, 2018, at 6:21 PM, Valerie Soza wrote: >> >> Hi MAKER community >> >> I have done several rounds of training and annotations on an assembly that consists of 12027 scaffolds using the MAKER2 pipeline. I am running multiple instances of MAKER to speed up the process. I have noticed that the number of contigs (aka scaffolds) differs among the different rounds of annotations I have done in MAKER, ranging from 12024 to 12027 scaffolds. These counts were obtained with the SOBAcl tool to count the number of contigs from each all.gff file generated by the gff3_merge script included in MAKER. 
>> >> In my latest round of annotations within MAKER, I have only obtained 12026 scaffolds using the SOBAcl tool on the all.gff file, indicating that I am missing 1 scaffold, even though there was no indication of any scaffolds as FAILED, RETRY, or SKIPPED. >> >> To figure out what might be going on, I searched for STARTED and FINISHED scaffolds in the master_datastore_index.log and found that I had a different number of started vs. finished scaffolds, and none of these were equal to the total in the assembly of 12027. >> >> $ grep STARTED Rwill7_master_datastore_index.log | sort | uniq | cut -f 1 | wc >> 12024 12024 313247 >> >> 3 started scaffolds missing from this file are LG01_ordered_scaffold_2, LG01_ordered_scaffold_3, and LG07_unordered_scaffold_86. >> >> $ grep FINISHED Rwill7_master_datastore_index.log | sort | uniq | cut -f 1 | wc >> 12026 12026 313295 >> >> 1 finished scaffold missing from this file is LG08_unordered_scaffold_90. >> >> I then searched for these scaffolds in the all.gff file and found that the 3 missing started scaffolds were present, but the one missing finished scaffold (LG08_unordered_scaffold_90) was not. This scaffold (LG08_unordered_scaffold_90) should be in the gff3 as it had some repeat masking done on it as indicated by the query.masked.fasta file for this scaffold in its theVoid directory. >> >> After looking at the gff3_merge and fasta_merge scripts, it seems that only finished scaffolds are used to generate gff3 and fasta files so this explains why I am missing one scaffold (LG08_unordered_scaffold_90) for a total of only 12026 scaffolds in the all.gff file. >> >> I am concerned that because the started and finished scaffolds are different in the master_datastore_index.log, that not all scaffolds are being output to the gff3 and fasta files generated by the MAKER scripts. >> >> Any insights as to why I am getting a different numbers of scaffolds indicated as started versus finished? and as to why all but 1 scaffold finished? 
>> >> Thanks. >> >> -Valerie >> >> Valerie Soza, Ph.D. >> c/o Hall Lab >> Department of Biology >> University of Washington >> Johnson Hall 202A >> Box 351800 >> Seattle, WA 98195-1800 >> 206-543-6740 >> http://staff.washington.edu/vsoza/ >> >> >> _______________________________________________ >> maker-devel mailing list >> maker-devel at box290.bluehost.com >> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > Valerie Soza, Ph.D. c/o Hall Lab Department of Biology University of Washington Johnson Hall 202A Box 351800 Seattle, WA 98195-1800 206-543-6740 http://staff.washington.edu/vsoza/ From seoanezonjic at hotmail.com Fri Mar 16 03:33:28 2018 From: seoanezonjic at hotmail.com (p sz) Date: Fri, 16 Mar 2018 09:33:28 +0000 Subject: [maker-devel] Problems with failed contigs In-Reply-To: <2F06777F-4BED-4CC8-B4F6-0B78D3EA9405@gmail.com> References: , <2F06777F-4BED-4CC8-B4F6-0B78D3EA9405@gmail.com> Message-ID: I checked the maker version and it's the 3.01.02 that you suggest. I currently running maker for training SNAP (first round) so I'm not using gff files as input. Mi input files are all in fasta format, set in the genome, est and protein variables. In fact, one error line, saids that the transcript analysis breaks the execution. I use the 2.2.30+ blast versi?n, maybe I have to use a newer version. Which version do you suggest me? Thank you in advance ________________________________ From: Carson Holt Sent: Thursday, March 15, 2018 2:57:37 PM To: p sz Cc: maker-devel at yandell-lab.org Subject: Re: [maker-devel] Problems with failed contigs First make sure you are using the most current version (3.01.02). If you are providing a GFF3 issue, it may be an issue with one or more of the features you are giving. 
Finally, there is a possibility this is a BLAST report issue, as there is a BLAST truncation bug that appears to be fixed in NCBI BLAST 2.7.1+ ?Carson On Mar 6, 2018, at 2:30 AM, p sz > wrote: Hi Currently i'm annotating a fish genome using the Maker 3 version and the execution don't finish properly. I execute Maker in a shared cluster trough MPI, asking for 96 cores (in order to not block the maker manager nor the file system) in 6 nodes of 16 cores each one. The work run 3 or 4 days and the output suddenly stops. The work remains 3 days more, but the index datastore is not modified and the only files that their attributes change are the lock files. When the job exceed the working time (7 days) it is cancelled. Then, I inspect the datastore index and count the uniq started and finished tags for each scaffold. The results are: STARTED:3890 FINISHED:3378 So, there are about 500 scaffolds in which the analysis fails. When I inspect the STDERR maker output I see this error line repeated ~2000 times, at differents sites: substr outside of string at /mnt/home/users/pab_001_uma/pedro/software/maker/bin/../lib/PhatHit_utils.pm line 850 and near this line, the following: ERROR: Failed while annotating transcripts My questions are, what can I do to annotate the failed scaffolds? and in addition, is the great amount of failed analysis the reason of the job freeze? Thanks in advance _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.com http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From vsoza at uw.edu Tue Mar 20 18:48:09 2018 From: vsoza at uw.edu (Valerie Soza) Date: Tue, 20 Mar 2018 17:48:09 -0700 Subject: [maker-devel] clarification on creating a standard build Message-ID: Hi MAKER community I am trying to create a standard build as indicated in the Campbell et al. 
2014 papers in Plant Physiology and Current Protocols in Bioinformatics. I was following the protocol as outlined in Current Protocols in Bioinformatics, but then came across this thread in the MAKER google forum: https://groups.google.com/forum/#!searchin/maker-devel/quality_filter%7Csort:date/maker-devel/97aNJkT3bgk/mpL7V5QWAAAJ.

I can't reply to this original thread, but I am trying to follow Carson's suggestion for a standard build using this protocol instead now:
"One note I'd like to make, is that doing a second round with keep_preds=1 is the wrong procedure (only do that if you really want to keep everything - i.e. in some fungi or oomycetes). Rather you should use InterProScan to evaluate the rejected models in the non-overlapping.abinit.proteins.fasta file, then grep the ones that have an IPR domain out of the GFF3 (will be match/match_part features) and then pass them to pred_gff in a separate run (just updates the format to gene/mRNA/exon/CDS with proper reading frame). You can then merge the resulting GFF3's and fasta files."

Instead of doing a second round of annotations with keep_preds=1, I am using my original annotations with keep_preds=0. I have used InterProScan on the non-overlapping.abinit.proteins.fasta. I am unclear as to what gff3 file to use to grep for genes with IPR domains from the non-overlapping.abinit.proteins.fasta file. Genes from the non-overlapping.abinit.proteins.fasta file are not in my .all.gff file created by the gff3_merge script.

What gff3 file should I be using to resurrect proteins with IPR domains from the non-overlapping.abinit.proteins.fasta? Should I be doing an annotation with keep_preds=1 as well, and resurrecting genes with IPR domains from this gff3?

Thanks.

-Valerie

Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/

From urmi208 at gmail.com  Wed Mar 21 03:05:42 2018
From: urmi208 at gmail.com (Urmi)
Date: Wed, 21 Mar 2018 09:05:42 +0000
Subject: [maker-devel] Gene loss in subsequent round of maker for fungal genome annotation
Message-ID: 

Hello maker community,

I am trying to run maker 3.01.02-beta on a fungal genome. I am using available EST and protein sequences from a different strain of the same species, via the parameters "est" and "protein" in the maker_opts.ctl file. Here is the protocol I am using:

1. Run maker with repeat masking, providing transcript and protein sequences from related species (Run A)
2. Create SNAP model with CEGMA
3. Train Augustus with BUSCO
4. Run (run B) with the new SNAP model (from step 2) and the Augustus species, with est2genome=0 and protein2genome=0 (turned off), and provide GFF files (altest_gff=runA_cdna2genome.gff, protein_gff=runA_protein2genome.gff3)
5. Create SNAP model from run B.
6. Train Augustus with transcripts from run B and BUSCO
7. Run (run C) with the new SNAP model (from step 5) and the Augustus species, with est2genome=0 and protein2genome=0 (turned off), provide GFF files (altest_gff=runA_cdna2genome.gff, protein_gff=runA_protein2genome.gff3), and set keep_preds=1

As a result of this, I get the following gene numbers:

- run A: 12796 total genes, of which 12771 have AED < 0.5
- run B: 10713 total genes, of which 10701 have AED < 0.5
- run C: 12651 total genes, of which 12582 have AED < 0.5

Looking at the gff files in detail, I observed that some gene models in run A are lost in run B and regained in run C. I don't understand why there is gene loss in run B. Here is an example:

*RunA*

contig1 maker gene 20468 21193 . + . ID=maker-contig1-exonerate_protein2genome-gene-0.34;Name=maker-contig1-exonerate_protein2genome-gene-0.34
contig1 maker mRNA 20468 21193 100 + . ID=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1;Parent=maker-contig1-exonerate_protein2genome-gene-0.34;Name=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1;_AED=0.30;_eAED=0.30;_QI=0|-1|0|1|-1|0|1|0|241
contig1 maker exon 20468 21193 . + . ID=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1:1;Parent=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1
contig1 maker CDS 20468 21193 . + 0 ID=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1:cds;Parent=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1
contig1 blastn expressed_sequence_match 20468 21193 726 + . ID=contig1:hit:983:3.2.0.0;Name=jgi|test_1|140804|est target_length=726
contig1 blastn match_part 20468 21193 726 + . ID=contig1:hsp:998:3.2.0.0;Parent=contig1:hit:983:3.2.0.0;Target=jgi|test_1|140804|est 1 726 +;Gap=M726
contig1 est2genome expressed_sequence_match 20468 21193 3630 + . ID=contig1:hit:1022:3.2.0.0;Name=jgi|test_1|140804|est;target_length=726;aligned_coverage=100;aligned_identity=100
contig1 est2genome match_part 20468 21193 3630 + . ID=contig1:hsp:1110:3.2.0.0;Parent=contig1:hit:1022:3.2.0.0;Target=jgi|test_1|140804|est 1 726 +;Gap=M726

*RunB:*

contig1 est_gff:est2genome expressed_sequence_match 20468 21193 3630 + . ID=contig1:hit:1051:3.12.0.0;Name=jgi|test_1|140804|est;target_length=726;aligned_coverage=100;aligned_identity=100;aligned_coverage=100;aligned_identity=100;score=3630;target_length=726
contig1 est_gff:est2genome match_part 20468 21193 3630 + . ID=contig1:hsp:1166:3.12.0.0;Parent=contig1:hit:1051:3.12.0.0;Target=jgi|test_1|140804|est 1 726 +;Gap=M726

*RunC:*

contig1 maker gene 20468 21193 . + . ID=snap_masked-contig1-processed-gene-0.5;Name=snap_masked-contig1-processed-gene-0.5
contig1 maker mRNA 20468 21193 . + . ID=snap_masked-contig1-processed-gene-0.5-mRNA-1;Parent=snap_masked-contig1-processed-gene-0.5;Name=snap_masked-contig1-processed-gene-0.5-mRNA-1;_AED=0.30;_eAED=0.30;_QI=0|-1|0|1|-1|1|1|0|241;_merge_warning=1
contig1 maker exon 20468 21193 . + . ID=snap_masked-contig1-processed-gene-0.5-mRNA-1:1;Parent=snap_masked-contig1-processed-gene-0.5-mRNA-1
contig1 maker CDS 20468 21193 . + 0 ID=snap_masked-contig1-processed-gene-0.5-mRNA-1:cds;Parent=snap_masked-contig1-processed-gene-0.5-mRNA-1
contig1 snap_masked match 20468 21193 42.956 + . ID=contig1:hit:5240:4.5.0.0;Name=snap_masked-contig1-abinit-gene-0.5-mRNA-1;target_length=4075195
contig1 snap_masked match_part 20468 21193 42.956 + . ID=contig1:hsp:12911:4.5.0.0;Parent=contig1:hit:5240:4.5.0.0;Target=snap_masked-contig1-abinit-gene-0.5-mRNA-1 1 726 +;Gap=M726
contig1 est_gff:est2genome expressed_sequence_match 20468 21193 3630 + . ID=contig1:hit:1051:3.12.0.0;Name=jgi|test_1|140804|est;target_length=726;aligned_coverage=100;aligned_identity=100;aligned_coverage=100;aligned_identity=100;score=3630;target_length=726
contig1 est_gff:est2genome match_part 20468 21193 3630 + . ID=contig1:hsp:1166:3.12.0.0;Parent=contig1:hit:1051:3.12.0.0;Target=jgi|test_1|140804|est 1 726 +;Gap=M726

Please could anyone shed some light on this?

Many thanks in advance.

Urmi

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From urmi208 at gmail.com  Wed Mar 21 03:24:32 2018
From: urmi208 at gmail.com (Urmi)
Date: Wed, 21 Mar 2018 09:24:32 +0000
Subject: [maker-devel] Gene loss in subsequent round of maker for fungal genome annotation
In-Reply-To: 
References: 
Message-ID: 

Further to this, I did run interproscan on all three runs and 100% of the genes from all of them have protein domains found.
I am unsure which one I should consider the best annotation. I am sorry for so many questions, but I am very new to MAKER. Thanks again for any help you could provide.

On Wed, Mar 21, 2018 at 9:05 AM, Urmi wrote:

> Hello maker community,
>
> I am trying to run maker 3.01.02-beta on a fungal genome. I am using
> available EST and protein sequences from a different strain of the same
> species using parameters "est" and "protein" in the maker_opts.ctl file.
> Here is the protocol I am using:
>
> 1. Run maker with repeat masking and providing transcript and protein
> sequences from related species (Run A)
> 2. Create SNAP model with CEGMA
> 3. Train Augustus with BUSCO
> 4. Run (run B ) with the new SNAP (done at step 2) and augustus
> species with options turned off (est2genome=0) and (protein2genome=0) data,
> provide gff file (altest_gff=runA_cdna2genome.gff, protein_gff=runA_protein2genome.gff3)
> 5. Create SNAP model from run B.
> 6. Train Augustus with transcripts from run B and BUSCO
> 7. Run (run C ) with the new SNAP (done at step 5) and augustus
> species with options turned off (est2genome=0) and (protein2genome=0) data,
> provide gff file (altest_gff=runA_cdna2genome.gff, protein_gff=runA_protein2genome.gff3),
> keep_preds=1
>
> As a result of this, I get following gene numbers:
>
> - run A: 12796 total genes out of which 12771 have AED < 0.5
> - run B:10713 total genes out of which 10701 have AED < 0.5
> - run C: 12651 total genes out of which 12582 have AED < 0.5
>
> Looking at the gff files in detail, it is observed that there are some
> gene models in run A which are lost in run B and gain in run C. I don't
> understand why there is gene loss for run B. Here is an example:
>
> *RunA*
>
> contig1 maker gene 20468 21193 . + .
>>> ID=maker-contig1-exonerate_protein2genome-gene-0.34;Name=
>>> maker-contig1-exonerate_protein2genome-gene-0.34
>>
>> contig1 maker mRNA 20468 21193 100 + .
>>> ID=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA- >>> 1;Parent=maker-contig1-exonerate_protein2genome-gene- >>> 0.34;Name=maker-contig1-exonerate_protein2genome-gene- >>> 0.34-mRNA-1;_AED=0.30;_eAED=0.30;_QI=0|-1|0|1|-1|0|1|0|241 >> >> contig1 maker exon 20468 21193 . + . >>> ID=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA- >>> 1:1;Parent=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1 >> >> contig1 maker CDS 20468 21193 . + 0 >>> ID=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA- >>> 1:cds;Parent=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1 >> >> contig1 blastn expressed_sequence_match 20468 21193 726 >>> + . ID=contig1:hit:983:3.2.0.0;Name=jgi|test_1|140804|est >>> target_length=726 >> >> contig1 blastn match_part 20468 21193 726 + . >>> ID=contig1:hsp:998:3.2.0.0;Parent=contig1:hit:983:3.2.0.0;Target=jgi|test_1|140804|est >>> 1 726 +;Gap=M726 >> >> contig1 est2genome expressed_sequence_match 20468 21193 >>> 3630 + . ID=contig1:hit:1022:3.2.0.0; >>> Name=jgi|test_1|140804|est;target_length=726;aligned_ >>> coverage=100;aligned_identity=100 >> >> contig1 est2genome match_part 20468 21193 3630 + >>> . ID=contig1:hsp:1110:3.2.0.0;Parent=contig1:hit:1022:3.2.0.0;Target=jgi|test_1|140804|est >>> 1 726 +;Gap=M726 >> >> > *RunB:* > >> contig1 est_gff:est2genome expressed_sequence_match 20468 >>> 21193 3630 + . ID=contig1:hit:1051:3.12.0.0; >>> Name=jgi|test_1|140804|est;target_length=726;aligned_ >>> coverage=100;aligned_identity=100;aligned_coverage=100; >>> aligned_identity=100;score=3630;target_length=726 >> >> contig1 est_gff:est2genome match_part 20468 21193 3630 >>> + . ID=contig1:hsp:1166:3.12.0.0; >>> Parent=contig1:hit:1051:3.12.0.0;Target=jgi|test_1|140804|est 1 726 >>> +;Gap=M726 >> >> > *RunC: * > >> contig1 maker gene 20468 21193 . + . >>> ID=snap_masked-contig1-processed-gene-0.5;Name=snap_ >>> masked-contig1-processed-gene-0.5 >> >> contig1 maker mRNA 20468 21193 . + . 
>>> ID=snap_masked-contig1-processed-gene-0.5-mRNA-1; >>> Parent=snap_masked-contig1-processed-gene-0.5;Name=snap_ >>> masked-contig1-processed-gene-0.5-mRNA-1;_AED=0.30;_eAED=0. >>> 30;_QI=0|-1|0|1|-1|1|1|0|241;_merge_warning=1 >> >> contig1 maker exon 20468 21193 . + . >>> ID=snap_masked-contig1-processed-gene-0.5-mRNA-1:1; >>> Parent=snap_masked-contig1-processed-gene-0.5-mRNA-1 >> >> contig1 maker CDS 20468 21193 . + 0 >>> ID=snap_masked-contig1-processed-gene-0.5-mRNA-1:cds; >>> Parent=snap_masked-contig1-processed-gene-0.5-mRNA-1 >> >> contig1 snap_masked match 20468 21193 42.956 + . >>> ID=contig1:hit:5240:4.5.0.0;Name=snap_masked-contig1- >>> abinit-gene-0.5-mRNA-1;target_length=4075195 >> >> contig1 snap_masked match_part 20468 21193 42.956 + >>> . ID=contig1:hsp:12911:4.5.0.0;Parent=contig1:hit:5240:4.5.0. >>> 0;Target=snap_masked-contig1-abinit-gene-0.5-mRNA-1 1 726 +;Gap=M726 >> >> contig1 est_gff:est2genome expressed_sequence_match 20468 >>> 21193 3630 + . ID=contig1:hit:1051:3.12.0.0; >>> Name=jgi|test_1|140804|est;target_length=726;aligned_ >>> coverage=100;aligned_identity=100;aligned_coverage=100; >>> aligned_identity=100;score=3630;target_length=726 >> >> contig1 est_gff:est2genome match_part 20468 21193 3630 >>> + . ID=contig1:hsp:1166:3.12.0.0; >>> Parent=contig1:hit:1051:3.12.0.0;Target=jgi|test_1|140804|est 1 726 >>> +;Gap=M726 >> >> > Please could anyone shed come light on this? > > > Many thanks in advance. > > Urmi > -- "The only way of finding the limits of the possible is by going beyond them into the impossible.*" **- Arthur C. Clarke* -------------- next part -------------- An HTML attachment was scrubbed... 
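The per-run figures quoted in the message above (total genes, and genes with AED < 0.5) can be tallied from a merged MAKER GFF3 with a short script. This is a minimal sketch, not necessarily how the poster obtained the numbers. It assumes a tab-delimited GFF3 in which annotation rows carry `maker` in the source column and each mRNA row has an `_AED` attribute, as in the excerpts in this thread; the function name `count_genes_by_aed` is hypothetical.

```python
from typing import Dict, Iterable, Tuple


def _attrs(column9: str) -> Dict[str, str]:
    """Split a GFF3 attribute column into a key -> value dict."""
    return dict(kv.split("=", 1) for kv in column9.split(";") if "=" in kv)


def count_genes_by_aed(gff_lines: Iterable[str], max_aed: float = 0.5) -> Tuple[int, int]:
    """Return (total genes, genes with at least one mRNA whose _AED < max_aed)."""
    genes = set()
    passing = set()
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # sequence section, if present: no more feature rows
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        # Assumption: final annotations have "maker" in the source column.
        if len(cols) != 9 or cols[1] != "maker":
            continue
        attrs = _attrs(cols[8])
        if cols[2] == "gene":
            genes.add(attrs.get("ID"))
        elif cols[2] == "mRNA" and "_AED" in attrs:
            if float(attrs["_AED"]) < max_aed:
                passing.add(attrs.get("Parent"))
    return len(genes), len(passing & genes)
```

Typical use would be `with open("runA.all.gff") as fh: print(count_genes_by_aed(fh))`, where the file name is a placeholder. A gene is counted as passing if any one of its transcripts is below the cutoff; counting per transcript instead would give different totals.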
URL: 

From carsonhh at gmail.com  Fri Mar 23 11:20:22 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 23 Mar 2018 11:20:22 -0600
Subject: [maker-devel] clarification on creating a standard build
In-Reply-To: 
References: 
Message-ID: <4D9CD374-D142-4387-A4D7-1FB7A8A74F48@gmail.com>

You will get the ID from the InterProScan report. Then you can take that ID and grep for it in the MAKER gff3.

All ab initio predictions have match/match_part features created for them in the gff3. You can then take those non-gene match/match_part features and provide them to pred_gff.

You then have two alternative ways to get those models into your dataset.

1. Do a second run with only that pred_gff (i.e. turn off all other MAKER options and blank out all evidence, including repeat masking options) and set keep_preds=1.

That will simply take the pred_gff match/match_part values and turn them into nicely formatted gene/mRNA/exon/CDS features together with associated fasta files. Those can then simply be merged into your current result using GFF3 merge.

2. Provide maker_gff (set pred_pass=0 and all other pass options to 1), provide pred_gff, and set keep_preds=1.

This is the same as the previous run option, but MAKER will do the merging for you. It will take longer, though, since it will use the maker_gff to rebuild all models and evidence in memory and rescore everything.

--Carson

> On Mar 20, 2018, at 6:48 PM, Valerie Soza wrote:
>
> Hi MAKER community
>
> I am trying to create a standard build as indicated in the Campbell et al. 2014 papers in Plant Physiology and Current Protocols in Bioinformatics. I was following the protocol as outlined in Current Protocols in Bioinformatics, but then came across this thread in the MAKER google forum: https://groups.google.com/forum/#!searchin/maker-devel/quality_filter%7Csort:date/maker-devel/97aNJkT3bgk/mpL7V5QWAAAJ.
>
> I can't reply to this original thread, but I am trying to follow Carson's suggestion for a standard build using this protocol instead now:
> "One note I'd like to make, is that doing a second round with keep_preds=1 is the wrong procedure (only do that if you really want to keep everything - i.e. in some fungi or oomycetes). Rather you should use InterProScan to evaluate the rejected models in the non-overlapping.abinit.proteins.fasta file, then grep the ones that have an IPR domain out of the GFF3 (will be match/match_part features) and then pass them to pred_gff in a separate run (just updates the format to gene/mRNA/exon/CDS with proper reading frame). You can then merge the resulting GFF3's and fasta files."
>
> Instead of doing a second round of annotations with keep_preds=1, I am using my original annotations with keep_preds=0. I have used InterProScan on the non-overlapping.abinit.proteins.fasta. I am unclear as to what gff3 file to use to grep for genes with IPR domains from the non-overlapping.abinit.proteins.fasta file. Genes from the non-overlapping.abinit.proteins.fasta file are not in my .all.gff file created by the gff3_merge script.
>
> What gff3 file should I be using to resurrect proteins with IPR domains from the non-overlapping.abinit.proteins.fasta? Should I be doing an annotation with keep_preds=1 as well, and resurrecting genes with IPR domains from this gff3?
>
> Thanks.
>
> -Valerie
>
> Valerie Soza, Ph.D.
> c/o Hall Lab
> Department of Biology
> University of Washington
> Johnson Hall 202A
> Box 351800
> Seattle, WA 98195-1800
> 206-543-6740
> http://staff.washington.edu/vsoza/
>
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From carsonhh at gmail.com  Fri Mar 23 11:28:50 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 23 Mar 2018 11:28:50 -0600
Subject: [maker-devel] Gene loss in subsequent round of maker for fungal genome annotation
In-Reply-To: 
References: 
Message-ID: <7BA23B5D-01FE-4766-B7B8-756E0385453D@gmail.com>

Run A -> no gene prediction, just cut and paste of transcript/protein alignments to generate rough models.
Run B -> Gene predictions based on training using only a highly conserved subset of genes (you will have low sensitivity).
Run C -> Gene predictions based on training using a broader gene set. Higher sensitivity but potentially lower specificity (sensitivity gains should outweigh any specificity loss).

Finally, make sure you look at models in a browser to see how well evidence and models overlap. If gene fusion is an issue (falsely merged mRNA-seq assembly results will generate hints that can cause gene predictors to fuse gene models), try deFusion -> https://wjidea.github.io/defusion/installation.html

--Carson

> On Mar 21, 2018, at 3:05 AM, Urmi wrote:
>
> Hello maker community,
>
> I am trying to run maker 3.01.02-beta on a fungal genome. I am using available EST and protein sequences from a different strain of the same species using parameters "est" and "protein" in the maker_opts.ctl file.
Here is the protocol I am using:
>
> 1. Run maker with repeat masking and providing transcript and protein sequences from related species (Run A)
> 2. Create SNAP model with CEGMA
> 3. Train Augustus with BUSCO
> 4. Run (run B ) with the new SNAP (done at step 2) and augustus species with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (altest_gff=runA_cdna2genome.gff, protein_gff=runA_protein2genome.gff3)
> 5. Create SNAP model from run B.
> 6. Train Augustus with transcripts from run B and BUSCO
> 7. Run (run C ) with the new SNAP (done at step 5) and augustus species with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (altest_gff=runA_cdna2genome.gff, protein_gff=runA_protein2genome.gff3), keep_preds=1
>
> As a result of this, I get following gene numbers:
>
> run A: 12796 total genes out of which 12771 have AED < 0.5
> run B:10713 total genes out of which 10701 have AED < 0.5
> run C: 12651 total genes out of which 12582 have AED < 0.5
>
> Looking at the gff files in detail, it is observed that there are some gene models in run A which are lost in run B and gain in run C. I don't understand why there is gene loss for run B. Here is an example:
>
> RunA
>
> contig1 maker gene 20468 21193 . + . ID=maker-contig1-exonerate_protein2genome-gene-0.34;Name=maker-contig1-exonerate_protein2genome-gene-0.34
> contig1 maker mRNA 20468 21193 100 + . ID=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1;Parent=maker-contig1-exonerate_protein2genome-gene-0.34;Name=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1;_AED=0.30;_eAED=0.30;_QI=0|-1|0|1|-1|0|1|0|241
> contig1 maker exon 20468 21193 . + . ID=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1:1;Parent=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1
> contig1 maker CDS 20468 21193 . + 0 ID=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1:cds;Parent=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1
> contig1 blastn expressed_sequence_match 20468 21193 726 + . ID=contig1:hit:983:3.2.0.0;Name=jgi|test_1|140804|est target_length=726
> contig1 blastn match_part 20468 21193 726 + . ID=contig1:hsp:998:3.2.0.0;Parent=contig1:hit:983:3.2.0.0;Target=jgi|test_1|140804|est 1 726 +;Gap=M726
> contig1 est2genome expressed_sequence_match 20468 21193 3630 + . ID=contig1:hit:1022:3.2.0.0;Name=jgi|test_1|140804|est;target_length=726;aligned_coverage=100;aligned_identity=100
> contig1 est2genome match_part 20468 21193 3630 + . ID=contig1:hsp:1110:3.2.0.0;Parent=contig1:hit:1022:3.2.0.0;Target=jgi|test_1|140804|est 1 726 +;Gap=M726
>
> RunB:
> contig1 est_gff:est2genome expressed_sequence_match 20468 21193 3630 + . ID=contig1:hit:1051:3.12.0.0;Name=jgi|test_1|140804|est;target_length=726;aligned_coverage=100;aligned_identity=100;aligned_coverage=100;aligned_identity=100;score=3630;target_length=726
> contig1 est_gff:est2genome match_part 20468 21193 3630 + . ID=contig1:hsp:1166:3.12.0.0;Parent=contig1:hit:1051:3.12.0.0;Target=jgi|test_1|140804|est 1 726 +;Gap=M726
>
> RunC:
> contig1 maker gene 20468 21193 . + . ID=snap_masked-contig1-processed-gene-0.5;Name=snap_masked-contig1-processed-gene-0.5
> contig1 maker mRNA 20468 21193 . + . ID=snap_masked-contig1-processed-gene-0.5-mRNA-1;Parent=snap_masked-contig1-processed-gene-0.5;Name=snap_masked-contig1-processed-gene-0.5-mRNA-1;_AED=0.30;_eAED=0.30;_QI=0|-1|0|1|-1|1|1|0|241;_merge_warning=1
> contig1 maker exon 20468 21193 . + . ID=snap_masked-contig1-processed-gene-0.5-mRNA-1:1;Parent=snap_masked-contig1-processed-gene-0.5-mRNA-1
> contig1 maker CDS 20468 21193 . + 0 ID=snap_masked-contig1-processed-gene-0.5-mRNA-1:cds;Parent=snap_masked-contig1-processed-gene-0.5-mRNA-1
> contig1 snap_masked match 20468 21193 42.956 + . ID=contig1:hit:5240:4.5.0.0;Name=snap_masked-contig1-abinit-gene-0.5-mRNA-1;target_length=4075195
> contig1 snap_masked match_part 20468 21193 42.956 + . ID=contig1:hsp:12911:4.5.0.0;Parent=contig1:hit:5240:4.5.0.0;Target=snap_masked-contig1-abinit-gene-0.5-mRNA-1 1 726 +;Gap=M726
> contig1 est_gff:est2genome expressed_sequence_match 20468 21193 3630 + . ID=contig1:hit:1051:3.12.0.0;Name=jgi|test_1|140804|est;target_length=726;aligned_coverage=100;aligned_identity=100;aligned_coverage=100;aligned_identity=100;score=3630;target_length=726
> contig1 est_gff:est2genome match_part 20468 21193 3630 + . ID=contig1:hsp:1166:3.12.0.0;Parent=contig1:hit:1051:3.12.0.0;Target=jgi|test_1|140804|est 1 726 +;Gap=M726
>
> Please could anyone shed some light on this?
>
>
> Many thanks in advance.
>
> Urmi
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From urmi208 at gmail.com  Mon Mar 26 01:28:21 2018
From: urmi208 at gmail.com (Urmi)
Date: Mon, 26 Mar 2018 08:28:21 +0100
Subject: [maker-devel] Gene loss in subsequent round of maker for fungal genome annotation
In-Reply-To: <7BA23B5D-01FE-4766-B7B8-756E0385453D@gmail.com>
References: <7BA23B5D-01FE-4766-B7B8-756E0385453D@gmail.com>
Message-ID: 

That's great! Thanks for the tips Carson.

Urmi

On Fri, Mar 23, 2018 at 5:28 PM, Carson Holt wrote:

> Run A -> no gene prediction, just cut and paste of transcript/protein
> alignments to generate rough models.
> Run B -> Gene predictions based on training using only highly conserved
> subset of genes (you will have low sensitivity)
> Run C -> Gene predictions based on training using broader gene set. Higher
> sensitivity but potentially lower specificity (sensitivity gains should
> outweigh any specificity loss).
>
> Finally, make sure you look at models in a browser to see how well
> evidence and models overlap. If gene fusion is an issue (falsely merged
> mRNA-seq assembly results will generate hints that can cause gene
> predictors to fuse gene models), try deFusion -> https://wjidea.github.io/defusion/installation.html
>
> --Carson
>
>
>
> On Mar 21, 2018, at 3:05 AM, Urmi wrote:
>
> Hello maker community,
>
> I am trying to run maker 3.01.02-beta on a fungal genome. I am using
> available EST and protein sequences from a different strain of the same
> species using parameters "est" and "protein" in the maker_opts.ctl file.
> Here is the protocol I am using:
>
> 1. Run maker with repeat masking and providing transcript and protein
> sequences from related species (Run A)
> 2. Create SNAP model with CEGMA
> 3. Train Augustus with BUSCO
> 4. Run (run B ) with the new SNAP (done at step 2) and augustus
> species with options turned off (est2genome=0) and (protein2genome=0) data,
> provide gff file (altest_gff=runA_cdna2genome.gff, protein_gff=runA_protein2genome.gff3)
> 5. Create SNAP model from run B.
> 6. Train Augustus with transcripts from run B and BUSCO
> 7. Run (run C ) with the new SNAP (done at step 5) and augustus
> species with options turned off (est2genome=0) and (protein2genome=0) data,
> provide gff file (altest_gff=runA_cdna2genome.gff, protein_gff=runA_protein2genome.gff3),
> keep_preds=1
>
> As a result of this, I get following gene numbers:
>
> - run A: 12796 total genes out of which 12771 have AED < 0.5
> - run B:10713 total genes out of which 10701 have AED < 0.5
> - run C: 12651 total genes out of which 12582 have AED < 0.5
>
> Looking at the gff files in detail, it is observed that there are some
> gene models in run A which are lost in run B and gain in run C. I don't
> understand why there is gene loss for run B. Here is an example:
>
> *RunA*
>
> contig1 maker gene 20468 21193 . + .
>>> ID=maker-contig1-exonerate_protein2genome-gene-0.34;Name= >>> maker-contig1-exonerate_protein2genome-gene-0.34 >> >> contig1 maker mRNA 20468 21193 100 + . >>> ID=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA- >>> 1;Parent=maker-contig1-exonerate_protein2genome-gene- >>> 0.34;Name=maker-contig1-exonerate_protein2genome-gene- >>> 0.34-mRNA-1;_AED=0.30;_eAED=0.30;_QI=0|-1|0|1|-1|0|1|0|241 >> >> contig1 maker exon 20468 21193 . + . >>> ID=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA- >>> 1:1;Parent=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1 >> >> contig1 maker CDS 20468 21193 . + 0 >>> ID=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA- >>> 1:cds;Parent=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1 >> >> contig1 blastn expressed_sequence_match 20468 21193 726 >>> + . ID=contig1:hit:983:3.2.0.0;Name=jgi|test_1|140804|est >>> target_length=726 >> >> contig1 blastn match_part 20468 21193 726 + . >>> ID=contig1:hsp:998:3.2.0.0;Parent=contig1:hit:983:3.2.0.0;Target=jgi|test_1|140804|est >>> 1 726 +;Gap=M726 >> >> contig1 est2genome expressed_sequence_match 20468 21193 >>> 3630 + . ID=contig1:hit:1022:3.2.0.0; >>> Name=jgi|test_1|140804|est;target_length=726;aligned_ >>> coverage=100;aligned_identity=100 >> >> contig1 est2genome match_part 20468 21193 3630 + >>> . ID=contig1:hsp:1110:3.2.0.0;Parent=contig1:hit:1022:3.2.0.0;Target=jgi|test_1|140804|est >>> 1 726 +;Gap=M726 >> >> > *RunB:* > >> contig1 est_gff:est2genome expressed_sequence_match 20468 >>> 21193 3630 + . ID=contig1:hit:1051:3.12.0.0; >>> Name=jgi|test_1|140804|est;target_length=726;aligned_ >>> coverage=100;aligned_identity=100;aligned_coverage=100; >>> aligned_identity=100;score=3630;target_length=726 >> >> contig1 est_gff:est2genome match_part 20468 21193 3630 >>> + . ID=contig1:hsp:1166:3.12.0.0; >>> Parent=contig1:hit:1051:3.12.0.0;Target=jgi|test_1|140804|est 1 726 >>> +;Gap=M726 >> >> > *RunC: * > >> contig1 maker gene 20468 21193 . + . 
>>> ID=snap_masked-contig1-processed-gene-0.5;Name=snap_ >>> masked-contig1-processed-gene-0.5 >> >> contig1 maker mRNA 20468 21193 . + . >>> ID=snap_masked-contig1-processed-gene-0.5-mRNA-1; >>> Parent=snap_masked-contig1-processed-gene-0.5;Name=snap_ >>> masked-contig1-processed-gene-0.5-mRNA-1;_AED=0.30;_eAED=0. >>> 30;_QI=0|-1|0|1|-1|1|1|0|241;_merge_warning=1 >> >> contig1 maker exon 20468 21193 . + . >>> ID=snap_masked-contig1-processed-gene-0.5-mRNA-1:1; >>> Parent=snap_masked-contig1-processed-gene-0.5-mRNA-1 >> >> contig1 maker CDS 20468 21193 . + 0 >>> ID=snap_masked-contig1-processed-gene-0.5-mRNA-1:cds; >>> Parent=snap_masked-contig1-processed-gene-0.5-mRNA-1 >> >> contig1 snap_masked match 20468 21193 42.956 + . >>> ID=contig1:hit:5240:4.5.0.0;Name=snap_masked-contig1- >>> abinit-gene-0.5-mRNA-1;target_length=4075195 >> >> contig1 snap_masked match_part 20468 21193 42.956 + >>> . ID=contig1:hsp:12911:4.5.0.0;Parent=contig1:hit:5240:4.5.0. >>> 0;Target=snap_masked-contig1-abinit-gene-0.5-mRNA-1 1 726 +;Gap=M726 >> >> contig1 est_gff:est2genome expressed_sequence_match 20468 >>> 21193 3630 + . ID=contig1:hit:1051:3.12.0.0; >>> Name=jgi|test_1|140804|est;target_length=726;aligned_ >>> coverage=100;aligned_identity=100;aligned_coverage=100; >>> aligned_identity=100;score=3630;target_length=726 >> >> contig1 est_gff:est2genome match_part 20468 21193 3630 >>> + . ID=contig1:hsp:1166:3.12.0.0; >>> Parent=contig1:hit:1051:3.12.0.0;Target=jgi|test_1|140804|est 1 726 >>> +;Gap=M726 >> >> > Please could anyone shed come light on this? > > > Many thanks in advance. > > Urmi > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > -------------- next part -------------- An HTML attachment was scrubbed... 
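The rescue step from Carson's standard-build reply earlier in this archive (take the IDs from the InterProScan report, grep for them in the MAKER GFF3, and pass the matching match/match_part features to pred_gff) could be scripted roughly as follows. This is a sketch under stated assumptions, not MAKER's own tooling: the function names are hypothetical; the GFF3 is assumed to be tab-delimited with `match` rows preceding their `match_part` children; the optional `require_ipr` check assumes the InterPro accession sits in column 12 of the TSV when `-iprlookup` is used; and `rename` is a hypothetical hook for the case where report IDs are not spelled exactly like the `Name` values in the GFF3.

```python
from typing import Dict, Iterable, Iterator, Optional, Set, Tuple


def ids_from_interproscan(tsv_lines: Iterable[str], require_ipr: bool = False) -> Set[str]:
    """Collect protein IDs (column 1) from an InterProScan TSV report."""
    ids = set()
    for line in tsv_lines:
        cols = line.rstrip("\n").split("\t")
        if not cols[0]:
            continue  # blank line
        # Assumption: column 12 holds the InterPro accession (IPR...) if present.
        if require_ipr and not (len(cols) > 11 and cols[11].startswith("IPR")):
            continue
        ids.add(cols[0])
    return ids


def _attrs(column9: str) -> Dict[str, str]:
    """Split a GFF3 attribute column into a key -> value dict."""
    return dict(kv.split("=", 1) for kv in column9.split(";") if "=" in kv)


def select_pred_gff(
    gff_lines: Iterable[str],
    wanted: Set[str],
    rename: Optional[Tuple[str, str]] = None,
) -> Iterator[str]:
    """Yield match rows whose Name is wanted, plus their match_part children."""
    if rename:
        # Hypothetical hook: substring substitution applied to the wanted IDs.
        wanted = {w.replace(rename[0], rename[1]) for w in wanted}
    kept_hits = set()
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] not in ("match", "match_part"):
            continue
        attrs = _attrs(cols[8])
        if cols[2] == "match" and attrs.get("Name") in wanted:
            kept_hits.add(attrs.get("ID"))
            yield line.rstrip("\n")
        elif cols[2] == "match_part" and attrs.get("Parent") in kept_hits:
            yield line.rstrip("\n")
```

The selected rows, written to a file, would be what gets supplied through pred_gff in maker_opts.ctl. Matching on the `Name` attribute rather than on a raw substring avoids picking up IDs that merely share a prefix (for example gene-0.1 versus gene-0.12).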
URL: From vsoza at uw.edu Mon Mar 26 12:49:24 2018 From: vsoza at uw.edu (Valerie Soza) Date: Mon, 26 Mar 2018 11:49:24 -0700 Subject: [maker-devel] clarification on creating a standard build In-Reply-To: <4D9CD374-D142-4387-A4D7-1FB7A8A74F48@gmail.com> References: <4D9CD374-D142-4387-A4D7-1FB7A8A74F48@gmail.com> Message-ID: <57B30565-1603-4723-AF74-FEB54F735899@uw.edu> Hi Carson Thanks for the clarification on steps, but I am having issues with the first step of grepping the ID from the InterProScan .tsv file in my .gff file. I am getting no results of these IDs from the .gff file. I think the IDs used in the maker.non_overlapping_ab_initio.proteins.fasta are different from what is actually used in the .gff file. Please see my commands/results below. I created the .gff file by this command: gff3_merge -d Rwill7_master_datastore_index.log I created the .fasta files by this command: fasta_merge -d Rwill7_master_datastore_index.log I ran InterProScan with this command: interproscan.sh -appl PfamA -iprlookup -goterms -f tsv -i Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta When I try grepping the IDs in the Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv from Rwill7.all.gff, I get nothing. See an example for one ID from the .tsv file below: $ more Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1 d146190e642a740520c9 7a782a74fe32 356 Pfam PF13365 Trypsin-like peptidase domain 77 2281.4E-17 T 20-03-2018 $ grep snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1 Rwill7.all.gff #no results There is no "processed-gene" with this ID in the Rwill7.all.gff file: $ grep snap_masked-LG12_ordered_scaffold_85-processed-gene Rwill7.all.gff LG12_ordered_scaffold_85 maker gene 63727 67053 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8;Name=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8 LG12_ordered_scaffold_85 maker mRNA 63727 67053 . + . 
ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8;Name=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1;_AED=0.81;_eAED=0.81;_QI=0|0|0|0|1|1|6|0|200 LG12_ordered_scaffold_85 maker exon 63727 63768 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42245;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 LG12_ordered_scaffold_85 maker exon 64269 64340 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42246;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 LG12_ordered_scaffold_85 maker exon 64896 65000 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42247;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 LG12_ordered_scaffold_85 maker exon 65268 65327 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42248;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 LG12_ordered_scaffold_85 maker exon 65716 65915 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42249;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 LG12_ordered_scaffold_85 maker exon 66930 67053 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42250;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 LG12_ordered_scaffold_85 maker CDS 63727 63768 . + 0 ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 LG12_ordered_scaffold_85 maker CDS 64269 64340 . + 0 ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 LG12_ordered_scaffold_85 maker CDS 64896 65000 . 
+ 0 ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
LG12_ordered_scaffold_85 maker CDS 65268 65327 . + 0 ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
LG12_ordered_scaffold_85 maker CDS 65716 65915 . + 0 ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
LG12_ordered_scaffold_85 maker CDS 66930 67053 . + 1 ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1

However, there are Names and Targets called "abinit-gene", not "processed-gene", that appear to have this gene number in the .gff file:

$ grep snap_masked-LG12_ordered_scaffold_85 Rwill7.all.gff
#some results from command above that might be snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1?
LG12_ordered_scaffold_85 snap_masked match 101798 108141 35.366 + ID=LG12_ordered_scaffold_85:hit:469677:4.5.0.0;Name=snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1
LG12_ordered_scaffold_85 snap_masked match_part 101798 102633 35.236 ID=LG12_ordered_scaffold_85:hsp:1369290:4.5.0.0;Parent=LG12_ordered_scaffold_85:hit:469677:4.5.0.0;Target=snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1 1 836 +;Gap=M836
LG12_ordered_scaffold_85 snap_masked match_part 107907 108141 0.130 ID=LG12_ordered_scaffold_85:hsp:1369291:4.5.0.0;Parent=LG12_ordered_scaffold_85:hit:469677:4.5.0.0;Target=snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1 837 1071 +;Gap=M235

So I looked at the maker.non_overlapping_ab_initio.proteins.fasta file to see if this "abinit-gene" from the .gff file was present:

$ grep snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1 Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta
#no results using the "abinit-gene" Name from the .gff file

versus using the ID from the .tsv file to grep against the maker.non_overlapping_ab_initio.proteins.fasta file:

$ grep snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1 Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta
>snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1 protein AED:1.00 eAED:1.00 QI:0|0|0|0|1|1|2|0|356

I think the issue is that the "abinit-genes" in the .gff file are called "processed-genes" in the Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta file, and that name is then propagated into the .tsv file. Is my interpretation correct?

If so, all I would have to do is grep the .gff file replacing "processed" with "abinit", correct?

Thanks for your help.

-Valerie

> On Mar 23, 2018, at 10:20 AM, Carson Holt wrote:
>
> You will get the ID from the InterProScan report. Then you can take that ID and grep for it in the MAKER gff3.
>
> All ab initio predictions have match/match_part features created for them in the gff3. You can then take those non-gene match/match_part features and provide them to pred_gff.
>
> You then have two alternate ways to get those models into your dataset.
>
> 1. Do a second run with only that pred_gff (i.e. turn off all other MAKER options and blank out all evidence including repeat masking options) and set keep_preds=1.
>
> That will simply take the pred_gff match/match_part values and turn them into nicely formatted gene/mRNA/exon/CDS features together with associated fasta files. Those can then simply be merged into your current result using GFF3 merge.
>
> 2. Provide maker_gff (set pred_pass=0 and all other pass options to 1), provide pred_gff, and set keep_preds=1.
>
> This is the same as the previous run option, but MAKER will do the merging for you.
But it will take longer since it will use the maker_gff to rebuild all models and evidence in memory and rescore everything.
>
> --Carson
>
>> On Mar 20, 2018, at 6:48 PM, Valerie Soza wrote:
>>
>> Hi MAKER community
>>
>> I am trying to create a standard build as indicated in the Campbell et al. 2014 papers in Plant Physiology and Current Protocols in Bioinformatics. I was following the protocol as outlined in Current Protocols in Bioinformatics, but then came across this thread in the MAKER google forum: https://groups.google.com/forum/#!searchin/maker-devel/quality_filter%7Csort:date/maker-devel/97aNJkT3bgk/mpL7V5QWAAAJ.
>>
>> I can't reply to this original thread, but I am trying to follow Carson's suggestion for a standard build using this protocol instead now:
>> "One note I'd like to make, is that doing a second round with keep_preds=1 is the wrong procedure (only do that if you really want to keep everything - i.e. in some fungi or oomycetes). Rather you should use InterProScan to evaluate the rejected models in the non-overlapping.abinit.proteins.fasta file, then grep the ones that have an IPR domain out of the GFF3 (will be match/match_part features) and then pass them to pred_gff in a separate run (just updates the format to gene/mRNA/exon/CDS with proper reading frame). You can then merge the resulting GFF3's and fasta files."
>>
>> Instead of doing a second round of annotations with keep_preds=1, I am using my original annotations with keep_preds=0. I have used InterProScan on the non-overlapping.abinit.proteins.fasta. I am unclear as to what gff3 file to use to grep for genes with IPR domains from the non-overlapping.abinit.proteins.fasta file. Genes from the non-overlapping.abinit.proteins.fasta file are not in my .all.gff file created by the gff3_merge script.
>>
>> What gff3 file should I be using to resurrect proteins with IPR domains from the non-overlapping.abinit.proteins.fasta? Should I be doing an annotation with keep_preds=1 as well, and resurrecting genes with IPR domains from this gff3?
>>
>> Thanks.
>>
>> -Valerie
>>
>> Valerie Soza, Ph.D.
>> c/o Hall Lab
>> Department of Biology
>> University of Washington
>> Johnson Hall 202A
>> Box 351800
>> Seattle, WA 98195-1800
>> 206-543-6740
>> http://staff.washington.edu/vsoza/
>>
>>
>> _______________________________________________
>> maker-devel mailing list
>> maker-devel at box290.bluehost.com
>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>

Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/

From vsoza at uw.edu Tue Mar 27 10:50:38 2018
From: vsoza at uw.edu (Valerie Soza)
Date: Tue, 27 Mar 2018 09:50:38 -0700
Subject: [maker-devel] how to output masked genome from MAKER
In-Reply-To: <15BF15D1-835F-4E8E-9B52-A958EF19BCD9@gmail.com>
References: <15BF15D1-835F-4E8E-9B52-A958EF19BCD9@gmail.com>
Message-ID: <7264C880-3806-443D-9CC2-E70D4366CD8A@uw.edu>

Hi Carson

Thanks, that is simple and it worked. I did the following to sort and concatenate the query.masked.fasta files into one fasta:

$ find Rwill7.maker.output -name 'query.masked.fasta' | sort -t "/" -k 5 | xargs cat > Rwill7.maker.assembly_masked.sorted.fasta

-Valerie

> On Mar 15, 2018, at 8:31 AM, Carson Holt wrote:
>
> You will just have to find and concatenate the files yourself.
>
> Something like --> find assembly.maker.output -name 'query.masked.fasta' | xargs cat > assembly_masked.fasta
>
> --Carson
>
>
>> On Mar 7, 2018, at 2:19 PM, Valerie Soza wrote:
>>
>> Hi MAKER community
>>
>> I am wondering whether it is possible to get the entire masked genome generated by MAKER as an output. I think I have found it in pieces in the query.masked.fasta files generated for each scaffold by MAKER in theVoid directories. Is there a script that anyone has used to collate these files?
>>
>> I am aware of the fasta_merge script that comes bundled with MAKER, but it does not seem to collate the query.masked.fasta files. Could this perl script be modified to do this action?
>>
>> Thanks for any help or insights.
>>
>> -Valerie
>>
>> Valerie Soza, Ph.D.
>> c/o Hall Lab
>> Department of Biology
>> University of Washington
>> Johnson Hall 202A
>> Box 351800
>> Seattle, WA 98195-1800
>> 206-543-6740
>> http://staff.washington.edu/vsoza/
>>
>>
>> _______________________________________________
>> maker-devel mailing list
>> maker-devel at box290.bluehost.com
>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>

Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/

From vsoza at uw.edu Thu Mar 29 12:42:28 2018
From: vsoza at uw.edu (Valerie Soza)
Date: Thu, 29 Mar 2018 11:42:28 -0700
Subject: [maker-devel] clarification on creating a standard build
In-Reply-To: <57B30565-1603-4723-AF74-FEB54F735899@uw.edu>
References: <4D9CD374-D142-4387-A4D7-1FB7A8A74F48@gmail.com> <57B30565-1603-4723-AF74-FEB54F735899@uw.edu>
Message-ID: <624D4D7C-2BB3-4A1D-9088-116807492D2E@uw.edu>

Hi MAKER community,

I was having issues grepping IDs from the InterProScan .tsv file against my all.gff file because the abinit genes in the all.gff file are called processed genes in the .all.maker.non_overlapping_ab_initio.proteins.fasta file, which gets propagated into the .tsv file. I solved this issue by replacing "processed" with "abinit" in the .tsv file and then grepping the all.gff file with these IDs to create pred_gff.

sed 's/-processed-/-abinit-/g' Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv > Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed

Then I extracted only the IDs from the .tsv file to grep against the all.gff file.
cut -f 1 Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed > Rwill7.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs

I was getting non-unique top level ID errors when I tried to run maker with pred_gff, and I realized that IDs were duplicated in the .tsv file. So I went back to my list of IDs and extracted only the unique IDs to grep.

sort Rwill7.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs | uniq > Rwill7.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs

Then I redid my grepping and maker ran beautifully. I tried both the options recommended by Carson, but wound up going with option #2 because I am lazy, and now I have a standard build :)

-Valerie

Valerie Soza, Ph.D.
c/o Hall Lab Department of Biology University of Washington Johnson Hall 202A Box 351800 Seattle, WA 98195-1800 206-543-6740 http://staff.washington.edu/vsoza/ From seoanezonjic at hotmail.com Tue Mar 6 02:30:24 2018 From: seoanezonjic at hotmail.com (p sz) Date: Tue, 6 Mar 2018 09:30:24 +0000 Subject: [maker-devel] Problems with failed contigs Message-ID: Hi Currently i'm annotating a fish genome using the Maker 3 version and the execution don't finish properly. I execute Maker in a shared cluster trough MPI, asking for 96 cores (in order to not block the maker manager nor the file system) in 6 nodes of 16 cores each one. The work run 3 or 4 days and the output suddenly stops. The work remains 3 days more, but the index datastore is not modified and the only files that their attributes change are the lock files. When the job exceed the working time (7 days) it is cancelled. Then, I inspect the datastore index and count the uniq started and finished tags for each scaffold. The results are: STARTED:3890 FINISHED:3378 So, there are about 500 scaffolds in which the analysis fails. When I inspect the STDERR maker output I see this error line repeated ~2000 times, at differents sites: substr outside of string at /mnt/home/users/pab_001_uma/pedro/software/maker/bin/../lib/PhatHit_utils.pm line 850 and near this line, the following: ERROR: Failed while annotating transcripts My questions are, what can I do to annotate the failed scaffolds? and in addition, is the great amount of failed analysis the reason of the job freeze? Thanks in advance -------------- next part -------------- An HTML attachment was scrubbed... URL: From vsoza at uw.edu Wed Mar 7 14:19:15 2018 From: vsoza at uw.edu (Valerie Soza) Date: Wed, 7 Mar 2018 13:19:15 -0800 Subject: [maker-devel] how to output masked genome from MAKER Message-ID: Hi MAKER community I am wondering whether it is possible to get the entire masked genome generated by MAKER as an output. 
I think I have found it in pieces in the query.masked.fasta files generated for each scaffold by MAKER in theVoid directories. Is there a script that anyone has used to collate these files? I am aware of the fast_merge script that comes bundled with MAKER, but it does not seem to collate the query.masked.fasta files. Could this perl script be modified to do this action? Thanks for any help or insights. -Valerie Valerie Soza, Ph.D. c/o Hall Lab Department of Biology University of Washington Johnson Hall 202A Box 351800 Seattle, WA 98195-1800 206-543-6740 http://staff.washington.edu/vsoza/ From flopezo84 at gmail.com Fri Mar 9 09:15:39 2018 From: flopezo84 at gmail.com (=?UTF-8?Q?Federico_L=C3=B3pez?=) Date: Fri, 9 Mar 2018 11:15:39 -0500 Subject: [maker-devel] Using PASA assemblies with MAKER Message-ID: Hello, I was wondering what might be the recommended option for using PASA2 alignment assemblies with MAKER3: 1. PASA assemblies in FASTA format (est) 2. PASA assembly structures (est_gff) 3. ORFs from PASA assemblies (protein) And related to this question, when I use the PASA2 assembly structures in GFF3 format, MAKER reports the error below. "ERROR: Non-unique top level ID for..." I suppose all the non-unique IDs need to be renamed for MAKER? Any help is greatly appreciated. -------------- next part -------------- An HTML attachment was scrubbed... URL: From wangzhennan at ioz.ac.cn Tue Mar 13 21:53:44 2018 From: wangzhennan at ioz.ac.cn (wangzhennan at ioz.ac.cn) Date: Wed, 14 Mar 2018 11:53:44 +0800 (GMT+08:00) Subject: [maker-devel] Some transcripts have no AED? Message-ID: <59a852e2.418b6.16222a46d59.Coremail.wangzhennan@ioz.ac.cn> Hi, When I used maker to get the AED value, some transcripts have no AED values in the result. Why? I used the RNA-Seq data and protein to run maker.Thank you. Best wishes. Wang T -------------- next part -------------- An HTML attachment was scrubbed... 
URL:
From d.ence at ufl.edu Wed Mar 14 05:33:01 2018
From: d.ence at ufl.edu (Ence,daniel)
Date: Wed, 14 Mar 2018 11:33:01 +0000
Subject: [maker-devel] Some transcripts have no AED?
In-Reply-To: <59a852e2.418b6.16222a46d59.Coremail.wangzhennan@ioz.ac.cn>
References: <59a852e2.418b6.16222a46d59.Coremail.wangzhennan@ioz.ac.cn>
Message-ID: <00243144-1906-485F-B2CB-977FA2BBD161@ufl.edu>

Hi, can you send a few lines of examples? Do some transcripts have AEDs?

~Daniel

Sent from my iPhone

On Mar 13, 2018, at 23:54, "wangzhennan at ioz.ac.cn" wrote:

Hi,
When I used maker to get the AED value, some transcripts have no AED values in the result. Why? I used the RNA-Seq data and protein to run maker.Thank you.
Best wishes.
Wang T
_______________________________________________
maker-devel mailing list
maker-devel at box290.bluehost.com
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From seoanezonjic at hotmail.com Wed Mar 14 07:52:12 2018
From: seoanezonjic at hotmail.com (p sz)
Date: Wed, 14 Mar 2018 13:52:12 +0000
Subject: [maker-devel] Problems with failed contigs
In-Reply-To:
References:
Message-ID:

Hi

I have tried splitting the fasta file into small chunks and running a separate execution for each chunk. Most of the executions fail as I described previously. I have inspected the results of a random chunk with 287 contigs, and these error lines are repeated 47 times:

substr outside of string at /mnt/home/users/pab_001_uma/pedro/software/maker/bin/../lib/PhatHit_utils.pm line 850.
--> rank=15, hostname=dx095
ERROR: Failed while annotating transcripts
ERROR: Chunk failed at level:1, tier_type:4
FAILED CONTIG:Sosen1_s1284
ERROR: Chunk failed at level:6, tier_type:0
FAILED CONTIG:Sosen1_s1284

Can you help me to fix this problem?

Thank you in advance
Pedro Seoane

________________________________
From: p sz
Sent: Tuesday, March 6, 2018 9:30
To: maker-devel at yandell-lab.org
Subject: Problems with failed contigs

Hi
Currently i'm annotating a fish genome using the Maker 3 version and the execution don't finish properly. I execute Maker in a shared cluster trough MPI, asking for 96 cores (in order to not block the maker manager nor the file system) in 6 nodes of 16 cores each one. The work run 3 or 4 days and the output suddenly stops. The work remains 3 days more, but the index datastore is not modified and the only files that their attributes change are the lock files. When the job exceed the working time (7 days) it is cancelled. Then, I inspect the datastore index and count the uniq started and finished tags for each scaffold. The results are:
STARTED:3890
FINISHED:3378
So, there are about 500 scaffolds in which the analysis fails. When I inspect the STDERR maker output I see this error line repeated ~2000 times, at differents sites:
substr outside of string at /mnt/home/users/pab_001_uma/pedro/software/maker/bin/../lib/PhatHit_utils.pm line 850
and near this line, the following:
ERROR: Failed while annotating transcripts
My questions are, what can I do to annotate the failed scaffolds? and in addition, is the great amount of failed analysis the reason of the job freeze?
Thanks in advance
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From vsoza at uw.edu Wed Mar 14 18:21:26 2018 From: vsoza at uw.edu (Valerie Soza) Date: Wed, 14 Mar 2018 17:21:26 -0700 Subject: [maker-devel] scaffolds missing from master_datastore_index.log and all.gff files Message-ID: <6F2917B6-12D5-41D6-BD12-CDD7FDF0F5C8@uw.edu> Hi MAKER community I have done several rounds of training and annotations on an assembly that consists of 12027 scaffolds using the MAKER2 pipeline. I am running multiple instances of MAKER to speed up the process. I have noticed that the number of contigs (aka scaffolds) differs among the different rounds of annotations I have done in MAKER, ranging from 12024 to 12027 scaffolds. These counts were obtained with the SOBAcl tool to count the number of contigs from each all.gff file generated by the gff3_merge script included in MAKER. In my latest round of annotations within MAKER, I have only obtained 12026 scaffolds using the SOBAcl tool on the all.gff file, indicating that I am missing 1 scaffold, even though there was no indication of any scaffolds as FAILED, RETRY, or SKIPPED. To figure out what might be going on, I searched for STARTED and FINISHED scaffolds in the master_datastore_index.log and found that I had a different number of started vs. finished scaffolds, and none of these were equal to the total in the assembly of 12027. $ grep STARTED Rwill7_master_datastore_index.log | sort | uniq | cut -f 1 | wc 12024 12024 313247 3 started scaffolds missing from this file are LG01_ordered_scaffold_2, LG01_ordered_scaffold_3, and LG07_unordered_scaffold_86. $ grep FINISHED Rwill7_master_datastore_index.log | sort | uniq | cut -f 1 | wc 12026 12026 313295 1 finished scaffold missing from this file is LG08_unordered_scaffold_90. I then searched for these scaffolds in the all.gff file and found that the 3 missing started scaffolds were present, but the one missing finished scaffold (LG08_unordered_scaffold_90) was not. 
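A minimal sketch of one way to list exactly which scaffolds differ between the two sets, rather than comparing counts. It assumes, as the `cut -f 1` in the commands above implies, that the scaffold name is the first tab-separated column of the index log and that the status word is on the same line. The function name and the temporary files are illustrative only.

```shell
# Hypothetical helper: print scaffolds that have a STARTED entry but no
# FINISHED entry in a MAKER master datastore index log.
# Assumes column 1 (tab-separated) holds the scaffold name.
started_not_finished() {
    log="$1"
    tmpdir=$(mktemp -d) || return 1
    # One sorted, de-duplicated list of scaffold names per status
    grep STARTED  "$log" | cut -f 1 | sort -u > "$tmpdir/started"
    grep FINISHED "$log" | cut -f 1 | sort -u > "$tmpdir/finished"
    # comm -23 keeps lines that appear only in the first sorted file
    comm -23 "$tmpdir/started" "$tmpdir/finished"
    rm -rf "$tmpdir"
}
```

Called as `started_not_finished Rwill7_master_datastore_index.log`, this would print one scaffold name per line. Using `comm -13` in place of `comm -23` lists the reverse case, scaffolds marked FINISHED with no STARTED entry.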
This scaffold (LG08_unordered_scaffold_90) should be in the gff3 as it had some repeat masking done on it as indicated by the query.masked.fasta file for this scaffold in its theVoid directory.

After looking at the gff3_merge and fasta_merge scripts, it seems that only finished scaffolds are used to generate gff3 and fasta files, so this explains why I am missing one scaffold (LG08_unordered_scaffold_90) for a total of only 12026 scaffolds in the all.gff file.

I am concerned that, because the started and finished scaffolds are different in the master_datastore_index.log, not all scaffolds are being output to the gff3 and fasta files generated by the MAKER scripts.

Any insights as to why I am getting different numbers of scaffolds indicated as started versus finished? And as to why all but 1 scaffold finished?

Thanks.

-Valerie

Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/

From d.ence at ufl.edu Thu Mar 15 07:15:00 2018
From: d.ence at ufl.edu (Ence,daniel)
Date: Thu, 15 Mar 2018 13:15:00 +0000
Subject: [maker-devel] Fwd: Some transcripts have no AED?
References: <8E389D32-28DE-4FE7-A455-15786C1B2EAF@mail.ufl.edu>
Message-ID: <8AE0B9E6-2496-43AE-BC79-9FDDD2A0CFD3@mail.ufl.edu>

Begin forwarded message:

From: "Ence,daniel"
Subject: Re: [maker-devel] Some transcripts have no AED?
Date: March 15, 2018 at 9:06:45 AM EDT
To: "wangzhennan at ioz.ac.cn"
Cc: "Ence,daniel"

Hi, I really can't help you without more information to answer either of your questions (number of genes or the transcripts without AEDs). If you ran maker with some evidence to annotate those 23000 genes, then some of those genes were probably not supported by the evidence. Did you give those 23000 genes as predictions or as models? Those are two different options in maker and would give different results.
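One way to check which of the two was used is to look at the control file for the run. A small sketch, assuming the run used the standard maker_opts.ctl with its `pred_gff` and `model_gff` entries; the function name is hypothetical.

```shell
# Hypothetical helper: show the pred_gff / model_gff lines of a MAKER
# control file (defaults to maker_opts.ctl in the current directory).
show_gff_inputs() {
    grep -E '^(pred_gff|model_gff)=' "${1:-maker_opts.ctl}"
}
```

A file listed under `model_gff` is passed in as gene models, while one listed under `pred_gff` is treated as ab initio predictions, so whichever line has a path after the `=` shows how the existing annotation was supplied.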
~Daniel On Mar 15, 2018, at 4:11 AM, wangzhennan at ioz.ac.cn wrote: Hi, I am so sorry that I can not expound my question for you. I run maker with a annotation file(gff) which has 23000 genes, but I got a result with only 21553 genes. Why there were fewer genes than original file? Thank you very much! Best wishes. Wang -----Original Messages----- From:"Ence,daniel" > Sent Time:2018-03-14 19:33:01 (Wednesday) To: "wangzhennan at ioz.ac.cn" > Cc: "maker-devel at yandell-lab.org" > Subject: Re: [maker-devel] Some transcripts have no AED? Hi, can you send a few lines of examples? Do some transcripts do have AEDs? ~Daniel Sent from my iPhone On Mar 13, 2018, at 23:54, "wangzhennan at ioz.ac.cn" > wrote: Hi, When I used maker to get the AED value, some transcripts have no AED values in the result. Why? I used the RNA-Seq data and protein to run maker.Thank you. Best wishes. Wang T _______________________________________________ maker-devel mailing list maker-devel at box290.bluehost.com https://urldefense.proofpoint.com/v2/url?u=http-3A__box290.bluehost.com_mailman_listinfo_maker-2Ddevel-5Fyandell-2Dlab.org&d=DwICAg&c=pZJPUDQ3SB9JplYbifm4nt2lEVG5pWx2KikqINpWlZM&r=12jzlNvGVD0AlPJ4E7cTlw1Dvu6n9cb4kMCobJ28XPs&m=iQmWCVSvETCb_7VoClUCgmdyk2786LmVRNPvJUlyfbU&s=R9lycx5lwMU4QaJuAZevpAdW8qF801qKJl98hBHq4IQ&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Thu Mar 15 08:57:37 2018 From: carsonhh at gmail.com (Carson Holt) Date: Thu, 15 Mar 2018 08:57:37 -0600 Subject: [maker-devel] Problems with failed contigs In-Reply-To: References: Message-ID: <2F06777F-4BED-4CC8-B4F6-0B78D3EA9405@gmail.com> First make sure you are using the most current version (3.01.02). If you are providing a GFF3 issue, it may be an issue with one or more of the features you are giving. 
Finally, there is a possibility this is a BLAST report issue, as there is a BLAST truncation bug that appears to be fixed in NCBI BLAST 2.7.1+.

--Carson

> On Mar 6, 2018, at 2:30 AM, p sz wrote:
>
> Hi
> Currently i'm annotating a fish genome using the Maker 3 version and the execution don't finish properly. I execute Maker in a shared cluster trough MPI, asking for 96 cores (in order to not block the maker manager nor the file system) in 6 nodes of 16 cores each one. The work run 3 or 4 days and the output suddenly stops. The work remains 3 days more, but the index datastore is not modified and the only files that their attributes change are the lock files. When the job exceed the working time (7 days) it is cancelled. Then, I inspect the datastore index and count the uniq started and finished tags for each scaffold. The results are:
> STARTED:3890
> FINISHED:3378
> So, there are about 500 scaffolds in which the analysis fails. When I inspect the STDERR maker output I see this error line repeated ~2000 times, at differents sites:
> substr outside of string at /mnt/home/users/pab_001_uma/pedro/software/maker/bin/../lib/PhatHit_utils.pm line 850
> and near this line, the following:
> ERROR: Failed while annotating transcripts
> My questions are, what can I do to annotate the failed scaffolds? and in addition, is the great amount of failed analysis the reason of the job freeze?
> Thanks in advance

-------------- next part --------------
An HTML attachment was scrubbed...
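
The STARTED/FINISHED comparison discussed in this thread can be sketched as a small shell function. This is only an illustration under stated assumptions: it assumes the master datastore index is a tab-separated file with the sequence ID in column 1 and the run status in column 3, and the function name and the file name in the usage comment are made up for the example. A scaffold listed here may have failed, may still have been running when the job was cancelled, or may simply lack a FINISHED entry in the index.

```shell
# List sequence IDs that have a STARTED entry but no FINISHED entry.
# Assumption: tab-separated index, column 1 = sequence ID, column 3 = status.
unfinished_scaffolds() {
    awk -F '\t' '
        $3 == "STARTED"  { started[$1] = 1 }
        $3 == "FINISHED" { finished[$1] = 1 }
        END {
            for (id in started)
                if (!(id in finished))
                    print id
        }
    ' "$1" | sort
}

# Usage (hypothetical file name):
#   unfinished_scaffolds genome_master_datastore_index.log
```

Collecting the IDs in two arrays keeps repeated STARTED lines for the same scaffold from being counted twice, which is what the sort | uniq step in a plain grep pipeline is there to handle.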
From carsonhh at gmail.com Thu Mar 15 09:15:09 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 15 Mar 2018 09:15:09 -0600
Subject: [maker-devel] Using PASA assemblies with MAKER
In-Reply-To:
References:
Message-ID: <8542B83E-159A-47CA-A801-364386E0E2D2@gmail.com>

MAKER requires the two-level match/match_part format, an example of which can be found in the GFF3 specification:
https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md

I haven't used PASA, but the file you are providing may be in GFF2 or GTF format, which is not compatible with GFF3.

--Carson

> On Mar 9, 2018, at 9:15 AM, Federico López wrote:
>
> Hello,
>
> I was wondering what might be the recommended option for using PASA2 alignment assemblies with MAKER3:
>
> 1. PASA assemblies in FASTA format (est)
> 2. PASA assembly structures (est_gff)
> 3. ORFs from PASA assemblies (protein)
>
> And related to this question, when I use the PASA2 assembly structures in GFF3 format, MAKER reports the error below.
>
> "ERROR: Non-unique top level ID for..."
>
> I suppose all the non-unique IDs need to be renamed for MAKER?
>
> Any help is greatly appreciated.

From carsonhh at gmail.com Thu Mar 15 09:20:08 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 15 Mar 2018 09:20:08 -0600
Subject: [maker-devel] Some transcripts have no AED?
In-Reply-To: <59a852e2.418b6.16222a46d59.Coremail.wangzhennan@ioz.ac.cn>
References: <59a852e2.418b6.16222a46d59.Coremail.wangzhennan@ioz.ac.cn>
Message-ID:

Do they have no AED, or an AED value of 0? A value of 0 means a perfect match to the evidence. Also make sure you are not looking for AED in the predictions (i.e. features whose GFF3 source column is snap/augustus and whose type is match/match_part).
Those are rejected models and will not have quality statistics calculated for them (they are only there as alignments for reference purposes). Only models with a source column of "maker" and a type of gene/mRNA/exon/CDS represent the gene models and will have AED and QI values.

You can force MAKER to also calculate quality statistics for rejected models by altering pred_stats in the maker_opts.ctl file. The line is shown here with its default value of 0; set it to 1 to turn the option on.

pred_stats=0 #report AED and QI statistics for all predictions as well as models

--Carson

> On Mar 13, 2018, at 9:53 PM, wangzhennan at ioz.ac.cn wrote:
>
> Hi,
>
> When I used maker to get the AED value, some transcripts have no AED values in the result. Why? I used the RNA-Seq data and protein to run maker. Thank you.
>
> Best wishes.
>
> Wang

From carsonhh at gmail.com Thu Mar 15 09:26:26 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 15 Mar 2018 09:26:26 -0600
Subject: [maker-devel] scaffolds missing from master_datastore_index.log and all.gff files
In-Reply-To: <6F2917B6-12D5-41D6-BD12-CDD7FDF0F5C8@uw.edu>
References: <6F2917B6-12D5-41D6-BD12-CDD7FDF0F5C8@uw.edu>
Message-ID:

If you are running multiple jobs of MAKER at the same time, you can hit a race condition where one MAKER run keeps another from making the correct entry in the datastore log.

You can delete the log and then run "maker -dsindex" to rebuild it with a single maker process (takes less than 5 minutes).

--Carson

> On Mar 14, 2018, at 6:21 PM, Valerie Soza wrote:
>
> Hi MAKER community
>
> I have done several rounds of training and annotations on an assembly that consists of 12027 scaffolds using the MAKER2 pipeline. I am running multiple instances of MAKER to speed up the process.
> I have noticed that the number of contigs (aka scaffolds) differs among the different rounds of annotations I have done in MAKER, ranging from 12024 to 12027 scaffolds. These counts were obtained with the SOBAcl tool to count the number of contigs from each all.gff file generated by the gff3_merge script included in MAKER.
>
> In my latest round of annotations within MAKER, I have only obtained 12026 scaffolds using the SOBAcl tool on the all.gff file, indicating that I am missing 1 scaffold, even though there was no indication of any scaffolds as FAILED, RETRY, or SKIPPED.
>
> To figure out what might be going on, I searched for STARTED and FINISHED scaffolds in the master_datastore_index.log and found that I had a different number of started vs. finished scaffolds, and none of these were equal to the total in the assembly of 12027.
>
> $ grep STARTED Rwill7_master_datastore_index.log | sort | uniq | cut -f 1 | wc
> 12024 12024 313247
>
> 3 started scaffolds missing from this file are LG01_ordered_scaffold_2, LG01_ordered_scaffold_3, and LG07_unordered_scaffold_86.
>
> $ grep FINISHED Rwill7_master_datastore_index.log | sort | uniq | cut -f 1 | wc
> 12026 12026 313295
>
> 1 finished scaffold missing from this file is LG08_unordered_scaffold_90.
>
> I then searched for these scaffolds in the all.gff file and found that the 3 missing started scaffolds were present, but the one missing finished scaffold (LG08_unordered_scaffold_90) was not. This scaffold (LG08_unordered_scaffold_90) should be in the gff3 as it had some repeat masking done on it as indicated by the query.masked.fasta file for this scaffold in its theVoid directory.
>
> After looking at the gff3_merge and fasta_merge scripts, it seems that only finished scaffolds are used to generate gff3 and fasta files so this explains why I am missing one scaffold (LG08_unordered_scaffold_90) for a total of only 12026 scaffolds in the all.gff file.
>
> I am concerned that because the started and finished scaffolds are different in the master_datastore_index.log, that not all scaffolds are being output to the gff3 and fasta files generated by the MAKER scripts.
>
> Any insights as to why I am getting a different numbers of scaffolds indicated as started versus finished? and as to why all but 1 scaffold finished?
>
> Thanks.
>
> -Valerie
>
> Valerie Soza, Ph.D.
> c/o Hall Lab
> Department of Biology
> University of Washington
> Johnson Hall 202A
> Box 351800
> Seattle, WA 98195-1800
> 206-543-6740
> http://staff.washington.edu/vsoza/

From carsonhh at gmail.com Thu Mar 15 09:31:31 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 15 Mar 2018 09:31:31 -0600
Subject: [maker-devel] how to output masked genome from MAKER
In-Reply-To:
References:
Message-ID: <15BF15D1-835F-4E8E-9B52-A958EF19BCD9@gmail.com>

You will just have to find and concatenate the files yourself. Something like:

find assembly.maker.output -name 'query.masked.fasta' | xargs cat > assembly_masked.fasta

--Carson

> On Mar 7, 2018, at 2:19 PM, Valerie Soza wrote:
>
> Hi MAKER community
>
> I am wondering whether it is possible to get the entire masked genome generated by MAKER as an output. I think I have found it in pieces in the query.masked.fasta files generated for each scaffold by MAKER in theVoid directories. Is there a script that anyone has used to collate these files?
>
> I am aware of the fast_merge script that comes bundled with MAKER, but it does not seem to collate the query.masked.fasta files. Could this perl script be modified to do this action?
>
> Thanks for any help or insights.
>
> -Valerie
>
> Valerie Soza, Ph.D.
> c/o Hall Lab
> Department of Biology
> University of Washington
> Johnson Hall 202A
> Box 351800
> Seattle, WA 98195-1800
> 206-543-6740
> http://staff.washington.edu/vsoza/

From vsoza at uw.edu Thu Mar 15 12:18:46 2018
From: vsoza at uw.edu (Valerie Soza)
Date: Thu, 15 Mar 2018 11:18:46 -0700
Subject: [maker-devel] scaffolds missing from master_datastore_index.log and all.gff files
In-Reply-To:
References: <6F2917B6-12D5-41D6-BD12-CDD7FDF0F5C8@uw.edu>
Message-ID: <53B85802-8AB9-4DB4-AA8A-137DD72AF4D7@uw.edu>

Thanks, Carson, that worked. I am now getting all scaffolds in the assembly indicated as finished in the master_datastore_index.log, and when I redo gff3_merge with this updated log, my all.gff file is complete too. Glad there is a quick fix for this in MAKER. Thank you, MAKER developers.

-Valerie

> On Mar 15, 2018, at 8:26 AM, Carson Holt wrote:
>
> If running multiple jobs of MAKER at the same time, you can hit a race condition where one MAKER run keeps another from making the correct entry into the datastore log.
>
> You can delete the log and then run "maker -dsindex" to rebuild it with a single maker process (takes less than 5 minutes).
>
> --Carson
>
>> On Mar 14, 2018, at 6:21 PM, Valerie Soza wrote:
>>
>> Hi MAKER community
>>
>> I have done several rounds of training and annotations on an assembly that consists of 12027 scaffolds using the MAKER2 pipeline. I am running multiple instances of MAKER to speed up the process. I have noticed that the number of contigs (aka scaffolds) differs among the different rounds of annotations I have done in MAKER, ranging from 12024 to 12027 scaffolds. These counts were obtained with the SOBAcl tool to count the number of contigs from each all.gff file generated by the gff3_merge script included in MAKER.
>>
>> Thanks.
>>
>> -Valerie
>>
>> Valerie Soza, Ph.D.
>> c/o Hall Lab
>> Department of Biology
>> University of Washington
>> Johnson Hall 202A
>> Box 351800
>> Seattle, WA 98195-1800
>> 206-543-6740
>> http://staff.washington.edu/vsoza/

Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/

From seoanezonjic at hotmail.com Fri Mar 16 03:33:28 2018
From: seoanezonjic at hotmail.com (p sz)
Date: Fri, 16 Mar 2018 09:33:28 +0000
Subject: [maker-devel] Problems with failed contigs
In-Reply-To: <2F06777F-4BED-4CC8-B4F6-0B78D3EA9405@gmail.com>
References: , <2F06777F-4BED-4CC8-B4F6-0B78D3EA9405@gmail.com>
Message-ID:

I checked the maker version and it is 3.01.02, as you suggested. I am currently running maker to train SNAP (first round), so I'm not using gff files as input. My input files are all in fasta format, set in the genome, est and protein variables. In fact, one error line says that the transcript analysis breaks the execution. I use BLAST version 2.2.30+; maybe I have to use a newer version. Which version do you suggest?

Thank you in advance

________________________________
From: Carson Holt
Sent: Thursday, March 15, 2018 2:57:37 PM
To: p sz
Cc: maker-devel at yandell-lab.org
Subject: Re: [maker-devel] Problems with failed contigs

First make sure you are using the most current version (3.01.02). If you are providing a GFF3 file, it may be an issue with one or more of the features you are giving.
Finally, there is a possibility this is a BLAST report issue, as there is a BLAST truncation bug that appears to be fixed in NCBI BLAST 2.7.1+.

--Carson

From vsoza at uw.edu Tue Mar 20 18:48:09 2018
From: vsoza at uw.edu (Valerie Soza)
Date: Tue, 20 Mar 2018 17:48:09 -0700
Subject: [maker-devel] clarification on creating a standard build
Message-ID:

Hi MAKER community

I am trying to create a standard build as indicated in the Campbell et al.
2014 papers in Plant Physiology and Current Protocols in Bioinformatics. I was following the protocol as outlined in Current Protocols in Bioinformatics, but then came across this thread in the MAKER google forum: https://groups.google.com/forum/#!searchin/maker-devel/quality_filter%7Csort:date/maker-devel/97aNJkT3bgk/mpL7V5QWAAAJ.

I can't reply to this original thread, but I am trying to follow Carson's suggestion for a standard build using this protocol instead now:

"One note I'd like to make, is that doing a second round with keep_preds=1 is the wrong procedure (only do that if you really want to keep everything - i.e. in some fungi or oomycetes). Rather you should use InterProScan to evaluate the rejected models in the non-overlapping.abinit.proteins.fasta file, then grep the ones that have an IPR domain out of the GFF3 (will be match/match_part features) and then pass them to pred_gff in a separate run (just updates the format to gene/mRNA/exon/CDS with proper reading frame). You can then merge the resulting GFF3's and fasta files."

Instead of doing a second round of annotations with keep_preds=1, I am using my original annotations with keep_preds=0. I have used InterProScan on the non-overlapping.abinit.proteins.fasta. I am unclear as to what gff3 file to use to grep for genes with IPR domains from the non-overlapping.abinit.proteins.fasta file. Genes from the non-overlapping.abinit.proteins.fasta file are not in my .all.gff file created by the gff3_merge script.

What gff3 file should I be using to resurrect proteins with IPR domains from the non-overlapping.abinit.proteins.fasta? Should I be doing an annotation with keep_preds=1 as well, and resurrecting genes with IPR domains from this gff3?

Thanks.

-Valerie

Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/

From urmi208 at gmail.com Wed Mar 21 03:05:42 2018
From: urmi208 at gmail.com (Urmi)
Date: Wed, 21 Mar 2018 09:05:42 +0000
Subject: [maker-devel] Gene loss in subsequent round of maker for fungal genome annotation
Message-ID:

Hello maker community,

I am trying to run maker 3.01.02-beta on a fungal genome. I am using available EST and protein sequences from a different strain of the same species, using the parameters "est" and "protein" in the maker_opts.ctl file. Here is the protocol I am using:

1. Run maker with repeat masking, providing transcript and protein sequences from related species (run A)
2. Create SNAP model with CEGMA
3. Train Augustus with BUSCO
4. Run maker (run B) with the new SNAP model (from step 2) and Augustus species, with est2genome=0 and protein2genome=0, providing gff files (altest_gff=runA_cdna2genome.gff, protein_gff=runA_protein2genome.gff3)
5. Create SNAP model from run B
6. Train Augustus with transcripts from run B and BUSCO
7. Run maker (run C) with the new SNAP model (from step 5) and Augustus species, with est2genome=0 and protein2genome=0, providing gff files (altest_gff=runA_cdna2genome.gff, protein_gff=runA_protein2genome.gff3), and keep_preds=1

As a result of this, I get the following gene numbers:

- run A: 12796 total genes, of which 12771 have AED < 0.5
- run B: 10713 total genes, of which 10701 have AED < 0.5
- run C: 12651 total genes, of which 12582 have AED < 0.5

Looking at the gff files in detail, I observed that some gene models in run A are lost in run B and regained in run C. I don't understand why there is gene loss in run B. Here is an example:

RunA:
contig1 maker gene 20468 21193 . + . ID=maker-contig1-exonerate_protein2genome-gene-0.34;Name=maker-contig1-exonerate_protein2genome-gene-0.34
contig1 maker mRNA 20468 21193 100 + . ID=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1;Parent=maker-contig1-exonerate_protein2genome-gene-0.34;Name=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1;_AED=0.30;_eAED=0.30;_QI=0|-1|0|1|-1|0|1|0|241
contig1 maker exon 20468 21193 . + . ID=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1:1;Parent=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1
contig1 maker CDS 20468 21193 . + 0 ID=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1:cds;Parent=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1
contig1 blastn expressed_sequence_match 20468 21193 726 + . ID=contig1:hit:983:3.2.0.0;Name=jgi|test_1|140804|est target_length=726
contig1 blastn match_part 20468 21193 726 + . ID=contig1:hsp:998:3.2.0.0;Parent=contig1:hit:983:3.2.0.0;Target=jgi|test_1|140804|est 1 726 +;Gap=M726
contig1 est2genome expressed_sequence_match 20468 21193 3630 + . ID=contig1:hit:1022:3.2.0.0;Name=jgi|test_1|140804|est;target_length=726;aligned_coverage=100;aligned_identity=100
contig1 est2genome match_part 20468 21193 3630 + . ID=contig1:hsp:1110:3.2.0.0;Parent=contig1:hit:1022:3.2.0.0;Target=jgi|test_1|140804|est 1 726 +;Gap=M726

RunB:
contig1 est_gff:est2genome expressed_sequence_match 20468 21193 3630 + . ID=contig1:hit:1051:3.12.0.0;Name=jgi|test_1|140804|est;target_length=726;aligned_coverage=100;aligned_identity=100;aligned_coverage=100;aligned_identity=100;score=3630;target_length=726
contig1 est_gff:est2genome match_part 20468 21193 3630 + . ID=contig1:hsp:1166:3.12.0.0;Parent=contig1:hit:1051:3.12.0.0;Target=jgi|test_1|140804|est 1 726 +;Gap=M726

RunC:
contig1 maker gene 20468 21193 . + . ID=snap_masked-contig1-processed-gene-0.5;Name=snap_masked-contig1-processed-gene-0.5
contig1 maker mRNA 20468 21193 . + . ID=snap_masked-contig1-processed-gene-0.5-mRNA-1;Parent=snap_masked-contig1-processed-gene-0.5;Name=snap_masked-contig1-processed-gene-0.5-mRNA-1;_AED=0.30;_eAED=0.30;_QI=0|-1|0|1|-1|1|1|0|241;_merge_warning=1
contig1 maker exon 20468 21193 . + . ID=snap_masked-contig1-processed-gene-0.5-mRNA-1:1;Parent=snap_masked-contig1-processed-gene-0.5-mRNA-1
contig1 maker CDS 20468 21193 . + 0 ID=snap_masked-contig1-processed-gene-0.5-mRNA-1:cds;Parent=snap_masked-contig1-processed-gene-0.5-mRNA-1
contig1 snap_masked match 20468 21193 42.956 + . ID=contig1:hit:5240:4.5.0.0;Name=snap_masked-contig1-abinit-gene-0.5-mRNA-1;target_length=4075195
contig1 snap_masked match_part 20468 21193 42.956 + . ID=contig1:hsp:12911:4.5.0.0;Parent=contig1:hit:5240:4.5.0.0;Target=snap_masked-contig1-abinit-gene-0.5-mRNA-1 1 726 +;Gap=M726
contig1 est_gff:est2genome expressed_sequence_match 20468 21193 3630 + . ID=contig1:hit:1051:3.12.0.0;Name=jgi|test_1|140804|est;target_length=726;aligned_coverage=100;aligned_identity=100;aligned_coverage=100;aligned_identity=100;score=3630;target_length=726
contig1 est_gff:est2genome match_part 20468 21193 3630 + . ID=contig1:hsp:1166:3.12.0.0;Parent=contig1:hit:1051:3.12.0.0;Target=jgi|test_1|140804|est 1 726 +;Gap=M726

Could anyone please shed some light on this?

Many thanks in advance.

Urmi

From urmi208 at gmail.com Wed Mar 21 03:24:32 2018
From: urmi208 at gmail.com (Urmi)
Date: Wed, 21 Mar 2018 09:24:32 +0000
Subject: [maker-devel] Gene loss in subsequent round of maker for fungal genome annotation
In-Reply-To:
References:
Message-ID:

Further to this, I ran InterProScan on all three runs, and 100% of the genes from all of them have protein domains.
I am confused about which one I should consider the best annotation. I am sorry for so many questions, but I am very new to maker.

Thanks again for any help you could provide.

On Wed, Mar 21, 2018 at 9:05 AM, Urmi wrote:

> Hello maker community,
>
> I am trying to run maker 3.01.02-beta on a fungal genome. I am using available EST and protein sequences from a different strain of the same species using parameters "est" and "protein" in the maker_opts.ctl file.
>
> As a result of this, I get following gene numbers:
>
> - run A: 12796 total genes out of which 12771 have AED < 0.5
> - run B: 10713 total genes out of which 10701 have AED < 0.5
> - run C: 12651 total genes out of which 12582 have AED < 0.5
>
> Looking at the gff files in detail, it is observed that there are some gene models in run A which are lost in run B and gain in run C. I don't understand why there is gene loss for run B.
>
> Please could anyone shed some light on this?
>
> Many thanks in advance.
>
> Urmi

--
"The only way of finding the limits of the possible is by going beyond them into the impossible." - Arthur C. Clarke

-------------- next part --------------
An HTML attachment was scrubbed...
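
Counts like the "AED < 0.5" figures in this thread can be approximated straight from a merged GFF3 file. The sketch below is an illustration, not a MAKER tool: it assumes a standard nine-column, tab-separated GFF3 in which final models carry the source "maker" and the _AED attribute sits on mRNA features (as in the excerpt posted in this thread), so it counts transcripts rather than genes. The function name and the file name in the usage comment are hypothetical.

```shell
# Print "<total> <below>": the number of maker mRNA features that carry an
# _AED attribute, and how many of those have _AED below the cutoff.
# Arguments: $1 = GFF3 file, $2 = cutoff (optional, defaults to 0.5).
# Assumption: tab-separated GFF3 with attributes in column 9.
count_aed() {
    awk -F '\t' -v cutoff="${2:-0.5}" '
        $2 == "maker" && $3 == "mRNA" {
            n = split($9, attrs, ";")
            for (i = 1; i <= n; i++) {
                # Match _AED only; _eAED starts with "_e" and is skipped.
                if (attrs[i] ~ /^_AED=/) {
                    total++
                    aed = substr(attrs[i], 6) + 0
                    if (aed < cutoff + 0) below++
                }
            }
        }
        END { print total + 0, below + 0 }
    ' "$1"
}

# Usage (hypothetical file name):
#   count_aed runA.all.gff 0.5
```

Restricting the count to source "maker" follows the point made elsewhere on this list that match/match_part features from the predictors do not represent final gene models.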
From carsonhh at gmail.com Fri Mar 23 11:20:22 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 23 Mar 2018 11:20:22 -0600
Subject: [maker-devel] clarification on creating a standard build
In-Reply-To:
References:
Message-ID: <4D9CD374-D142-4387-A4D7-1FB7A8A74F48@gmail.com>

You will get the ID from the InterProScan report. Then you can take that ID and grep for it in the MAKER gff3. All ab initio predictions have match/match_part features created for them in the gff3. You can then take those non-gene match/match_part features and provide them to pred_gff.

You then have two alternate ways to get those models into your dataset.

1. Do a second run with only that pred_gff (i.e. turn off all other MAKER options and blank out all evidence, including repeat masking options) and set keep_preds=1. That will simply take the pred_gff match/match_part features and turn them into nicely formatted gene/mRNA/exon/CDS features together with associated fasta files. Those can then simply be merged into your current result using gff3_merge.

2. Provide maker_gff (set pred_pass=0 and all other pass options to 1), provide pred_gff, and set keep_preds=1. This is the same as the previous option, but MAKER will do the merging for you. It will take longer, though, since it will use the maker_gff to rebuild all models and evidence in memory and rescore everything.

--Carson

> On Mar 20, 2018, at 6:48 PM, Valerie Soza wrote:
>
> Hi MAKER community
>
> I am trying to create a standard build as indicated in the Campbell et al. 2014 papers in Plant Physiology and Current Protocols in Bioinformatics. I was following the protocol as outlined in Current Protocols in Bioinformatics, but then came across this thread in the MAKER google forum: https://groups.google.com/forum/#!searchin/maker-devel/quality_filter%7Csort:date/maker-devel/97aNJkT3bgk/mpL7V5QWAAAJ.
>
> I can't reply to this original thread, but I am trying to follow Carson's suggestion for a standard build using this protocol instead now:
> "One note I'd like to make, is that doing a second round with keep_preds=1 is the wrong procedure (only do that if you really want to keep everything - i.e. in some fungi or oomycetes). Rather you should use InterProScan to evaluate the rejected models in the non-overlapping.abinit.proteins.fasta file, then grep the ones that have an IPR domain out of the GFF3 (will be match/match_part features) and then pass them to pred_gff in a separate run (just updates the format to gene/mRNA/exon/CDS with proper reading frame). You can then merge the resulting GFF3's and fasta files."
>
> Instead of doing a second round of annotations with keep_preds=1, I am using my original annotations with keep_preds=0. I have used InterProScan on the non-overlapping.abinit.proteins.fasta. I am unclear as to what gff3 file to use to grep for genes with IPR domains from the non-overlapping.abinit.proteins.fasta file. Genes from the non-overlapping.abinit.proteins.fasta file are not in my .all.gff file created by the gff3_merge script.
>
> What gff3 file should I be using to resurrect proteins with IPR domains from the non-overlapping.abinit.proteins.fasta? Should I be doing an annotation with keep_preds=1 as well, and resurrecting genes with IPR domains from this gff3?
>
> Thanks.
>
> -Valerie
>
> Valerie Soza, Ph.D.
> c/o Hall Lab
> Department of Biology
> University of Washington
> Johnson Hall 202A
> Box 351800
> Seattle, WA 98195-1800
> 206-543-6740
> http://staff.washington.edu/vsoza/
>
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From carsonhh at gmail.com Fri Mar 23 11:28:50 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 23 Mar 2018 11:28:50 -0600
Subject: [maker-devel] Gene loss in subsequent round of maker for fungal genome annotation
In-Reply-To:
References:
Message-ID: <7BA23B5D-01FE-4766-B7B8-756E0385453D@gmail.com>

Run A -> no gene prediction, just cut and paste of transcript/protein alignments to generate rough models.
Run B -> Gene predictions based on training using only a highly conserved subset of genes (you will have low sensitivity).
Run C -> Gene predictions based on training using a broader gene set. Higher sensitivity but potentially lower specificity (sensitivity gains should outweigh any specificity loss).

Finally, make sure you look at the models in a browser to see how well evidence and models overlap. If gene fusion is an issue (falsely merged mRNA-seq assembly results will generate hints that can cause gene predictors to fuse gene models), try deFusion -> https://wjidea.github.io/defusion/installation.html

--Carson

> On Mar 21, 2018, at 3:05 AM, Urmi wrote:
>
> Hello maker community,
>
> I am trying to run maker 3.01.02-beta on a fungal genome. I am using available EST and protein sequences from a different strain of the same species using parameters "est" and "protein" in the maker_opts.ctl file.
> Here is the protocol I am using:
>
> 1. Run maker with repeat masking and providing transcript and protein sequences from related species (Run A)
> 2. Create SNAP model with CEGMA
> 3. Train Augustus with BUSCO
> 4. Run (run B) with the new SNAP (done at step 2) and augustus species with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (altest_gff=runA_cdna2genome.gff, protein_gff=runA_protein2genome.gff3)
> 5. Create SNAP model from run B.
> 6. Train Augustus with transcripts from run B and BUSCO
> 7. Run (run C) with the new SNAP (done at step 5) and augustus species with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (altest_gff=runA_cdna2genome.gff, protein_gff=runA_protein2genome.gff3), keep_preds=1
>
> As a result of this, I get the following gene numbers:
>
> run A: 12796 total genes out of which 12771 have AED < 0.5
> run B: 10713 total genes out of which 10701 have AED < 0.5
> run C: 12651 total genes out of which 12582 have AED < 0.5
>
> Looking at the gff files in detail, it is observed that there are some gene models in run A which are lost in run B and regained in run C. I don't understand why there is gene loss for run B. Here is an example:
>
> RunA
>
> contig1 maker gene 20468 21193 . + . ID=maker-contig1-exonerate_protein2genome-gene-0.34;Name=maker-contig1-exonerate_protein2genome-gene-0.34
> contig1 maker mRNA 20468 21193 100 + . ID=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1;Parent=maker-contig1-exonerate_protein2genome-gene-0.34;Name=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1;_AED=0.30;_eAED=0.30;_QI=0|-1|0|1|-1|0|1|0|241
> contig1 maker exon 20468 21193 . + . ID=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1:1;Parent=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1
> contig1 maker CDS 20468 21193 .
+ 0 ID=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1:cds;Parent=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1 > contig1 blastn expressed_sequence_match 20468 21193 726 + . ID=contig1:hit:983:3.2.0.0;Name=jgi|test_1|140804|est target_length=726 > contig1 blastn match_part 20468 21193 726 + . ID=contig1:hsp:998:3.2.0.0;Parent=contig1:hit:983:3.2.0.0;Target=jgi|test_1|140804|est 1 726 +;Gap=M726 > contig1 est2genome expressed_sequence_match 20468 21193 3630 + . ID=contig1:hit:1022:3.2.0.0;Name=jgi|test_1|140804|est;target_length=726;aligned_coverage=100;aligned_identity=100 > contig1 est2genome match_part 20468 21193 3630 + . ID=contig1:hsp:1110:3.2.0.0;Parent=contig1:hit:1022:3.2.0.0;Target=jgi|test_1|140804|est 1 726 +;Gap=M726 > > RunB: > contig1 est_gff:est2genome expressed_sequence_match 20468 21193 3630 + . ID=contig1:hit:1051:3.12.0.0;Name=jgi|test_1|140804|est;target_length=726;aligned_coverage=100;aligned_identity=100;aligned_coverage=100;aligned_identity=100;score=3630;target_length=726 > contig1 est_gff:est2genome match_part 20468 21193 3630 + . ID=contig1:hsp:1166:3.12.0.0;Parent=contig1:hit:1051:3.12.0.0;Target=jgi|test_1|140804|est 1 726 +;Gap=M726 > > RunC: > contig1 maker gene 20468 21193 . + . ID=snap_masked-contig1-processed-gene-0.5;Name=snap_masked-contig1-processed-gene-0.5 > contig1 maker mRNA 20468 21193 . + . ID=snap_masked-contig1-processed-gene-0.5-mRNA-1;Parent=snap_masked-contig1-processed-gene-0.5;Name=snap_masked-contig1-processed-gene-0.5-mRNA-1;_AED=0.30;_eAED=0.30;_QI=0|-1|0|1|-1|1|1|0|241;_merge_warning=1 > contig1 maker exon 20468 21193 . + . ID=snap_masked-contig1-processed-gene-0.5-mRNA-1:1;Parent=snap_masked-contig1-processed-gene-0.5-mRNA-1 > contig1 maker CDS 20468 21193 . + 0 ID=snap_masked-contig1-processed-gene-0.5-mRNA-1:cds;Parent=snap_masked-contig1-processed-gene-0.5-mRNA-1 > contig1 snap_masked match 20468 21193 42.956 + . 
ID=contig1:hit:5240:4.5.0.0;Name=snap_masked-contig1-abinit-gene-0.5-mRNA-1;target_length=4075195
> contig1 snap_masked match_part 20468 21193 42.956 + . ID=contig1:hsp:12911:4.5.0.0;Parent=contig1:hit:5240:4.5.0.0;Target=snap_masked-contig1-abinit-gene-0.5-mRNA-1 1 726 +;Gap=M726
> contig1 est_gff:est2genome expressed_sequence_match 20468 21193 3630 + . ID=contig1:hit:1051:3.12.0.0;Name=jgi|test_1|140804|est;target_length=726;aligned_coverage=100;aligned_identity=100;aligned_coverage=100;aligned_identity=100;score=3630;target_length=726
> contig1 est_gff:est2genome match_part 20468 21193 3630 + . ID=contig1:hsp:1166:3.12.0.0;Parent=contig1:hit:1051:3.12.0.0;Target=jgi|test_1|140804|est 1 726 +;Gap=M726
>
> Please could anyone shed some light on this?
>
>
> Many thanks in advance.
>
> Urmi
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From urmi208 at gmail.com Mon Mar 26 01:28:21 2018
From: urmi208 at gmail.com (Urmi)
Date: Mon, 26 Mar 2018 08:28:21 +0100
Subject: [maker-devel] Gene loss in subsequent round of maker for fungal genome annotation
In-Reply-To: <7BA23B5D-01FE-4766-B7B8-756E0385453D@gmail.com>
References: <7BA23B5D-01FE-4766-B7B8-756E0385453D@gmail.com>
Message-ID:

That's great! Thanks for the tips, Carson.

Urmi

On Fri, Mar 23, 2018 at 5:28 PM, Carson Holt wrote:

> Run A -> no gene prediction, just cut and paste of transcript/protein alignments to generate rough models.
> Run B -> Gene predictions based on training using only a highly conserved subset of genes (you will have low sensitivity).
> Run C -> Gene predictions based on training using a broader gene set. Higher sensitivity but potentially lower specificity (sensitivity gains should outweigh any specificity loss).
>
> Finally, make sure you look at the models in a browser to see how well evidence and models overlap. If gene fusion is an issue (falsely merged mRNA-seq assembly results will generate hints that can cause gene predictors to fuse gene models), try deFusion -> https://wjidea.github.io/defusion/installation.html
>
> --Carson
>
>
>
> On Mar 21, 2018, at 3:05 AM, Urmi wrote:
>
> Hello maker community,
>
> I am trying to run maker 3.01.02-beta on a fungal genome. I am using available EST and protein sequences from a different strain of the same species using parameters "est" and "protein" in the maker_opts.ctl file. Here is the protocol I am using:
>
> 1. Run maker with repeat masking and providing transcript and protein sequences from related species (Run A)
> 2. Create SNAP model with CEGMA
> 3. Train Augustus with BUSCO
> 4. Run (run B) with the new SNAP (done at step 2) and augustus species with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (altest_gff=runA_cdna2genome.gff, protein_gff=runA_protein2genome.gff3)
> 5. Create SNAP model from run B.
> 6. Train Augustus with transcripts from run B and BUSCO
> 7. Run (run C) with the new SNAP (done at step 5) and augustus species with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (altest_gff=runA_cdna2genome.gff, protein_gff=runA_protein2genome.gff3), keep_preds=1
>
> As a result of this, I get the following gene numbers:
>
> - run A: 12796 total genes out of which 12771 have AED < 0.5
> - run B: 10713 total genes out of which 10701 have AED < 0.5
> - run C: 12651 total genes out of which 12582 have AED < 0.5
>
> Looking at the gff files in detail, it is observed that there are some gene models in run A which are lost in run B and regained in run C. I don't understand why there is gene loss for run B. Here is an example:
>
> *RunA*
>
> contig1 maker gene 20468 21193 . + .
>>> ID=maker-contig1-exonerate_protein2genome-gene-0.34;Name= >>> maker-contig1-exonerate_protein2genome-gene-0.34 >> >> contig1 maker mRNA 20468 21193 100 + . >>> ID=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA- >>> 1;Parent=maker-contig1-exonerate_protein2genome-gene- >>> 0.34;Name=maker-contig1-exonerate_protein2genome-gene- >>> 0.34-mRNA-1;_AED=0.30;_eAED=0.30;_QI=0|-1|0|1|-1|0|1|0|241 >> >> contig1 maker exon 20468 21193 . + . >>> ID=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA- >>> 1:1;Parent=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1 >> >> contig1 maker CDS 20468 21193 . + 0 >>> ID=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA- >>> 1:cds;Parent=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1 >> >> contig1 blastn expressed_sequence_match 20468 21193 726 >>> + . ID=contig1:hit:983:3.2.0.0;Name=jgi|test_1|140804|est >>> target_length=726 >> >> contig1 blastn match_part 20468 21193 726 + . >>> ID=contig1:hsp:998:3.2.0.0;Parent=contig1:hit:983:3.2.0.0;Target=jgi|test_1|140804|est >>> 1 726 +;Gap=M726 >> >> contig1 est2genome expressed_sequence_match 20468 21193 >>> 3630 + . ID=contig1:hit:1022:3.2.0.0; >>> Name=jgi|test_1|140804|est;target_length=726;aligned_ >>> coverage=100;aligned_identity=100 >> >> contig1 est2genome match_part 20468 21193 3630 + >>> . ID=contig1:hsp:1110:3.2.0.0;Parent=contig1:hit:1022:3.2.0.0;Target=jgi|test_1|140804|est >>> 1 726 +;Gap=M726 >> >> > *RunB:* > >> contig1 est_gff:est2genome expressed_sequence_match 20468 >>> 21193 3630 + . ID=contig1:hit:1051:3.12.0.0; >>> Name=jgi|test_1|140804|est;target_length=726;aligned_ >>> coverage=100;aligned_identity=100;aligned_coverage=100; >>> aligned_identity=100;score=3630;target_length=726 >> >> contig1 est_gff:est2genome match_part 20468 21193 3630 >>> + . ID=contig1:hsp:1166:3.12.0.0; >>> Parent=contig1:hit:1051:3.12.0.0;Target=jgi|test_1|140804|est 1 726 >>> +;Gap=M726 >> >> > *RunC: * > >> contig1 maker gene 20468 21193 . + . 
>>> ID=snap_masked-contig1-processed-gene-0.5;Name=snap_ >>> masked-contig1-processed-gene-0.5 >> >> contig1 maker mRNA 20468 21193 . + . >>> ID=snap_masked-contig1-processed-gene-0.5-mRNA-1; >>> Parent=snap_masked-contig1-processed-gene-0.5;Name=snap_ >>> masked-contig1-processed-gene-0.5-mRNA-1;_AED=0.30;_eAED=0. >>> 30;_QI=0|-1|0|1|-1|1|1|0|241;_merge_warning=1 >> >> contig1 maker exon 20468 21193 . + . >>> ID=snap_masked-contig1-processed-gene-0.5-mRNA-1:1; >>> Parent=snap_masked-contig1-processed-gene-0.5-mRNA-1 >> >> contig1 maker CDS 20468 21193 . + 0 >>> ID=snap_masked-contig1-processed-gene-0.5-mRNA-1:cds; >>> Parent=snap_masked-contig1-processed-gene-0.5-mRNA-1 >> >> contig1 snap_masked match 20468 21193 42.956 + . >>> ID=contig1:hit:5240:4.5.0.0;Name=snap_masked-contig1- >>> abinit-gene-0.5-mRNA-1;target_length=4075195 >> >> contig1 snap_masked match_part 20468 21193 42.956 + >>> . ID=contig1:hsp:12911:4.5.0.0;Parent=contig1:hit:5240:4.5.0. >>> 0;Target=snap_masked-contig1-abinit-gene-0.5-mRNA-1 1 726 +;Gap=M726 >> >> contig1 est_gff:est2genome expressed_sequence_match 20468 >>> 21193 3630 + . ID=contig1:hit:1051:3.12.0.0; >>> Name=jgi|test_1|140804|est;target_length=726;aligned_ >>> coverage=100;aligned_identity=100;aligned_coverage=100; >>> aligned_identity=100;score=3630;target_length=726 >> >> contig1 est_gff:est2genome match_part 20468 21193 3630 >>> + . ID=contig1:hsp:1166:3.12.0.0; >>> Parent=contig1:hit:1051:3.12.0.0;Target=jgi|test_1|140804|est 1 726 >>> +;Gap=M726 >> >> > Please could anyone shed come light on this? > > > Many thanks in advance. > > Urmi > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > -------------- next part -------------- An HTML attachment was scrubbed... 
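The per-run totals quoted in this thread (number of models and how many have AED < 0.5) can be approximated directly from a merged GFF3. The following is a sketch only: count_aed_below is a made-up function name, the input is assumed to be tab-delimited GFF3, and it counts mRNA records (the _AED attribute sits on the mRNA lines in the excerpts above) rather than genes, so the numbers will differ from gene counts wherever a gene has more than one transcript.

```shell
# Count MAKER mRNA features and how many have _AED below a cutoff.
# count_aed_below is a hypothetical helper; prints "<total> <below>".
count_aed_below() {
    gff=$1
    cutoff=${2:-0.5}
    awk -F '\t' -v cutoff="$cutoff" '
        # only mRNA feature lines written by MAKER itself
        $0 !~ /^#/ && $2 == "maker" && $3 == "mRNA" {
            total++
            # find "_AED=<number>" at the start of the attributes or after ";"
            # (this deliberately does not match "_eAED=")
            if (match($9, /(^|;)_AED=[0-9.]+/)) {
                aed = substr($9, RSTART, RLENGTH)
                sub(/.*=/, "", aed)
                if (aed + 0 < cutoff + 0) below++
            }
        }
        END { printf "%d %d\n", total + 0, below + 0 }
    ' "$gff"
}
```

For example, count_aed_below runB.all.gff 0.5 (file name is a placeholder) prints the number of mRNA records followed by the number with _AED under 0.5.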
From vsoza at uw.edu Mon Mar 26 12:49:24 2018
From: vsoza at uw.edu (Valerie Soza)
Date: Mon, 26 Mar 2018 11:49:24 -0700
Subject: [maker-devel] clarification on creating a standard build
In-Reply-To: <4D9CD374-D142-4387-A4D7-1FB7A8A74F48@gmail.com>
References: <4D9CD374-D142-4387-A4D7-1FB7A8A74F48@gmail.com>
Message-ID: <57B30565-1603-4723-AF74-FEB54F735899@uw.edu>

Hi Carson,

Thanks for the clarification on the steps, but I am having issues with the first step: grepping the IDs from the InterProScan .tsv file in my .gff file. I get no hits for these IDs in the .gff file. I think the IDs used in the maker.non_overlapping_ab_initio.proteins.fasta are different from what is actually used in the .gff file. Please see my commands/results below.

I created the .gff file by this command:
gff3_merge -d Rwill7_master_datastore_index.log

I created the .fasta files by this command:
fasta_merge -d Rwill7_master_datastore_index.log

I ran InterProScan with this command:
interproscan.sh -appl PfamA -iprlookup -goterms -f tsv -i Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta

When I try grepping the IDs from the Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv in Rwill7.all.gff, I get nothing. See an example for one ID from the .tsv file below:

$ more Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv

snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1 d146190e642a740520c97a782a74fe32 356 Pfam PF13365 Trypsin-like peptidase domain 77 228 1.4E-17 T 20-03-2018

$ grep snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1 Rwill7.all.gff
#no results

There is no "processed-gene" with this ID in the Rwill7.all.gff file:

$ grep snap_masked-LG12_ordered_scaffold_85-processed-gene Rwill7.all.gff

LG12_ordered_scaffold_85 maker gene 63727 67053 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8;Name=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8
LG12_ordered_scaffold_85 maker mRNA 63727 67053 . + .
ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8;Name=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1;_AED=0.81;_eAED=0.81;_QI=0|0|0|0|1|1|6|0|200 LG12_ordered_scaffold_85 maker exon 63727 63768 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42245;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 LG12_ordered_scaffold_85 maker exon 64269 64340 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42246;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 LG12_ordered_scaffold_85 maker exon 64896 65000 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42247;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 LG12_ordered_scaffold_85 maker exon 65268 65327 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42248;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 LG12_ordered_scaffold_85 maker exon 65716 65915 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42249;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 LG12_ordered_scaffold_85 maker exon 66930 67053 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42250;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 LG12_ordered_scaffold_85 maker CDS 63727 63768 . + 0 ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 LG12_ordered_scaffold_85 maker CDS 64269 64340 . + 0 ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 LG12_ordered_scaffold_85 maker CDS 64896 65000 . 
+ 0 ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
LG12_ordered_scaffold_85 maker CDS 65268 65327 . + 0 ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
LG12_ordered_scaffold_85 maker CDS 65716 65915 . + 0 ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
LG12_ordered_scaffold_85 maker CDS 66930 67053 . + 1 ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1

However, there are Names and Targets called "abinit-gene", not "processed-gene", that appear to have this gene number in the .gff file:

$ grep snap_masked-LG12_ordered_scaffold_85 Rwill7.all.gff

#some results from command above that might be snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1?

LG12_ordered_scaffold_85 snap_masked match 101798 108141 35.366 + ID=LG12_ordered_scaffold_85:hit:469677:4.5.0.0;Name=snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1
LG12_ordered_scaffold_85 snap_masked match_part 101798 102633 35.236 ID=LG12_ordered_scaffold_85:hsp:1369290:4.5.0.0;Parent=LG12_ordered_scaffold_85:hit:469677:4.5.0.0;Target=snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1 1 836 +;Gap=M836
LG12_ordered_scaffold_85 snap_masked match_part 107907 108141 0.130 ID=LG12_ordered_scaffold_85:hsp:1369291:4.5.0.0;Parent=LG12_ordered_scaffold_85:hit:469677:4.5.0.0;Target=snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1 837 1071 +;Gap=M235

So I looked at the maker.non_overlapping_ab_initio.proteins.fasta file to see if this "abinit-gene"
from the .gff file was present:

$ grep snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1 Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta
#no results using the "abinit-gene" Name from the .gff file

versus using the ID from the .tsv file to grep against the maker.non_overlapping_ab_initio.proteins.fasta file:

$ grep snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1 Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta
>snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1 protein AED:1.00 eAED:1.00 QI:0|0|0|0|1|1|2|0|356

I think the issue is that what the .gff file calls "abinit-genes" are called "processed-genes" in the Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta file, and that name is then propagated into the .tsv file. Is my interpretation correct?

If so, all I would have to do is grep the .gff file after replacing "processed" with "abinit" in the IDs, correct?

Thanks for your help.

-Valerie

> On Mar 23, 2018, at 10:20 AM, Carson Holt wrote:
>
> You will get the ID from the InterProScan report. Then you can take that ID and grep for it in the MAKER gff3.
>
> All ab initio predictions have match/match_part features created for them in the gff3. You can then take those non-gene match/match_part features and provide them to pred_gff.
>
> You then have two alternate ways to get those models into your dataset.
>
> 1. Do a second run with only that pred_gff (i.e. turn off all other MAKER options and blank out all evidence including repeat masking options) and set keep_preds=1.
>
> That will simply take the pred_gff match/match_part values and turn them into nicely formatted gene/mRNA/exon/CDS features together with associated fasta files. Those can then simply be merged into your current result using GFF3 merge.
>
> 2. Provide maker_gff (set pred_pass=0 and all other pass options to 1), provide pred_gff, and set keep_preds=1.
>
> This is the same as the previous run option, but MAKER will do the merging for you.
> But it will take longer since it will use the maker_gff to rebuild all models and evidence in memory and rescore everything.
>
> --Carson
>
>
>
>> On Mar 20, 2018, at 6:48 PM, Valerie Soza wrote:
>>
>> Hi MAKER community
>>
>> I am trying to create a standard build as indicated in the Campbell et al. 2014 papers in Plant Physiology and Current Protocols in Bioinformatics. I was following the protocol as outlined in Current Protocols in Bioinformatics, but then came across this thread in the MAKER google forum: https://groups.google.com/forum/#!searchin/maker-devel/quality_filter%7Csort:date/maker-devel/97aNJkT3bgk/mpL7V5QWAAAJ.
>>
>> I can't reply to this original thread, but I am trying to follow Carson's suggestion for a standard build using this protocol instead now:
>> "One note I'd like to make, is that doing a second round with keep_preds=1 is the wrong procedure (only do that if you really want to keep everything - i.e. in some fungi or oomycetes). Rather you should use InterProScan to evaluate the rejected models in the non-overlapping.abinit.proteins.fasta file, then grep the ones that have an IPR domain out of the GFF3 (will be match/match_part features) and then pass them to pred_gff in a separate run (just updates the format to gene/mRNA/exon/CDS with proper reading frame). You can then merge the resulting GFF3's and fasta files."
>>
>> Instead of doing a second round of annotations with keep_preds=1, I am using my original annotations with keep_preds=0. I have used InterProScan on the non-overlapping.abinit.proteins.fasta. I am unclear as to what gff3 file to use to grep for genes with IPR domains from the non-overlapping.abinit.proteins.fasta file. Genes from the non-overlapping.abinit.proteins.fasta file are not in my .all.gff file created by the gff3_merge script.
>>
>> What gff3 file should I be using to resurrect proteins with IPR domains from the non-overlapping.abinit.proteins.fasta?
>> Should I be doing an annotation with keep_preds=1 as well, and resurrecting genes with IPR domains from this gff3?
>>
>> Thanks.
>>
>> -Valerie
>>
>> Valerie Soza, Ph.D.
>> c/o Hall Lab
>> Department of Biology
>> University of Washington
>> Johnson Hall 202A
>> Box 351800
>> Seattle, WA 98195-1800
>> 206-543-6740
>> http://staff.washington.edu/vsoza/
>>
>>
>> _______________________________________________
>> maker-devel mailing list
>> maker-devel at box290.bluehost.com
>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>

Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/

From vsoza at uw.edu Tue Mar 27 10:50:38 2018
From: vsoza at uw.edu (Valerie Soza)
Date: Tue, 27 Mar 2018 09:50:38 -0700
Subject: [maker-devel] how to output masked genome from MAKER
In-Reply-To: <15BF15D1-835F-4E8E-9B52-A958EF19BCD9@gmail.com>
References: <15BF15D1-835F-4E8E-9B52-A958EF19BCD9@gmail.com>
Message-ID: <7264C880-3806-443D-9CC2-E70D4366CD8A@uw.edu>

Hi Carson,

Thanks, that is simple and it worked. I did the following to sort and concatenate the query.masked.fasta files into one fasta:

$ find Rwill7.maker.output -name 'query.masked.fasta' | sort -t "/" -k 5 | xargs cat > Rwill7.maker.assembly_masked.sorted.fasta

-Valerie

> On Mar 15, 2018, at 8:31 AM, Carson Holt wrote:
>
> You will just have to find and concatenate the files yourself.
>
> Something like -> find assembly.maker.output -name 'query.masked.fasta' | xargs cat > assembly_masked.fasta
>
> --Carson
>
>
>> On Mar 7, 2018, at 2:19 PM, Valerie Soza wrote:
>>
>> Hi MAKER community
>>
>> I am wondering whether it is possible to get the entire masked genome generated by MAKER as an output. I think I have found it in pieces in the query.masked.fasta files generated for each scaffold by MAKER in theVoid directories. Is there a script that anyone has used to collate these files?
>>
>> I am aware of the fasta_merge script that comes bundled with MAKER, but it does not seem to collate the query.masked.fasta files. Could this perl script be modified to do this action?
>>
>> Thanks for any help or insights.
>>
>> -Valerie
>>
>> Valerie Soza, Ph.D.
>> c/o Hall Lab
>> Department of Biology
>> University of Washington
>> Johnson Hall 202A
>> Box 351800
>> Seattle, WA 98195-1800
>> 206-543-6740
>> http://staff.washington.edu/vsoza/
>>
>>
>> _______________________________________________
>> maker-devel mailing list
>> maker-devel at box290.bluehost.com
>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>

Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/

From vsoza at uw.edu Thu Mar 29 12:42:28 2018
From: vsoza at uw.edu (Valerie Soza)
Date: Thu, 29 Mar 2018 11:42:28 -0700
Subject: [maker-devel] clarification on creating a standard build
In-Reply-To: <57B30565-1603-4723-AF74-FEB54F735899@uw.edu>
References: <4D9CD374-D142-4387-A4D7-1FB7A8A74F48@gmail.com> <57B30565-1603-4723-AF74-FEB54F735899@uw.edu>
Message-ID: <624D4D7C-2BB3-4A1D-9088-116807492D2E@uw.edu>

Hi MAKER community,

I was having issues grepping IDs from the InterProScan .tsv file against my all.gff file because the abinit genes in the all.gff file are called processed genes in the .all.maker.non_overlapping_ab_initio.proteins.fasta file, and that name gets propagated into the .tsv file. I solved this by replacing "processed" with "abinit" in the .tsv file and then grepping the all.gff file with these IDs to create the pred_gff input.

sed 's/-processed-/-abinit-/g' Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv > Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed

Then I extracted only the IDs from the .tsv file to grep against the all.gff file.
cut -f 1 Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed > Rwill7.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs

I was getting non-unique top level ID errors when I tried to run maker with pred_gff, and I realized that IDs were duplicated in the .tsv file. So I went back to my list of IDs and extracted only the unique IDs to grep.

sort Rwill7.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs | uniq > Rwill7.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs

Then I redid my grepping and maker ran beautifully. I tried both of the options recommended by Carson, but wound up going with option #2 because I am lazy, and now I have a standard build :)

-Valerie

> On Mar 26, 2018, at 11:49 AM, Valerie Soza wrote:
>
> Hi Carson
>
> Thanks for the clarification on steps, but I am having issues with the first step of grepping the ID from the InterProScan .tsv file in my .gff file. I am getting no results of these IDs from the .gff file. I think the IDs used in the maker.non_overlapping_ab_initio.proteins.fasta are different from what is actually used in the .gff file. Please see my commands/results below.
>
> I created the .gff file by this command:
> gff3_merge -d Rwill7_master_datastore_index.log
>
> I created the .fasta files by this command:
> fasta_merge -d Rwill7_master_datastore_index.log
>
> I ran InterProScan with this command:
> interproscan.sh -appl PfamA -iprlookup -goterms -f tsv -i Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta
>
> When I try grepping the IDs in the Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv from Rwill7.all.gff, I get nothing.
See an example for one ID from the .tsv file below: > > $ more Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv > > snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1 d146190e642a740520c9 > 7a782a74fe32 356 Pfam PF13365 Trypsin-like peptidase domain 77 2281.4E-17 T 20-03-2018 > > $ grep snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1 Rwill7.all.gff > #no results > > There is no "processed-gene" with this ID in the Rwill7.all.gff file: > > $ grep snap_masked-LG12_ordered_scaffold_85-processed-gene Rwill7.all.gff > > LG12_ordered_scaffold_85 maker gene 63727 67053 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8;Name=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8 > LG12_ordered_scaffold_85 maker mRNA 63727 67053 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8;Name=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1;_AED=0.81;_eAED=0.81;_QI=0|0|0|0|1|1|6|0|200 > LG12_ordered_scaffold_85 maker exon 63727 63768 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42245;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 > LG12_ordered_scaffold_85 maker exon 64269 64340 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42246;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 > LG12_ordered_scaffold_85 maker exon 64896 65000 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42247;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 > LG12_ordered_scaffold_85 maker exon 65268 65327 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42248;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 > LG12_ordered_scaffold_85 maker exon 65716 65915 . + . 
ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42249;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 > LG12_ordered_scaffold_85 maker exon 66930 67053 . + . ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42250;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 > LG12_ordered_scaffold_85 maker CDS 63727 63768 . + 0 ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 > LG12_ordered_scaffold_85 maker CDS 64269 64340 . + 0 ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 > LG12_ordered_scaffold_85 maker CDS 64896 65000 . + 0 ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 > LG12_ordered_scaffold_85 maker CDS 65268 65327 . + 0 ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 > LG12_ordered_scaffold_85 maker CDS 65716 65915 . + 0 ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 > LG12_ordered_scaffold_85 maker CDS 66930 67053 . + 1 ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1 > > However, there are Names and Targets called "abinit-gene?, not "processed-gene?, that appear to have this gene number in the .gff file: > > $ grep snap_masked-LG12_ordered_scaffold_85 Rwill7.all.gff > > #some results from command above that might be snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1? 
>
> LG12_ordered_scaffold_85 snap_masked match 101798 108141 35.366 + ID=LG12_ordered_scaffold_85:hit:469677:4.5.0.0;Name=snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1
> LG12_ordered_scaffold_85 snap_masked match_part 101798 102633 35.236 ID=LG12_ordered_scaffold_85:hsp:1369290:4.5.0.0;Parent=LG12_ordered_scaffold_85:hit:469677:4.5.0.0;Target=snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1 1 836 +;Gap=M836
> LG12_ordered_scaffold_85 snap_masked match_part 107907 108141 0.130 ID=LG12_ordered_scaffold_85:hsp:1369291:4.5.0.0;Parent=LG12_ordered_scaffold_85:hit:469677:4.5.0.0;Target=snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1 837 1071 +;Gap=M235
>
> So I looked at the maker.non_overlapping_ab_initio.proteins.fasta file to see if this "abinit-gene" from the .gff file was present:
>
> $ grep snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1 Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta
> #no results using the "abinit-gene" Name from the .gff file
>
> versus using the ID from the .tsv file to grep against the maker.non_overlapping_ab_initio.proteins.fasta file:
>
> $ grep snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1 Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta
> >snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1 protein AED:1.00 eAED:1.00 QI:0|0|0|0|1|1|2|0|356
>
> I think the issue is that Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta labels as "processed-genes" what the .gff file calls "abinit-genes", and that label is then propagated into the .tsv file. Is my interpretation correct?
>
> If so, all I would have to do is grep the .gff file, replacing "processed" with "abinit" in the ID, correct?
>
> Thanks for your help.
>
> -Valerie
>
>> On Mar 23, 2018, at 10:20 AM, Carson Holt wrote:
>>
>> You will get the ID from the InterProScan report. Then you can take that ID and grep for it in the MAKER gff3.
>>
>> All ab initio predictions have match/match_part features created for them in the gff3. You can then take those non-gene match/match_part features and provide them to pred_gff.
>>
>> You then have two alternate ways to get those models into your dataset.
>>
>> 1. Do a second run with only that pred_gff (i.e. turn off all other MAKER options and blank out all evidence including repeat masking options) and set keep_preds=1.
>>
>> That will simply take the pred_gff match/match_part values and turn them into nicely formatted gene/mRNA/exon/CDS features together with associated fasta files. Those can then simply be merged into your current result using GFF3 merge.
>>
>> 2. Provide maker_gff (set pred_pass=0 and all other pass options to 1), provide pred_gff, and set keep_preds=1.
>>
>> This is the same as the previous run option, but MAKER will do the merging for you. It will take longer, though, since it will use the maker_gff to rebuild all models and evidence in memory and rescore everything.
>>
>> --Carson
>>
>>
>>
>>> On Mar 20, 2018, at 6:48 PM, Valerie Soza wrote:
>>>
>>> Hi MAKER community
>>>
>>> I am trying to create a standard build as indicated in the Campbell et al. 2014 papers in Plant Physiology and Current Protocols in Bioinformatics. I was following the protocol as outlined in Current Protocols in Bioinformatics, but then came across this thread in the MAKER google forum: https://groups.google.com/forum/#!searchin/maker-devel/quality_filter%7Csort:date/maker-devel/97aNJkT3bgk/mpL7V5QWAAAJ.
>>>
>>> I can't reply to this original thread, but I am now trying to follow Carson's suggestion for a standard build using this protocol instead:
>>> "One note I'd like to make, is that doing a second round with keep_preds=1 is the wrong procedure (only do that if you really want to keep everything - i.e. in some fungi or oomycetes). Rather you should use InterProScan to evaluate the rejected models in the non-overlapping.abinit.proteins.fasta file, then grep the ones that have an IPR domain out of the GFF3 (will be match/match_part features) and then pass them to pred_gff in a separate run (just updates the format to gene/mRNA/exon/CDS with proper reading frame). You can then merge the resulting GFF3's and fasta files."
>>>
>>> Instead of doing a second round of annotations with keep_preds=1, I am using my original annotations with keep_preds=0. I have used InterProScan on the non-overlapping.abinit.proteins.fasta. I am unclear as to which gff3 file to use to grep for genes with IPR domains from the non-overlapping.abinit.proteins.fasta file. Genes from the non-overlapping.abinit.proteins.fasta file are not in my .all.gff file created by the gff3_merge script.
>>>
>>> What gff3 file should I be using to resurrect proteins with IPR domains from the non-overlapping.abinit.proteins.fasta? Should I be doing an annotation with keep_preds=1 as well, and resurrecting genes with IPR domains from this gff3?
>>>
>>> Thanks.
>>>
>>> -Valerie
>>>
>>> Valerie Soza, Ph.D.
>>> c/o Hall Lab
>>> Department of Biology
>>> University of Washington
>>> Johnson Hall 202A
>>> Box 351800
>>> Seattle, WA 98195-1800
>>> 206-543-6740
>>> http://staff.washington.edu/vsoza/
>>>
>>>
>>> _______________________________________________
>>> maker-devel mailing list
>>> maker-devel at box290.bluehost.com
>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>
>
> Valerie Soza, Ph.D.
> c/o Hall Lab
> Department of Biology
> University of Washington
> Johnson Hall 202A
> Box 351800
> Seattle, WA 98195-1800
> 206-543-6740
> http://staff.washington.edu/vsoza/
>

Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/
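
For anyone following this thread, the lookup discussed above (take the IDs from the InterProScan .tsv, find the corresponding match/match_part features in the merged gff3, and save them for pred_gff) might be sketched in shell roughly as follows. This is only a sketch under assumptions that the thread does not confirm: that the "-processed-gene-" IDs in the .tsv really do correspond to "-abinit-gene-" Names and Targets in the gff3, that the .tsv has already been filtered down to the hits you want to keep, and that the gff3 is standard tab-separated with attributes in column 9. The function names are made up for illustration and are not part of MAKER.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not shipped with MAKER.

# Print the unique protein IDs (column 1) of an InterProScan .tsv,
# rewriting "-processed-gene-" to "-abinit-gene-" (ASSUMED mapping).
ipr_ids_to_abinit() {
    cut -f1 "$1" | sed 's/-processed-gene-/-abinit-gene-/' | sort -u
}

# Print the match/match_part lines of a gff3 ($2) whose Name= or
# Target= value is exactly one of the IDs listed in $1 (one per line).
extract_abinit_matches() {
    awk -F'\t' -v ids="$1" '
        BEGIN { while ((getline line < ids) > 0) want[line] = 1 }
        $3 != "match" && $3 != "match_part" { next }
        {
            n = split($9, attrs, ";")
            for (i = 1; i <= n; i++) {
                if (attrs[i] ~ /^Name=/) {
                    id = substr(attrs[i], 6)
                } else if (attrs[i] ~ /^Target=/) {
                    id = substr(attrs[i], 8)
                    sub(/ .*/, "", id)   # drop "start end strand"
                } else {
                    continue
                }
                if (id in want) { print; next }
            }
        }
    ' "$2"
}

# Example use (ids.txt and rescued.pred.gff are placeholder names):
#   ipr_ids_to_abinit Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv > ids.txt
#   extract_abinit_matches ids.txt Rwill7.all.gff > rescued.pred.gff
```

The comparison is on the whole ID rather than a plain substring grep, so that an ID ending in mRNA-1 does not also pull in mRNA-10. The resulting file would then be what gets passed to pred_gff in either of the two run options Carson describes.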