From seoanezonjic at hotmail.com  Tue Mar  6 03:30:24 2018
From: seoanezonjic at hotmail.com (p sz)
Date: Tue, 6 Mar 2018 09:30:24 +0000
Subject: [maker-devel] Problems with failed contigs
Message-ID: <DB6PR0102MB27095AC27BB34FF5B0D7C1DED1D90@DB6PR0102MB2709.eurprd01.prod.exchangelabs.com>

Hi
I am currently annotating a fish genome with MAKER 3, and the run does not finish properly. I run MAKER on a shared cluster through MPI, requesting 96 cores (so as not to block the MAKER manager or the file system) on 6 nodes of 16 cores each. The job runs for 3 or 4 days and then the output suddenly stops. The job stays alive for 3 more days, but the datastore index is not modified, and the only files whose attributes change are the lock files. When the job exceeds the wall-time limit (7 days), it is cancelled. I then inspected the datastore index and counted the unique STARTED and FINISHED tags for each scaffold. The results are:
STARTED:3890
FINISHED:3378
So the analysis fails for about 500 scaffolds. When I inspect MAKER's STDERR output, I see this error line repeated ~2000 times, at different places:
substr outside of string at /mnt/home/users/pab_001_uma/pedro/software/maker/bin/../lib/PhatHit_utils.pm line 850
and near this line, the following:
ERROR: Failed while annotating transcripts
My questions are: what can I do to annotate the failed scaffolds? And is the large number of failed analyses the reason the job freezes?
Thanks in advance


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://box290.bluehost.com/pipermail/maker-devel_yandell-lab.org/attachments/20180306/c3737e92/attachment.html>

From vsoza at uw.edu  Wed Mar  7 15:19:15 2018
From: vsoza at uw.edu (Valerie Soza)
Date: Wed, 7 Mar 2018 13:19:15 -0800
Subject: [maker-devel] how to output masked genome from MAKER
Message-ID: <FC69DF53-A058-4B3E-A612-BE8FC4652857@uw.edu>

Hi MAKER community

I am wondering whether it is possible to get the entire masked genome generated by MAKER as an output. I think I have found it in pieces in the query.masked.fasta files generated for each scaffold by MAKER in theVoid directories. Is there a script that anyone has used to collate these files?

I am aware of the fasta_merge script that comes bundled with MAKER, but it does not seem to collate the query.masked.fasta files. Could this Perl script be modified to do this?

Thanks for any help or insights.

-Valerie

Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/


From flopezo84 at gmail.com  Fri Mar  9 10:15:39 2018
From: flopezo84 at gmail.com (Federico López)
Date: Fri, 9 Mar 2018 11:15:39 -0500
Subject: [maker-devel] Using PASA assemblies with MAKER
Message-ID: <CAEW5o_M2KdOKnkB0KpVhDSTvMTBoqdGiimPpVHBoUCFWhbU+JA@mail.gmail.com>

Hello,

I was wondering what might be the recommended option for using PASA2
alignment assemblies with MAKER3:

1. PASA assemblies in FASTA format (est)
2. PASA assembly structures (est_gff)
3. ORFs from PASA assemblies (protein)

And related to this question, when I use the PASA2 assembly structures in
GFF3 format, MAKER reports the error below.

"ERROR: Non-unique top level ID for..."

I suppose all the non-unique IDs need to be renamed for MAKER?

Any help is greatly appreciated.

From wangzhennan at ioz.ac.cn  Tue Mar 13 22:53:44 2018
From: wangzhennan at ioz.ac.cn (wangzhennan at ioz.ac.cn)
Date: Wed, 14 Mar 2018 11:53:44 +0800 (GMT+08:00)
Subject: [maker-devel] Some transcripts have no AED?
Message-ID: <59a852e2.418b6.16222a46d59.Coremail.wangzhennan@ioz.ac.cn>

Hi,

   When I ran MAKER to get AED values, some transcripts had no AED value in the results. Why? I ran MAKER with RNA-Seq data and protein evidence. Thank you.

   Best wishes,
   Wang

From d.ence at ufl.edu  Wed Mar 14 06:33:01 2018
From: d.ence at ufl.edu (Ence,daniel)
Date: Wed, 14 Mar 2018 11:33:01 +0000
Subject: [maker-devel] Some transcripts have no AED?
In-Reply-To: <59a852e2.418b6.16222a46d59.Coremail.wangzhennan@ioz.ac.cn>
References: <59a852e2.418b6.16222a46d59.Coremail.wangzhennan@ioz.ac.cn>
Message-ID: <00243144-1906-485F-B2CB-977FA2BBD161@ufl.edu>

Hi, can you send a few example lines? Do some of the transcripts have AEDs?

~Daniel


From seoanezonjic at hotmail.com  Wed Mar 14 08:52:12 2018
From: seoanezonjic at hotmail.com (p sz)
Date: Wed, 14 Mar 2018 13:52:12 +0000
Subject: [maker-devel] Problems with failed contigs
In-Reply-To: <DB6PR0102MB27095AC27BB34FF5B0D7C1DED1D90@DB6PR0102MB2709.eurprd01.prod.exchangelabs.com>
References: <DB6PR0102MB27095AC27BB34FF5B0D7C1DED1D90@DB6PR0102MB2709.eurprd01.prod.exchangelabs.com>
Message-ID: <DB6PR0102MB2709EA5CAB5E8F5B46FDA541D1D10@DB6PR0102MB2709.eurprd01.prod.exchangelabs.com>

Hi
I tried splitting the FASTA file into small chunks and running MAKER separately on each chunk. Most of the runs fail as I described previously. I inspected the results of a random chunk with 287 contigs, and these error lines are repeated 47 times:

substr outside of string at /mnt/home/users/pab_001_uma/pedro/software/maker/bin/../lib/PhatHit_utils.pm line 850.
--> rank=15, hostname=dx095
ERROR: Failed while annotating transcripts
ERROR: Chunk failed at level:1, tier_type:4
FAILED CONTIG:Sosen1_s1284
ERROR: Chunk failed at level:6, tier_type:0
FAILED CONTIG:Sosen1_s1284

Can you help me to fix this problem?
Thank you in advance
Pedro Seoane
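
A minimal sketch for pulling the unique failed contig names out of a saved STDERR log, so that they can be inspected or rerun individually. It relies only on the "FAILED CONTIG:<name>" line format shown in the excerpt above; the function name and the log file name in the usage comment are hypothetical.

```shell
# List each contig reported as failed, once, from a saved MAKER STDERR log.
# usage: failed_contigs maker.stderr.log   (file name is a placeholder)
failed_contigs() {
    # Keep only the FAILED CONTIG lines, strip the label, and de-duplicate.
    grep '^FAILED CONTIG:' "$1" | cut -d ':' -f 2- | sort -u
}
```

Because each failure is logged more than once (once per tier level in the excerpt), de-duplicating gives the number of distinct contigs rather than the number of error lines.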


From vsoza at uw.edu  Wed Mar 14 19:21:26 2018
From: vsoza at uw.edu (Valerie Soza)
Date: Wed, 14 Mar 2018 17:21:26 -0700
Subject: [maker-devel] scaffolds missing from master_datastore_index.log and
 all.gff files
Message-ID: <6F2917B6-12D5-41D6-BD12-CDD7FDF0F5C8@uw.edu>

Hi MAKER community

I have done several rounds of training and annotations on an assembly that consists of 12027 scaffolds using the MAKER2 pipeline. I am running multiple instances of MAKER to speed up the process. I have noticed that the number of contigs (aka scaffolds) differs among the different rounds of annotations I have done in MAKER, ranging from 12024 to 12027 scaffolds. These counts were obtained with the SOBAcl tool to count the number of contigs from each all.gff file generated by the gff3_merge script included in MAKER.

In my latest round of annotations within MAKER, I have only obtained 12026 scaffolds using the SOBAcl tool on the all.gff file, indicating that I am missing 1 scaffold, even though there was no indication of any scaffolds as FAILED, RETRY, or SKIPPED.

To figure out what might be going on, I searched for STARTED and FINISHED scaffolds in the master_datastore_index.log and found that I had a different number of started vs. finished scaffolds, and none of these were equal to the total in the assembly of 12027.

$ grep STARTED Rwill7_master_datastore_index.log | sort | uniq | cut -f 1 | wc 
12024   12024  313247

3 started scaffolds missing from this file are LG01_ordered_scaffold_2, LG01_ordered_scaffold_3, and LG07_unordered_scaffold_86.

$ grep FINISHED Rwill7_master_datastore_index.log | sort | uniq | cut -f 1 | wc
12026   12026  313295

1 finished scaffold missing from this file is LG08_unordered_scaffold_90.

I then searched for these scaffolds in the all.gff file and found that the 3 missing started scaffolds were present, but the one missing finished scaffold (LG08_unordered_scaffold_90) was not. This scaffold (LG08_unordered_scaffold_90) should be in the gff3 as it had some repeat masking done on it as indicated by the query.masked.fasta file for this scaffold in its theVoid directory. 

After looking at the gff3_merge and fasta_merge scripts, it seems that only finished scaffolds are used to generate gff3 and fasta files so this explains why I am missing one scaffold (LG08_unordered_scaffold_90) for a total of only 12026 scaffolds in the all.gff file.

I am concerned that because the started and finished scaffolds are different in the master_datastore_index.log, that not all scaffolds are being output to the gff3 and fasta files generated by the MAKER scripts.

Any insights as to why I am getting different numbers of scaffolds indicated as started versus finished, and as to why all but 1 scaffold finished?
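
For reference, the comparison described above can be sketched as a small shell function that prints the assembly IDs lacking a given status in the index log. It assumes the log is tab-separated with the scaffold ID in the first column (as the cut -f 1 calls above imply) and the status in the last column; the last-column position is an assumption, and the function name is hypothetical.

```shell
# Print FASTA sequence IDs that have no line with the given status in the log.
# usage: missing_from_log genome.fasta master_datastore_index.log FINISHED
missing_from_log() {
    awk -F '\t' -v log_file="$2" -v status="$3" '
        FILENAME == log_file {            # first file: remember IDs with this status
            if ($NF == status) seen[$1] = 1
            next
        }
        /^>/ {                            # second file: FASTA header lines only
            id = substr($0, 2)
            sub(/[ \t].*/, "", id)        # drop any description after the ID
            if (!(id in seen)) print id
        }
    ' "$2" "$1"
}
```

Running it once with STARTED and once with FINISHED gives the two lists of missing scaffolds directly, in assembly order.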

Thanks.

-Valerie

Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/


From d.ence at ufl.edu  Thu Mar 15 08:15:00 2018
From: d.ence at ufl.edu (Ence,daniel)
Date: Thu, 15 Mar 2018 13:15:00 +0000
Subject: [maker-devel] Fwd:  Some transcripts have no AED?
References: <8E389D32-28DE-4FE7-A455-15786C1B2EAF@mail.ufl.edu>
Message-ID: <8AE0B9E6-2496-43AE-BC79-9FDDD2A0CFD3@mail.ufl.edu>


Begin forwarded message:

From: "Ence,daniel" <d.ence at ufl.edu>
Subject: Re: [maker-devel] Some transcripts have no AED?
Date: March 15, 2018 at 9:06:45 AM EDT
To: wangzhennan at ioz.ac.cn
Cc: "Ence,daniel" <d.ence at ufl.edu>

Hi, I really can't help you without more information to answer either of your questions (the number of genes or the transcripts without AEDs). If you ran MAKER with some evidence to annotate those 23000 genes, then some of those genes were probably not supported by the evidence. Did you give those 23000 genes as predictions or as models? Those are two different options in MAKER and would give different results.

~Daniel


On Mar 15, 2018, at 4:11 AM, wangzhennan at ioz.ac.cn wrote:

Hi,

   I am sorry that I could not explain my question clearly. I ran MAKER with an annotation file (GFF) that has 23000 genes, but I got a result with only 21553 genes. Why are there fewer genes than in the original file? Thank you very much!

   Best wishes,
   Wang



From carsonhh at gmail.com  Thu Mar 15 09:57:37 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 15 Mar 2018 08:57:37 -0600
Subject: [maker-devel] Problems with failed contigs
In-Reply-To: <DB6PR0102MB27095AC27BB34FF5B0D7C1DED1D90@DB6PR0102MB2709.eurprd01.prod.exchangelabs.com>
References: <DB6PR0102MB27095AC27BB34FF5B0D7C1DED1D90@DB6PR0102MB2709.eurprd01.prod.exchangelabs.com>
Message-ID: <2F06777F-4BED-4CC8-B4F6-0B78D3EA9405@gmail.com>

First make sure you are using the most current version (3.01.02).

If you are providing a GFF3 file, the problem may lie with one or more of the features you are giving. Finally, this may be a BLAST report issue: there is a BLAST truncation bug that appears to be fixed in NCBI BLAST 2.7.1+.
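
A rough way to check an installed BLAST against the 2.7.1 threshold from the shell. This is only a sketch: the helper name is hypothetical, and the usage comment assumes that the first line of "blastn -version" looks like "blastn: 2.7.1+".

```shell
# Succeed (exit status 0) if version HAVE is at least version WANT.
# Trailing "+" and other non-numeric characters are ignored.
# usage: version_at_least "$(blastn -version | awk 'NR == 1 { print $2 }')" 2.7.1
version_at_least() {
    awk -v have="$1" -v want="$2" 'BEGIN {
        gsub(/[^0-9.]/, "", have)
        gsub(/[^0-9.]/, "", want)
        n = split(have, h, ".")
        m = split(want, w, ".")
        max = (n > m) ? n : m
        for (i = 1; i <= max; i++) {      # compare component by component
            a = h[i] + 0                  # missing components count as 0
            b = w[i] + 0
            if (a > b) exit 0
            if (a < b) exit 1
        }
        exit 0                            # all components equal
    }'
}
```

Comparing the components numerically matters here, because a plain string comparison would rank 2.10.0 below 2.7.1.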

--Carson


From carsonhh at gmail.com  Thu Mar 15 10:15:09 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 15 Mar 2018 09:15:09 -0600
Subject: [maker-devel] Using PASA assemblies with MAKER
In-Reply-To: <CAEW5o_M2KdOKnkB0KpVhDSTvMTBoqdGiimPpVHBoUCFWhbU+JA@mail.gmail.com>
References: <CAEW5o_M2KdOKnkB0KpVhDSTvMTBoqdGiimPpVHBoUCFWhbU+JA@mail.gmail.com>
Message-ID: <8542B83E-159A-47CA-A801-364386E0E2D2@gmail.com>

MAKER requires the two-level match/match_part format, an example of which can be found in the GFF3 specification: https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md

I haven't used PASA, but the file you are providing may be in GFF2 or GTF format, which is not compatible with GFF3.
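
To see which IDs would trip a "Non-unique top level ID" check, something like the sketch below can list them. It assumes that "top level" means a feature line without a Parent attribute, which is a reading of the error message rather than anything confirmed from the MAKER source; the function name is hypothetical.

```shell
# Print every ID that occurs on more than one parentless feature line of a GFF3 file.
# usage: duplicate_top_level_ids pasa_assemblies.gff3   (file name is a placeholder)
duplicate_top_level_ids() {
    awk -F '\t' '
        /^#/ || NF < 9 { next }                 # skip comments and malformed lines
        $9 ~ /(^|;)Parent=/ { next }            # child features are not top level
        {
            n = split($9, attrs, ";")
            for (i = 1; i <= n; i++)
                if (attrs[i] ~ /^ID=/) {
                    id = substr(attrs[i], 4)
                    if (++seen[id] == 2) print id   # report each duplicate once
                }
        }
    ' "$1"
}
```

An empty result means every parentless feature has its own ID; any ID printed would need renaming or regrouping into one match with several match_part children.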

--Carson



From carsonhh at gmail.com  Thu Mar 15 10:20:08 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 15 Mar 2018 09:20:08 -0600
Subject: [maker-devel] Some transcripts have no AED?
In-Reply-To: <59a852e2.418b6.16222a46d59.Coremail.wangzhennan@ioz.ac.cn>
References: <59a852e2.418b6.16222a46d59.Coremail.wangzhennan@ioz.ac.cn>
Message-ID: <B9C2C216-B995-4478-91EE-3DBDF7A7F112@gmail.com>

Do they have no AED, or an AED value of 0? A value of 0 means a perfect match to the evidence.

Also make sure you are not looking for AED in the predictions (i.e. GFF3 source column snap/augustus that are of type match/match_part). Those are rejected models and will not have quality statistics calculated for them (they are only there as alignments for reference purposes).

Only models with a source column of 'maker' and a type of gene/mRNA/exon/CDS represent the gene models and will have AED and QI values.

You can force MAKER to also calculate quality statistics for rejected models by changing pred_stats from its default of 0 to 1 in the maker_opts.ctl file.

pred_stats=1 #report AED and QI statistics for all predictions as well as models
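
To check whether any actual gene models lack an AED, a filter along these lines may help. It is a sketch that assumes the AED is stored as an "_AED=" attribute in column 9 of the mRNA lines whose source is maker; the attribute name is an assumption about the output format, and the function name is hypothetical.

```shell
# Print the sequence ID and attributes of maker mRNA features with no _AED attribute.
# usage: mrna_without_aed genome.all.gff   (file name is a placeholder)
mrna_without_aed() {
    awk -F '\t' '
        # Anchoring on start-of-field or ";" keeps _eAED from counting as _AED.
        $2 == "maker" && $3 == "mRNA" && $9 !~ /(^|;)_AED=/ { print $1 "\t" $9 }
    ' "$1"
}
```

If this prints nothing, the transcripts without AED values are most likely match/match_part prediction features rather than gene models.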

--Carson


From carsonhh at gmail.com  Thu Mar 15 10:26:26 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 15 Mar 2018 09:26:26 -0600
Subject: [maker-devel] scaffolds missing from master_datastore_index.log
 and all.gff files
In-Reply-To: <6F2917B6-12D5-41D6-BD12-CDD7FDF0F5C8@uw.edu>
References: <6F2917B6-12D5-41D6-BD12-CDD7FDF0F5C8@uw.edu>
Message-ID: <A63D1D3B-F5D7-4F19-B5CD-B8D941978586@gmail.com>

If you run multiple MAKER jobs at the same time, you can hit a race condition where one MAKER run keeps another from making the correct entry in the datastore log.

You can delete the log and then run 'maker -dsindex' to rebuild it with a single MAKER process (takes less than 5 minutes).
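
After rebuilding, a quick sanity check is to list any IDs that have a STARTED entry but no FINISHED entry. The sketch below assumes a tab-separated log with the scaffold ID in the first column and the status in the last column; the column layout is an assumption, and the function name is hypothetical.

```shell
# Print scaffold IDs that are marked STARTED but never FINISHED in the index log.
# usage: unfinished_scaffolds master_datastore_index.log
unfinished_scaffolds() {
    awk -F '\t' '
        $NF == "STARTED"  { started[$1] = 1 }
        $NF == "FINISHED" { finished[$1] = 1 }
        END {
            for (id in started)
                if (!(id in finished)) print id
        }
    ' "$1" | sort    # awk array order is unspecified, so sort for stable output
}
```

An empty result after the rebuild suggests the log is consistent again; anything still listed would be a contig that did not finish, not a logging artifact.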

--Carson



From carsonhh at gmail.com  Thu Mar 15 10:31:31 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 15 Mar 2018 09:31:31 -0600
Subject: [maker-devel] how to output masked genome from MAKER
In-Reply-To: <FC69DF53-A058-4B3E-A612-BE8FC4652857@uw.edu>
References: <FC69DF53-A058-4B3E-A612-BE8FC4652857@uw.edu>
Message-ID: <15BF15D1-835F-4E8E-9B52-A958EF19BCD9@gmail.com>

You will just have to find and concatenate the files yourself.

Something like this:

find assembly.maker.output -name 'query.masked.fasta' | xargs cat > assembly_masked.fasta
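
The same idea can be wrapped in a small function that avoids xargs, so paths containing spaces cannot break the command. This is a sketch, not a MAKER-provided tool; the function name is hypothetical, and the directory and output names in the usage comment are the placeholders used above.

```shell
# Concatenate every query.masked.fasta under a MAKER output directory into one file.
# usage: collate_masked assembly.maker.output assembly_masked.fasta
collate_masked() {
    # -exec ... {} + passes the file names straight to cat, in batches.
    find "$1" -type f -name 'query.masked.fasta' -exec cat {} + > "$2"
}
```

Counting the header lines afterwards, for example with grep -c '^>' assembly_masked.fasta, and comparing that against the number of scaffolds in the assembly is a cheap check that nothing was missed.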

--Carson



From vsoza at uw.edu  Thu Mar 15 13:18:46 2018
From: vsoza at uw.edu (Valerie Soza)
Date: Thu, 15 Mar 2018 11:18:46 -0700
Subject: [maker-devel] scaffolds missing from master_datastore_index.log
 and all.gff files
In-Reply-To: <A63D1D3B-F5D7-4F19-B5CD-B8D941978586@gmail.com>
References: <6F2917B6-12D5-41D6-BD12-CDD7FDF0F5C8@uw.edu>
	<A63D1D3B-F5D7-4F19-B5CD-B8D941978586@gmail.com>
Message-ID: <53B85802-8AB9-4DB4-AA8A-137DD72AF4D7@uw.edu>

Thanks, Carson, that worked. I am now getting all scaffolds in the assembly indicated as finished in the master_datastore_index.log and when I redo gff3_merge with this updated log, my all.gff file is complete too. Glad there is a quick fix for this in MAKER. Thank you, MAKER developers.

-Valerie

Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/


From seoanezonjic at hotmail.com  Fri Mar 16 04:33:28 2018
From: seoanezonjic at hotmail.com (p sz)
Date: Fri, 16 Mar 2018 09:33:28 +0000
Subject: [maker-devel] Problems with failed contigs
In-Reply-To: <2F06777F-4BED-4CC8-B4F6-0B78D3EA9405@gmail.com>
References: <DB6PR0102MB27095AC27BB34FF5B0D7C1DED1D90@DB6PR0102MB2709.eurprd01.prod.exchangelabs.com>,
	<2F06777F-4BED-4CC8-B4F6-0B78D3EA9405@gmail.com>
Message-ID: <DB6PR0102MB270989C66E53B92567ED2226D1D00@DB6PR0102MB2709.eurprd01.prod.exchangelabs.com>

I checked the MAKER version and it is 3.01.02, as you suggest. I am currently running MAKER to train SNAP (first round), so I am not using GFF files as input. My input files are all in FASTA format, set in the genome, est and protein variables. In fact, one of the error lines says that the transcript annotation is what breaks the execution. I use BLAST version 2.2.30+; maybe I have to use a newer version. Which version do you suggest?
Thank you in advance
________________________________
From: Carson Holt <carsonhh at gmail.com>
Sent: Thursday, March 15, 2018 2:57:37 PM
To: p sz
Cc: maker-devel at yandell-lab.org
Subject: Re: [maker-devel] Problems with failed contigs

First make sure you are using the most current version (3.01.02).

If you are providing a GFF3 file, it may be an issue with one or more of the features you are giving. Finally, there is a possibility this is a BLAST report issue, as there is a BLAST truncation bug that appears to be fixed in NCBI BLAST 2.7.1+

--Carson

On Mar 6, 2018, at 2:30 AM, p sz <seoanezonjic at hotmail.com> wrote:

Hi
Currently I'm annotating a fish genome using MAKER version 3, and the execution doesn't finish properly. I run MAKER on a shared cluster through MPI, asking for 96 cores (in order not to block the MAKER manager or the file system) on 6 nodes of 16 cores each. The job runs for 3 or 4 days and then the output suddenly stops. The job stays alive for 3 more days, but the datastore index is not modified and the only files whose attributes change are the lock files. When the job exceeds the time limit (7 days) it is cancelled. Then I inspect the datastore index and count the unique STARTED and FINISHED tags for each scaffold. The results are:
STARTED:3890
FINISHED:3378
So, there are about 500 scaffolds for which the analysis fails. When I inspect the MAKER STDERR output I see this error line repeated ~2000 times, at different places:
substr outside of string at /mnt/home/users/pab_001_uma/pedro/software/maker/bin/../lib/PhatHit_utils.pm line 850
and near this line, the following:
ERROR: Failed while annotating transcripts
My questions are: what can I do to annotate the failed scaffolds? In addition, is the large number of failed analyses the reason for the job freeze?
Thanks in advance
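
The STARTED/FINISHED comparison described above can be sketched in Python. This is only a minimal sketch: it assumes the datastore index (master_datastore_index.log) has three tab-separated columns (sequence name, output directory, status tag), and the function names are made up for illustration.

```python
from collections import defaultdict


def summarize_index(lines):
    """Group the status tags seen for each sequence in the index log.

    Assumed layout (not confirmed in this thread): name<TAB>directory<TAB>status.
    """
    statuses = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        statuses[fields[0]].add(fields[2])
    return statuses


def unfinished(statuses):
    """Return sequences that have a STARTED tag but no FINISHED tag."""
    return sorted(
        name for name, tags in statuses.items()
        if "STARTED" in tags and "FINISHED" not in tags
    )


# Tiny in-memory example; with a real log, pass an open file handle instead.
example_log = [
    "scaffold_1\tdatastore/scaffold_1/\tSTARTED\n",
    "scaffold_1\tdatastore/scaffold_1/\tFINISHED\n",
    "scaffold_2\tdatastore/scaffold_2/\tSTARTED\n",
]
example_stats = summarize_index(example_log)
example_unfinished = unfinished(example_stats)
```

Because the tags are collected per sequence name, a scaffold that appears several times in the log is still counted once, which matches the "unique tags per scaffold" count above. The list returned by unfinished() is the set of scaffolds worth inspecting in STDERR.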



From vsoza at uw.edu  Tue Mar 20 19:48:09 2018
From: vsoza at uw.edu (Valerie Soza)
Date: Tue, 20 Mar 2018 17:48:09 -0700
Subject: [maker-devel] clarification on creating a standard build
Message-ID: <ADACE3F2-BD4C-43B9-B887-D25FC51F562B@uw.edu>

Hi MAKER community

I am trying to create a standard build as indicated in the Campbell et al. 2014 papers in Plant Physiology and Current Protocols in Bioinformatics. I was following the protocol as outlined in Current Protocols in Bioinformatics, but then came across this thread in the MAKER google forum: https://groups.google.com/forum/#!searchin/maker-devel/quality_filter%7Csort:date/maker-devel/97aNJkT3bgk/mpL7V5QWAAAJ.

I can't reply to this original thread, but I am trying to follow Carson's suggestion for a standard build using this protocol instead now:
"One note I'd like to make, is that doing a second round with keep_preds=1 is the wrong procedure (only do that if you really want to keep everything - i.e. in some fungi or oomycetes). Rather you should use InterProScan to evaluate the rejected models in the non-overlapping.abinit.proteins.fasta file, then grep the ones that have an IPR domain out of the GFF3 (will be match/match_part features) and then pass them to pred_gff in a separate run (just updates the format to gene/mRNA/exon/CDS with proper reading frame). You can then merge the resulting GFF3's and fasta files."

Instead of doing a second round of annotations with keep_preds=1, I am using my original annotations with keep_preds=0. I have used InterProScan on the non-overlapping.abinit.proteins.fasta. I am unclear as to what gff3 file to use to grep for genes with IPR domains from the non-overlapping.abinit.proteins.fasta file. Genes from the non-overlapping.abinit.proteins.fasta file are not in my .all.gff file created by the gff3_merge script. 

What gff3 file should I be using to resurrect proteins with IPR domains from the non-overlapping.abinit.proteins.fasta? Should I be doing an annotation with keep_preds=1 as well, and resurrecting genes with IPR domains from this gff3?

Thanks.

-Valerie

Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/


From urmi208 at gmail.com  Wed Mar 21 04:05:42 2018
From: urmi208 at gmail.com (Urmi)
Date: Wed, 21 Mar 2018 09:05:42 +0000
Subject: [maker-devel] Gene loss in subsequent round of maker for fungal
 genome annotation
Message-ID: <CAGe_+EuV7r9wcw76B9STOFwzU3d5aifDP-Zp=KOZd7vZskZNfw@mail.gmail.com>

Hello maker community,

I am trying to run maker 3.01.02-beta on a fungal genome. I am using
available EST and protein sequences from a different strain of the same
species using parameters "est" and "protein" in the maker_opts.ctl file.
Here is the protocol I am using:

   1. Run maker with repeat masking and providing transcript and protein
   sequences from related species (Run A)
   2. Create SNAP model with CEGMA
   3. Train Augustus with BUSCO
   4. Run maker (Run B) with the new SNAP model (from step 2) and the
   Augustus species, with est2genome=0 and protein2genome=0, and provide the
   Run A alignments as GFF files (altest_gff=runA_cdna2genome.gff,
   protein_gff=runA_protein2genome.gff3)
   5. Create SNAP model from Run B.
   6. Train Augustus with transcripts from Run B and BUSCO
   7. Run maker (Run C) with the new SNAP model (from step 5) and the
   Augustus species, with est2genome=0 and protein2genome=0, provide the
   same GFF files (altest_gff=runA_cdna2genome.gff,
   protein_gff=runA_protein2genome.gff3), and set keep_preds=1

As a result, I get the following gene numbers:

   - Run A: 12796 total genes, of which 12771 have AED < 0.5
   - Run B: 10713 total genes, of which 10701 have AED < 0.5
   - Run C: 12651 total genes, of which 12582 have AED < 0.5
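
Counts like the ones above can be reproduced from a MAKER GFF3 with a short script. The sketch below is an assumption about how such a count might be done, not the method actually used here: it treats a gene as passing if at least one of its mRNAs has an _AED attribute below the threshold, and the function names are illustrative.

```python
def parse_attributes(column9):
    """Turn a GFF3 attribute string 'key=value;key=value' into a dict."""
    attrs = {}
    for item in column9.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key] = value
    return attrs


def count_genes_by_aed(gff_lines, threshold=0.5):
    """Return (total maker genes, genes with an mRNA whose _AED < threshold)."""
    genes = set()
    passing = set()
    for line in gff_lines:
        if line.startswith("#"):
            continue  # header and directive lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[1] != "maker":
            continue  # evidence features, FASTA section, malformed lines
        attrs = parse_attributes(cols[8])
        if cols[2] == "gene":
            genes.add(attrs.get("ID"))
        elif cols[2] == "mRNA" and "_AED" in attrs:
            if float(attrs["_AED"]) < threshold:
                passing.add(attrs.get("Parent"))
    return len(genes), len(genes & passing)
```

Called with an open file handle for each run's GFF3, this returns the two numbers reported per run, under the stated assumption about what "have AED < 0.5" means.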

Looking at the GFF files in detail, I observed that some gene models in
Run A are lost in Run B and regained in Run C. I don't understand why there
is gene loss in Run B. Here is an example:

RunA:

contig1 maker   gene    20468   21193   .       +       .       ID=maker-contig1-exonerate_protein2genome-gene-0.34;Name=maker-contig1-exonerate_protein2genome-gene-0.34
contig1 maker   mRNA    20468   21193   100     +       .       ID=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1;Parent=maker-contig1-exonerate_protein2genome-gene-0.34;Name=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1;_AED=0.30;_eAED=0.30;_QI=0|-1|0|1|-1|0|1|0|241
contig1 maker   exon    20468   21193   .       +       .       ID=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1:1;Parent=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1
contig1 maker   CDS     20468   21193   .       +       0       ID=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1:cds;Parent=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1
contig1 blastn  expressed_sequence_match        20468   21193   726     +       .       ID=contig1:hit:983:3.2.0.0;Name=jgi|test_1|140804|est target_length=726
contig1 blastn  match_part      20468   21193   726     +       .       ID=contig1:hsp:998:3.2.0.0;Parent=contig1:hit:983:3.2.0.0;Target=jgi|test_1|140804|est 1 726 +;Gap=M726
contig1 est2genome      expressed_sequence_match        20468   21193   3630    +       .       ID=contig1:hit:1022:3.2.0.0;Name=jgi|test_1|140804|est;target_length=726;aligned_coverage=100;aligned_identity=100
contig1 est2genome      match_part      20468   21193   3630    +       .       ID=contig1:hsp:1110:3.2.0.0;Parent=contig1:hit:1022:3.2.0.0;Target=jgi|test_1|140804|est 1 726 +;Gap=M726

RunB:

contig1 est_gff:est2genome      expressed_sequence_match        20468   21193   3630    +       .       ID=contig1:hit:1051:3.12.0.0;Name=jgi|test_1|140804|est;target_length=726;aligned_coverage=100;aligned_identity=100;aligned_coverage=100;aligned_identity=100;score=3630;target_length=726
contig1 est_gff:est2genome      match_part      20468   21193   3630    +       .       ID=contig1:hsp:1166:3.12.0.0;Parent=contig1:hit:1051:3.12.0.0;Target=jgi|test_1|140804|est 1 726 +;Gap=M726

RunC:

contig1 maker   gene    20468   21193   .       +       .       ID=snap_masked-contig1-processed-gene-0.5;Name=snap_masked-contig1-processed-gene-0.5
contig1 maker   mRNA    20468   21193   .       +       .       ID=snap_masked-contig1-processed-gene-0.5-mRNA-1;Parent=snap_masked-contig1-processed-gene-0.5;Name=snap_masked-contig1-processed-gene-0.5-mRNA-1;_AED=0.30;_eAED=0.30;_QI=0|-1|0|1|-1|1|1|0|241;_merge_warning=1
contig1 maker   exon    20468   21193   .       +       .       ID=snap_masked-contig1-processed-gene-0.5-mRNA-1:1;Parent=snap_masked-contig1-processed-gene-0.5-mRNA-1
contig1 maker   CDS     20468   21193   .       +       0       ID=snap_masked-contig1-processed-gene-0.5-mRNA-1:cds;Parent=snap_masked-contig1-processed-gene-0.5-mRNA-1
contig1 snap_masked     match   20468   21193   42.956  +       .       ID=contig1:hit:5240:4.5.0.0;Name=snap_masked-contig1-abinit-gene-0.5-mRNA-1;target_length=4075195
contig1 snap_masked     match_part      20468   21193   42.956  +       .       ID=contig1:hsp:12911:4.5.0.0;Parent=contig1:hit:5240:4.5.0.0;Target=snap_masked-contig1-abinit-gene-0.5-mRNA-1 1 726 +;Gap=M726
contig1 est_gff:est2genome      expressed_sequence_match        20468   21193   3630    +       .       ID=contig1:hit:1051:3.12.0.0;Name=jgi|test_1|140804|est;target_length=726;aligned_coverage=100;aligned_identity=100;aligned_coverage=100;aligned_identity=100;score=3630;target_length=726
contig1 est_gff:est2genome      match_part      20468   21193   3630    +       .       ID=contig1:hsp:1166:3.12.0.0;Parent=contig1:hit:1051:3.12.0.0;Target=jgi|test_1|140804|est 1 726 +;Gap=M726

Could anyone please shed some light on this?


Many thanks in advance.

Urmi

From urmi208 at gmail.com  Wed Mar 21 04:24:32 2018
From: urmi208 at gmail.com (Urmi)
Date: Wed, 21 Mar 2018 09:24:32 +0000
Subject: [maker-devel] Gene loss in subsequent round of maker for fungal
 genome annotation
In-Reply-To: <CAGe_+EuV7r9wcw76B9STOFwzU3d5aifDP-Zp=KOZd7vZskZNfw@mail.gmail.com>
References: <CAGe_+EuV7r9wcw76B9STOFwzU3d5aifDP-Zp=KOZd7vZskZNfw@mail.gmail.com>
Message-ID: <CAGe_+EsxV0KWrbYt01RMOzBDSvUBGMmaMu=c4t4ubfWPmsuEyQ@mail.gmail.com>

Further to this, I ran InterProScan on all three runs, and 100% of the
genes from each of them have protein domains. I am unsure which one I
should consider the best annotation. I am sorry for so many questions,
but I am very new to maker.

Thanks again for any help you could provide.



-- 
"The only way of finding the limits of the possible is by going beyond them
into the impossible." - Arthur C. Clarke

From carsonhh at gmail.com  Fri Mar 23 12:20:22 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 23 Mar 2018 11:20:22 -0600
Subject: [maker-devel] clarification on creating a standard build
In-Reply-To: <ADACE3F2-BD4C-43B9-B887-D25FC51F562B@uw.edu>
References: <ADACE3F2-BD4C-43B9-B887-D25FC51F562B@uw.edu>
Message-ID: <4D9CD374-D142-4387-A4D7-1FB7A8A74F48@gmail.com>

You will get the ID from the InterProScan report. Then you can take that ID and grep for it in the MAKER gff3.

All ab initio predictions have match/match_part features created for them in the gff3. You can then take those non-gene match/match_part features and provide them to pred_gff.
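
The grep step above can be sketched in Python. This is a sketch under assumptions rather than a MAKER-provided tool: it assumes InterProScan TSV output with the InterPro accession in column 12 (which requires the InterPro lookup to have been enabled), and it assumes the protein ID in the report equals the Name attribute of the corresponding match feature. The function names are hypothetical.

```python
def parse_attributes(column9):
    """Turn a GFF3 attribute string 'key=value;key=value' into a dict."""
    attrs = {}
    for item in column9.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key] = value
    return attrs


def ids_with_ipr(interpro_tsv_lines):
    """Collect protein IDs that carry an InterPro accession (assumed column 12)."""
    hits = set()
    for line in interpro_tsv_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 12 and cols[11].startswith("IPR"):
            hits.add(cols[0])
    return hits


def rescue_features(gff_lines, protein_ids):
    """Keep match features named in protein_ids plus their match_part children."""
    gff_lines = list(gff_lines)
    parsed = []
    kept_match_ids = set()
    # First pass: find the match features whose Name is a rescued protein ID.
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        attrs = parse_attributes(cols[8]) if len(cols) == 9 else {}
        parsed.append((cols, attrs))
        if len(cols) == 9 and cols[2] == "match":
            if attrs.get("Name") in protein_ids and "ID" in attrs:
                kept_match_ids.add(attrs["ID"])
    # Second pass: emit those matches and the match_part lines pointing at them.
    kept = []
    for line, (cols, attrs) in zip(gff_lines, parsed):
        if len(cols) != 9:
            continue
        if cols[2] == "match" and attrs.get("ID") in kept_match_ids:
            kept.append(line)
        elif cols[2] == "match_part" and attrs.get("Parent") in kept_match_ids:
            kept.append(line)
    return kept
```

The two passes are needed because match_part lines refer to their parent by the hit ID (Parent=contig1:hit:...), not by the protein name, so a plain grep on the protein ID alone would pick up the match line but could miss its parts.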

You then have two alternate ways to get those models into your dataset.

1. Do a second run with only that pred_gff (i.e. turn off all other MAKER options and blank out all evidence including repeat masking options) and set keep_preds=1.

That will simply take the pred_gff match/match_part values and turn them into nicely formatted gene/mRNA/exon/CDS features together with associated fasta files. Those can then simply be merged into your current result using gff3_merge.

2. Provide maker_gff (set pred_pass=0 and all other pass options to 1), provide pred_gff, and set keep_preds=1.

This is the same as the previous run option, but MAKER will do the merging for you. But it will take longer since it will use the maker_gff to rebuild all models and evidence in memory and rescore everything.
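
The two alternatives above amount to different settings in maker_opts.ctl. The sketch below shows one way to express them as override tables and apply them to the text of a control file. The option names are the ones I recall from a MAKER-generated maker_opts.ctl and should be checked against your own file; the GFF file names and the helper function are hypothetical.

```python
import re

# Alternative 1: only pred_gff, everything else blanked or switched off.
OPTION_1 = {
    "est": "", "altest": "", "protein": "",
    "est_gff": "", "altest_gff": "", "protein_gff": "",
    "model_org": "", "rmlib": "", "repeat_protein": "", "rm_gff": "",
    "snaphmm": "", "augustus_species": "",
    "est2genome": "0", "protein2genome": "0",
    "pred_gff": "rescued_abinit.gff",  # hypothetical file name
    "keep_preds": "1",
}

# Alternative 2: reuse the earlier MAKER GFF3 and let MAKER do the merging.
OPTION_2 = {
    "maker_gff": "previous_run.all.gff",  # hypothetical file name
    "est_pass": "1", "altest_pass": "1", "protein_pass": "1",
    "rm_pass": "1", "model_pass": "1", "other_pass": "1",
    "pred_pass": "0",
    "pred_gff": "rescued_abinit.gff",  # hypothetical file name
    "keep_preds": "1",
}


def apply_overrides(ctl_text, overrides):
    """Rewrite 'key=value #comment' lines, keeping each trailing comment."""
    out = []
    for line in ctl_text.splitlines():
        match = re.match(r"^(\w+)=([^#]*)(#.*)?$", line)
        if match and match.group(1) in overrides:
            key = match.group(1)
            comment = match.group(3) or ""
            line = f"{key}={overrides[key]} {comment}".rstrip()
        out.append(line)
    return "\n".join(out) + "\n"
```

Lines whose key is not in the override table, and pure comment lines, pass through unchanged, so the result can be written back as a new control file for the rescue run.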

--Carson




From carsonhh at gmail.com  Fri Mar 23 12:28:50 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 23 Mar 2018 11:28:50 -0600
Subject: [maker-devel] Gene loss in subsequent round of maker for fungal
 genome annotation
In-Reply-To: <CAGe_+EuV7r9wcw76B9STOFwzU3d5aifDP-Zp=KOZd7vZskZNfw@mail.gmail.com>
References: <CAGe_+EuV7r9wcw76B9STOFwzU3d5aifDP-Zp=KOZd7vZskZNfw@mail.gmail.com>
Message-ID: <7BA23B5D-01FE-4766-B7B8-756E0385453D@gmail.com>

Run A -> no gene prediction, just cut and paste of transcript/protein alignments to generate rough models.
Run B -> Gene predictions based on training using only a highly conserved subset of genes (you will have low sensitivity).
Run C -> Gene predictions based on training using a broader gene set. Higher sensitivity but potentially lower specificity (sensitivity gains should outweigh any specificity loss).
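
To list which loci one run dropped relative to another, the gene features of the two GFF3 files can be compared directly. The sketch below is a deliberately simple illustration with made-up function names: it matches genes by exact contig, start, end and strand, so a model whose boundaries shifted slightly between runs would be reported as lost.

```python
def gene_loci(gff_lines):
    """Collect (contig, start, end, strand) for every maker gene feature."""
    loci = set()
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9 and cols[1] == "maker" and cols[2] == "gene":
            loci.add((cols[0], int(cols[3]), int(cols[4]), cols[6]))
    return loci


def lost_loci(run_a_lines, run_b_lines):
    """Return loci annotated as genes in run A but not in run B."""
    return sorted(gene_loci(run_a_lines) - gene_loci(run_b_lines))
```

For the contig1 example in this thread, the locus 20468-21193 would be reported by lost_loci(run A, run B) and not by lost_loci(run A, run C).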

Finally, make sure you look at models in a browser to see how well evidence and models overlap. If gene fusion is an issue (falsely merged mRNA-seq assembly results will generate hints that can cause gene predictors to fuse gene models), try deFusion -> https://wjidea.github.io/defusion/installation.html

--Carson




From urmi208 at gmail.com  Mon Mar 26 02:28:21 2018
From: urmi208 at gmail.com (Urmi)
Date: Mon, 26 Mar 2018 08:28:21 +0100
Subject: [maker-devel] Gene loss in subsequent round of maker for fungal
 genome annotation
In-Reply-To: <7BA23B5D-01FE-4766-B7B8-756E0385453D@gmail.com>
References: <CAGe_+EuV7r9wcw76B9STOFwzU3d5aifDP-Zp=KOZd7vZskZNfw@mail.gmail.com>
	<7BA23B5D-01FE-4766-B7B8-756E0385453D@gmail.com>
Message-ID: <CAGe_+EuGU0P4OdHR2cxvNSAKQN24FvW3-9YEFv70uNvDZYxVmQ@mail.gmail.com>

That's great! Thanks for the tips Carson.

Urmi

On Fri, Mar 23, 2018 at 5:28 PM, Carson Holt <carsonhh at gmail.com> wrote:

> Run A -> no gene prediction, just cut and paste of transcript/protein
> alignments to generate rough models.
> Run B -> Gene predictions based on training using only highly conserved
> subset of genes (you will have low sensitivity)
> Run C -> Gene predictions based on training using broader gene set. Higher
> sensitivity but potentially lower specificity (sensitivity gains should
> outweigh any specificity loss).
>
> Finally, make sure you look at models in a browser to see how well
> evidence and models overlap. If gene fusion is an issue (falsely merged
> mRNA-seq assembly results will generate hints that can cause gene
> predictors to fuse gene models), try deFusion ->
> https://wjidea.github.io/defusion/installation.html
>
> --Carson
>
>
>
> On Mar 21, 2018, at 3:05 AM, Urmi <urmi208 at gmail.com> wrote:
>
> Hello maker community,
>
> I am trying to run maker 3.01.02-beta on a fungal genome. I am using
> available EST and protein sequences from a different strain of the same
> species using parameters "est" and "protein" in the maker_opts.ctl file.
> Here is the protocol I am using:
>
>    1. Run maker with repeat masking and providing transcript and protein
>    sequences from related species (Run A)
>    2. Create SNAP model with CEGMA
>    3. Train Augustus with BUSCO
>    4. Run (run B ) with the new SNAP (done at step 2) and augustus
>    species with options turned off (est2genome=0) and (protein2genome=0) data,
>    provide gff file (altest_gff=runA_cdna2genome.gff, protein_gff=runA_
>    protein2genome.gff3)
>    5. Create SNAP model from run B.
>    6. Train Augustus with transcripts from run B and BUSCO
>    7. Run (run C ) with the new SNAP (done at step 5) and augustus
>    species with options turned off (est2genome=0) and (protein2genome=0) data,
>    provide gff file (altest_gff=runA_cdna2genome.gff, protein_gff=runA_protein2genome.gff3),
>    keep_preds=1
>
> As a result of this, I get following gene numbers:
>
>    - run A: 12796 total genes out of which 12771 have AED < 0.5
>    - run B:10713 total genes out of which 10701 have AED < 0.5
>    - run C: 12651 total genes out of which 12582 have AED < 0.5
>
> Looking at the gff files in detail, it is observerd that there are some
> gene models in run A which are lost in run B and gain in run C. I don't
> understand why there is gene loss for run B. Here is an example:
>
> *RunA*
>
> contig1	maker	gene	20468	21193	.	+	.	ID=maker-contig1-exonerate_protein2genome-gene-0.34;Name=maker-contig1-exonerate_protein2genome-gene-0.34
> contig1	maker	mRNA	20468	21193	100	+	.	ID=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1;Parent=maker-contig1-exonerate_protein2genome-gene-0.34;Name=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1;_AED=0.30;_eAED=0.30;_QI=0|-1|0|1|-1|0|1|0|241
> contig1	maker	exon	20468	21193	.	+	.	ID=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1:1;Parent=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1
> contig1	maker	CDS	20468	21193	.	+	0	ID=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1:cds;Parent=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1
> contig1	blastn	expressed_sequence_match	20468	21193	726	+	.	ID=contig1:hit:983:3.2.0.0;Name=jgi|test_1|140804|est target_length=726
> contig1	blastn	match_part	20468	21193	726	+	.	ID=contig1:hsp:998:3.2.0.0;Parent=contig1:hit:983:3.2.0.0;Target=jgi|test_1|140804|est 1 726 +;Gap=M726
> contig1	est2genome	expressed_sequence_match	20468	21193	3630	+	.	ID=contig1:hit:1022:3.2.0.0;Name=jgi|test_1|140804|est;target_length=726;aligned_coverage=100;aligned_identity=100
> contig1	est2genome	match_part	20468	21193	3630	+	.	ID=contig1:hsp:1110:3.2.0.0;Parent=contig1:hit:1022:3.2.0.0;Target=jgi|test_1|140804|est 1 726 +;Gap=M726
>
> *RunB:*
>
> contig1	est_gff:est2genome	expressed_sequence_match	20468	21193	3630	+	.	ID=contig1:hit:1051:3.12.0.0;Name=jgi|test_1|140804|est;target_length=726;aligned_coverage=100;aligned_identity=100;aligned_coverage=100;aligned_identity=100;score=3630;target_length=726
> contig1	est_gff:est2genome	match_part	20468	21193	3630	+	.	ID=contig1:hsp:1166:3.12.0.0;Parent=contig1:hit:1051:3.12.0.0;Target=jgi|test_1|140804|est 1 726 +;Gap=M726
>
> *RunC: *
>
> contig1	maker	gene	20468	21193	.	+	.	ID=snap_masked-contig1-processed-gene-0.5;Name=snap_masked-contig1-processed-gene-0.5
> contig1	maker	mRNA	20468	21193	.	+	.	ID=snap_masked-contig1-processed-gene-0.5-mRNA-1;Parent=snap_masked-contig1-processed-gene-0.5;Name=snap_masked-contig1-processed-gene-0.5-mRNA-1;_AED=0.30;_eAED=0.30;_QI=0|-1|0|1|-1|1|1|0|241;_merge_warning=1
> contig1	maker	exon	20468	21193	.	+	.	ID=snap_masked-contig1-processed-gene-0.5-mRNA-1:1;Parent=snap_masked-contig1-processed-gene-0.5-mRNA-1
> contig1	maker	CDS	20468	21193	.	+	0	ID=snap_masked-contig1-processed-gene-0.5-mRNA-1:cds;Parent=snap_masked-contig1-processed-gene-0.5-mRNA-1
> contig1	snap_masked	match	20468	21193	42.956	+	.	ID=contig1:hit:5240:4.5.0.0;Name=snap_masked-contig1-abinit-gene-0.5-mRNA-1;target_length=4075195
> contig1	snap_masked	match_part	20468	21193	42.956	+	.	ID=contig1:hsp:12911:4.5.0.0;Parent=contig1:hit:5240:4.5.0.0;Target=snap_masked-contig1-abinit-gene-0.5-mRNA-1 1 726 +;Gap=M726
> contig1	est_gff:est2genome	expressed_sequence_match	20468	21193	3630	+	.	ID=contig1:hit:1051:3.12.0.0;Name=jgi|test_1|140804|est;target_length=726;aligned_coverage=100;aligned_identity=100;aligned_coverage=100;aligned_identity=100;score=3630;target_length=726
> contig1	est_gff:est2genome	match_part	20468	21193	3630	+	.	ID=contig1:hsp:1166:3.12.0.0;Parent=contig1:hit:1051:3.12.0.0;Target=jgi|test_1|140804|est 1 726 +;Gap=M726
>
> Could anyone please shed some light on this?
>
>
> Many thanks in advance.
>
> Urmi
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>
>

From vsoza at uw.edu  Mon Mar 26 13:49:24 2018
From: vsoza at uw.edu (Valerie Soza)
Date: Mon, 26 Mar 2018 11:49:24 -0700
Subject: [maker-devel] clarification on creating a standard build
In-Reply-To: <4D9CD374-D142-4387-A4D7-1FB7A8A74F48@gmail.com>
References: <ADACE3F2-BD4C-43B9-B887-D25FC51F562B@uw.edu>
	<4D9CD374-D142-4387-A4D7-1FB7A8A74F48@gmail.com>
Message-ID: <57B30565-1603-4723-AF74-FEB54F735899@uw.edu>

Hi Carson

Thanks for the clarification on steps, but I am having issues with the first step of grepping the ID from the InterProScan .tsv file in my .gff file. I am getting no results of these IDs from the .gff file. I think the IDs used in the maker.non_overlapping_ab_initio.proteins.fasta are different from what is actually used in the .gff file. Please see my commands/results below.

I created the .gff file by this command:
gff3_merge -d Rwill7_master_datastore_index.log

I created the .fasta files by this command:
fasta_merge -d Rwill7_master_datastore_index.log

I ran InterProScan with this command:
interproscan.sh -appl PfamA -iprlookup -goterms -f tsv -i Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta

When I try grepping the IDs in the Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv from Rwill7.all.gff, I get nothing. See an example for one ID from the .tsv file below:
 
$ more Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv

snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1	d146190e642a740520c97a782a74fe32	356	Pfam	PF13365	Trypsin-like peptidase domain	77	228	1.4E-17	T	20-03-2018

$ grep snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1 Rwill7.all.gff
#no results

There is no "processed-gene" with this ID in the Rwill7.all.gff file:

$ grep snap_masked-LG12_ordered_scaffold_85-processed-gene Rwill7.all.gff

LG12_ordered_scaffold_85	maker	gene	63727	67053	.	+	.	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8;Name=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8
LG12_ordered_scaffold_85	maker	mRNA	63727	67053	.	+	.	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8;Name=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1;_AED=0.81;_eAED=0.81;_QI=0|0|0|0|1|1|6|0|200
LG12_ordered_scaffold_85	maker	exon	63727	63768	.	+	.	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42245;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
LG12_ordered_scaffold_85	maker	exon	64269	64340	.	+	.	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42246;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
LG12_ordered_scaffold_85	maker	exon	64896	65000	.	+	.	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42247;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
LG12_ordered_scaffold_85	maker	exon	65268	65327	.	+	.	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42248;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
LG12_ordered_scaffold_85	maker	exon	65716	65915	.	+	.	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42249;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
LG12_ordered_scaffold_85	maker	exon	66930	67053	.	+	.	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42250;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
LG12_ordered_scaffold_85	maker	CDS	63727	63768	.	+	0	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
LG12_ordered_scaffold_85	maker	CDS	64269	64340	.	+	0	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
LG12_ordered_scaffold_85	maker	CDS	64896	65000	.	+	0	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
LG12_ordered_scaffold_85	maker	CDS	65268	65327	.	+	0	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
LG12_ordered_scaffold_85	maker	CDS	65716	65915	.	+	0	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
LG12_ordered_scaffold_85	maker	CDS	66930	67053	.	+	1	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1

However, there are Names and Targets called "abinit-gene", not "processed-gene", that appear to have this gene number in the .gff file:

$ grep snap_masked-LG12_ordered_scaffold_85 Rwill7.all.gff

#some results from command above that might be snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1?

LG12_ordered_scaffold_85	snap_masked	match	101798	108141	35.366	+	ID=LG12_ordered_scaffold_85:hit:469677:4.5.0.0;Name=snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1
LG12_ordered_scaffold_85	snap_masked	match_part	101798	102633	35.236	ID=LG12_ordered_scaffold_85:hsp:1369290:4.5.0.0;Parent=LG12_ordered_scaffold_85:hit:469677:4.5.0.0;Target=snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1 1 836 +;Gap=M836
LG12_ordered_scaffold_85	snap_masked	match_part	107907	108141	0.130	ID=LG12_ordered_scaffold_85:hsp:1369291:4.5.0.0;Parent=LG12_ordered_scaffold_85:hit:469677:4.5.0.0;Target=snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1 837 1071 +;Gap=M235

So I looked at the maker.non_overlapping_ab_initio.proteins.fasta file to see if this "abinit-gene" from the .gff file was present:

$ grep snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1 Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta
#no results using the "abinit-gene" Name from the .gff file

versus using the ID from the .tsv file to grep against the maker.non_overlapping_ab_initio.proteins.fasta file:

$ grep snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1 Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta
>snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1 protein AED:1.00 eAED:1.00 QI:0|0|0|0|1|1|2|0|356

I think the issue is that the "abinit-gene" entries in the .gff file are named "processed-gene" in Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta, and that name is then propagated into the .tsv file. Is my interpretation correct?

If so, all I would have to do is grep the .gff file after replacing "processed" with "abinit" in the IDs, correct?

Thanks for your help.

-Valerie

> On Mar 23, 2018, at 10:20 AM, Carson Holt <carsonhh at gmail.com> wrote:
> 
> You will get the ID from the InterProScan report. Then you can take that ID and grep for it in the MAKER gff3.
> 
> All ab initio predictions have match/match_part features created for them in the gff3. You can then take those non-gene match/match_part features and provide them to pred_gff.
> 
> You then have two alternate ways to get those models into your dataset.
> 
> 1. Do a second run with only that pred_gff (i.e. turn off all other MAKER options and blank out all evidence including repeat masking options) and set keep_preds=1.
> 
> That will simply take the pred_gff match/match_part values and turn them into nicely formatted gene/mRNA/exon/CDS features together with associated fasta files. Those can then simply be merged into your current result using GFF3 merge.
> 
> 2. Provide maker_gff (set pred_pass=0 and all other pass options to 1), provide pred_gff, and set keep_preds=1.
> 
> This is the same as the previous run option, but MAKER will do the merging for you. But it will take longer since it will use the maker_gff to rebuild all models and evidence in memory and rescore everything.
> 
> --Carson
> 
> 
> 
>> On Mar 20, 2018, at 6:48 PM, Valerie Soza <vsoza at uw.edu> wrote:
>> 
>> Hi MAKER community
>> 
>> I am trying to create a standard build as indicated in the Campbell et al. 2014 papers in Plant Physiology and Current Protocols in Bioinformatics. I was following the protocol as outlined in Current Protocols in Bioinformatics, but then came across this thread in the MAKER google forum: https://groups.google.com/forum/#!searchin/maker-devel/quality_filter%7Csort:date/maker-devel/97aNJkT3bgk/mpL7V5QWAAAJ.
>> 
>> I can't reply to this original thread, but I am trying to follow Carson's suggestion for a standard build using this protocol instead now:
>> "One note I'd like to make, is that doing a second round with keep_preds=1 is the wrong procedure (only do that if you really want to keep everything - i.e. in some fungi or oomycetes). Rather you should use InterProScan to evaluate the rejected models in the non-overlapping.abinit.proteins.fasta file, then grep the ones that have an IPR domain out of the GFF3 (will be match/match_part features) and then pass them to pred_gff in a separate run (just updates the format to gene/mRNA/exon/CDS with proper reading frame). You can then merge the resulting GFF3's and fasta files."
>> 
>> Instead of doing a second round of annotations with keep_preds=1, I am using my original annotations with keep_preds=0. I have used InterProScan on the non-overlapping.abinit.proteins.fasta. I am unclear as to what gff3 file to use to grep for genes with IPR domains from the non-overlapping.abinit.proteins.fasta file. Genes from the non-overlapping.abinit.proteins.fasta file are not in my .all.gff file created by the gff3_merge script. 
>> 
>> What gff3 file should I be using to resurrect proteins with IPR domains from the non-overlapping.abinit.proteins.fasta? Should I be doing an annotation with keep_preds=1 as well, and resurrecting genes with IPR domains from this gff3?
>> 
>> Thanks.
>> 
>> -Valerie
>> 
>> Valerie Soza, Ph.D.
>> c/o Hall Lab
>> Department of Biology
>> University of Washington
>> Johnson Hall 202A
>> Box 351800
>> Seattle, WA 98195-1800
>> 206-543-6740
>> http://staff.washington.edu/vsoza/
>> 
>> 
> 

Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/


From vsoza at uw.edu  Tue Mar 27 11:50:38 2018
From: vsoza at uw.edu (Valerie Soza)
Date: Tue, 27 Mar 2018 09:50:38 -0700
Subject: [maker-devel] how to output masked genome from MAKER
In-Reply-To: <15BF15D1-835F-4E8E-9B52-A958EF19BCD9@gmail.com>
References: <FC69DF53-A058-4B3E-A612-BE8FC4652857@uw.edu>
	<15BF15D1-835F-4E8E-9B52-A958EF19BCD9@gmail.com>
Message-ID: <7264C880-3806-443D-9CC2-E70D4366CD8A@uw.edu>

Hi Carson

Thanks, that is simple and it worked.

I did the following to sort and concatenate the query.masked.fasta files into one fasta:

$ find Rwill7.maker.output -name 'query.masked.fasta' | sort -t "/" -k 5 | xargs cat > Rwill7.maker.assembly_masked.sorted.fasta
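
A slightly more defensive variant of the same idea, as a sketch rather than a tested recipe: it passes NUL-delimited paths so that directory names with spaces survive, and it prints the number of FASTA records in the merged file so the count can be compared with the input assembly. The function name collect_masked is made up for illustration, and the -print0, -z and -0 options are assumed to be available (they are in the GNU tools). Note that it sorts by full path, not by the fifth path field as in the command above.

```shell
# Sketch only: collect_masked is an illustrative helper, not a MAKER script.
# Assumes find/sort/xargs accept NUL-delimited paths (-print0, -z, -0).
collect_masked() {
    maker_dir=$1
    out_fasta=$2
    # NUL delimiters keep paths with spaces intact; sort gives a stable order.
    find "$maker_dir" -type f -name 'query.masked.fasta' -print0 \
        | sort -z \
        | xargs -0 cat > "$out_fasta"
    # Report how many FASTA records ended up in the merged file.
    grep -c '^>' "$out_fasta"
}
```

Usage would be something like: collect_masked Rwill7.maker.output Rwill7.maker.assembly_masked.sorted.fasta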

-Valerie

> On Mar 15, 2018, at 8:31 AM, Carson Holt <carsonhh at gmail.com> wrote:
> 
> You will just have to find and concatenate the files yourself.
> 
> Something like: find assembly.maker.output -name 'query.masked.fasta' | xargs cat > assembly_masked.fasta
> 
> --Carson
> 
> 
>> On Mar 7, 2018, at 2:19 PM, Valerie Soza <vsoza at uw.edu> wrote:
>> 
>> Hi MAKER community
>> 
>> I am wondering whether it is possible to get the entire masked genome generated by MAKER as an output. I think I have found it in pieces in the query.masked.fasta files generated for each scaffold by MAKER in theVoid directories. Is there a script that anyone has used to collate these files?
>> 
>> I am aware of the fasta_merge script that comes bundled with MAKER, but it does not seem to collate the query.masked.fasta files. Could this perl script be modified to do this action?
>> 
>> Thanks for any help or insights.
>> 
>> -Valerie
>> 
>> Valerie Soza, Ph.D.
>> c/o Hall Lab
>> Department of Biology
>> University of Washington
>> Johnson Hall 202A
>> Box 351800
>> Seattle, WA 98195-1800
>> 206-543-6740
>> http://staff.washington.edu/vsoza/
>> 
>> 
> 

Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/


From vsoza at uw.edu  Thu Mar 29 13:42:28 2018
From: vsoza at uw.edu (Valerie Soza)
Date: Thu, 29 Mar 2018 11:42:28 -0700
Subject: [maker-devel] clarification on creating a standard build
In-Reply-To: <57B30565-1603-4723-AF74-FEB54F735899@uw.edu>
References: <ADACE3F2-BD4C-43B9-B887-D25FC51F562B@uw.edu>
	<4D9CD374-D142-4387-A4D7-1FB7A8A74F48@gmail.com>
	<57B30565-1603-4723-AF74-FEB54F735899@uw.edu>
Message-ID: <624D4D7C-2BB3-4A1D-9088-116807492D2E@uw.edu>

Hi MAKER community,

I was having issues grepping IDs from the InterProScan .tsv file against my all.gff file because the abinit genes in the all.gff file are called processed genes in the .all.maker.non_overlapping_ab_initio.proteins.fasta file, which gets propagated into the .tsv file.

I solved this issue by replacing "processed" with "abinit" in the .tsv file and then grepping the all.gff file with these IDs to create pred_gff. 

sed 's/-processed-/-abinit-/g' Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv > Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed

Then I extracted only the IDs from the .tsv file to grep against the all.gff file.

cut -f 1 Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed > Rwill7.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs

I was getting non-unique top level ID errors when I tried to run maker with pred_gff, and I realized that IDs were duplicated in the .tsv file. So I went back to my list of IDs and extracted only the unique IDs to grep.

sort Rwill7.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs | uniq > Rwill7.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs  

Then I redid my grepping and maker ran beautifully. I tried both of the options recommended by Carson, but wound up going with option #2 because I am lazy, and now I have a standard build :)
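
For anyone following along, the steps above (rename, take column 1, deduplicate, grep) can be sketched as one helper. This is a minimal sketch under stated assumptions, not the exact commands that were run: the function name build_pred_gff and the output file name are made up, and the match is a plain fixed-string grep, so an ID that is a prefix of a longer ID (for example one ending in -mRNA-1 versus -mRNA-10) would also pull in the longer one.

```shell
# Sketch only: build_pred_gff and its arguments are illustrative names.
build_pred_gff() {
    tsv=$1
    gff=$2
    out=$3
    ids=$(mktemp) || return 1
    # Column 1 of the InterProScan TSV is the protein ID. Rename
    # "-processed-" to "-abinit-" and keep each ID only once.
    cut -f 1 "$tsv" | sed 's/-processed-/-abinit-/' | sort -u > "$ids"
    # -F treats the IDs as fixed strings, so the dots are not regex wildcards.
    grep -F -f "$ids" "$gff" > "$out"
    status=$?
    rm -f "$ids"
    return "$status"
}
```

Usage would be something like: build_pred_gff Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv Rwill7.all.gff Rwill7.pred.gff, with the resulting file given to pred_gff.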

-Valerie


Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/


From flopezo84 at gmail.com  Fri Mar  9 09:15:39 2018
From: flopezo84 at gmail.com (=?UTF-8?Q?Federico_L=C3=B3pez?=)
Date: Fri, 9 Mar 2018 11:15:39 -0500
Subject: [maker-devel] Using PASA assemblies with MAKER
Message-ID: <CAEW5o_M2KdOKnkB0KpVhDSTvMTBoqdGiimPpVHBoUCFWhbU+JA@mail.gmail.com>

Hello,

I was wondering what might be the recommended option for using PASA2
alignment assemblies with MAKER3:

1. PASA assemblies in FASTA format (est)
2. PASA assembly structures (est_gff)
3. ORFs from PASA assemblies (protein)

And related to this question, when I use the PASA2 assembly structures in
GFF3 format, MAKER reports the error below.

"ERROR: Non-unique top level ID for..."

I suppose all the non-unique IDs need to be renamed for MAKER?
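
One way to see which IDs are involved before renaming anything is sketched below. It assumes that "top level" means a feature line that has an ID but no Parent attribute; the function name duplicate_top_level_ids is made up for illustration and is not part of MAKER or PASA.

```shell
# Sketch only: lists ID values shared by more than one parentless feature line.
duplicate_top_level_ids() {
    awk -F '\t' '
        /^#/ || NF != 9 { next }      # skip comments and non-feature lines
        {
            id = ""
            has_parent = 0
            n = split($9, attr, ";")
            for (i = 1; i <= n; i++) {
                if (attr[i] ~ /^Parent=/) has_parent = 1
                else if (attr[i] ~ /^ID=/) id = substr(attr[i], 4)
            }
            # Only features without a Parent count as top level here.
            if (id != "" && !has_parent) print id
        }
    ' "$1" | sort | uniq -d
}
```

Running it on the GFF3 file prints each repeated top-level ID once; an empty result would suggest the duplicates are elsewhere.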

Any help is greatly appreciated.

From wangzhennan at ioz.ac.cn  Tue Mar 13 21:53:44 2018
From: wangzhennan at ioz.ac.cn (wangzhennan at ioz.ac.cn)
Date: Wed, 14 Mar 2018 11:53:44 +0800 (GMT+08:00)
Subject: [maker-devel] Some transcripts have no AED?
Message-ID: <59a852e2.418b6.16222a46d59.Coremail.wangzhennan@ioz.ac.cn>

Hi,

   When I ran MAKER to obtain AED values, some transcripts had no AED value in the result. Why? I ran MAKER with RNA-Seq and protein evidence. Thank you.

   Best wishes.


   Wang


From d.ence at ufl.edu  Wed Mar 14 05:33:01 2018
From: d.ence at ufl.edu (Ence,daniel)
Date: Wed, 14 Mar 2018 11:33:01 +0000
Subject: [maker-devel] Some transcripts have no AED?
In-Reply-To: <59a852e2.418b6.16222a46d59.Coremail.wangzhennan@ioz.ac.cn>
References: <59a852e2.418b6.16222a46d59.Coremail.wangzhennan@ioz.ac.cn>
Message-ID: <00243144-1906-485F-B2CB-977FA2BBD161@ufl.edu>

Hi, can you send a few example lines? Do some of the transcripts have AEDs?

~Daniel


From seoanezonjic at hotmail.com  Wed Mar 14 07:52:12 2018
From: seoanezonjic at hotmail.com (p sz)
Date: Wed, 14 Mar 2018 13:52:12 +0000
Subject: [maker-devel] Problems with failed contigs
In-Reply-To: <DB6PR0102MB27095AC27BB34FF5B0D7C1DED1D90@DB6PR0102MB2709.eurprd01.prod.exchangelabs.com>
References: <DB6PR0102MB27095AC27BB34FF5B0D7C1DED1D90@DB6PR0102MB2709.eurprd01.prod.exchangelabs.com>
Message-ID: <DB6PR0102MB2709EA5CAB5E8F5B46FDA541D1D10@DB6PR0102MB2709.eurprd01.prod.exchangelabs.com>

Hi
I have tried splitting the FASTA file into small chunks and running a separate execution for each chunk. Most of the executions fail as I described previously. I inspected the results of a random chunk with 287 contigs, and these error lines are repeated 47 times:

substr outside of string at /mnt/home/users/pab_001_uma/pedro/software/maker/bin/../lib/PhatHit_utils.pm line 850.
--> rank=15, hostname=dx095
ERROR: Failed while annotating transcripts
ERROR: Chunk failed at level:1, tier_type:4
FAILED CONTIG:Sosen1_s1284
ERROR: Chunk failed at level:6, tier_type:0
FAILED CONTIG:Sosen1_s1284

Can you help me to fix this problem?
Thank you in advance
Pedro Seoane


From vsoza at uw.edu  Wed Mar 14 18:21:26 2018
From: vsoza at uw.edu (Valerie Soza)
Date: Wed, 14 Mar 2018 17:21:26 -0700
Subject: [maker-devel] scaffolds missing from master_datastore_index.log and
 all.gff files
Message-ID: <6F2917B6-12D5-41D6-BD12-CDD7FDF0F5C8@uw.edu>

Hi MAKER community

I have done several rounds of training and annotations on an assembly that consists of 12027 scaffolds using the MAKER2 pipeline. I am running multiple instances of MAKER to speed up the process. I have noticed that the number of contigs (aka scaffolds) differs among the different rounds of annotations I have done in MAKER, ranging from 12024 to 12027 scaffolds. These counts were obtained with the SOBAcl tool to count the number of contigs from each all.gff file generated by the gff3_merge script included in MAKER.

In my latest round of annotations within MAKER, I have only obtained 12026 scaffolds using the SOBAcl tool on the all.gff file, indicating that I am missing 1 scaffold, even though there was no indication of any scaffolds as FAILED, RETRY, or SKIPPED.

To figure out what might be going on, I searched for STARTED and FINISHED scaffolds in the master_datastore_index.log and found that I had a different number of started vs. finished scaffolds, and none of these were equal to the total in the assembly of 12027.

$ grep STARTED Rwill7_master_datastore_index.log | sort | uniq | cut -f 1 | wc 
12024   12024  313247

3 started scaffolds missing from this file are LG01_ordered_scaffold_2, LG01_ordered_scaffold_3, and LG07_unordered_scaffold_86.

$ grep FINISHED Rwill7_master_datastore_index.log | sort | uniq | cut -f 1 | wc
12026   12026  313295

1 finished scaffold missing from this file is LG08_unordered_scaffold_90.
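
A minimal sketch of one way to list the affected scaffolds by name rather than only counting them. It assumes the index log is tab-separated with the scaffold name in column 1 and the status in column 3; the function names are hypothetical and are not part of MAKER.

```shell
# List the unique scaffold IDs that carry a given status in the index log.
# Assumption: tab-separated columns, scaffold ID in column 1, status in column 3.
ids_with_status() {
    awk -F '\t' -v status="$1" '$3 == status { print $1 }' "$2" | sort -u
}

# Print scaffolds recorded as STARTED but never as FINISHED.
started_not_finished() {
    started=$(mktemp) && finished=$(mktemp) || return 1
    ids_with_status STARTED  "$1" > "$started"
    ids_with_status FINISHED "$1" > "$finished"
    comm -23 "$started" "$finished"   # lines that appear only in the first file
    rm -f "$started" "$finished"
}
```

For example, `started_not_finished Rwill7_master_datastore_index.log` would print one scaffold name per line. Using `comm -13` in place of `comm -23` gives the reverse case, scaffolds marked FINISHED with no STARTED entry.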

I then searched for these scaffolds in the all.gff file and found that the 3 missing started scaffolds were present, but the one missing finished scaffold (LG08_unordered_scaffold_90) was not. This scaffold (LG08_unordered_scaffold_90) should be in the gff3 as it had some repeat masking done on it as indicated by the query.masked.fasta file for this scaffold in its theVoid directory. 

After looking at the gff3_merge and fasta_merge scripts, it seems that only finished scaffolds are used to generate gff3 and fasta files so this explains why I am missing one scaffold (LG08_unordered_scaffold_90) for a total of only 12026 scaffolds in the all.gff file.

I am concerned that because the started and finished scaffolds are different in the master_datastore_index.log, that not all scaffolds are being output to the gff3 and fasta files generated by the MAKER scripts.

Any insights as to why I am getting different numbers of scaffolds indicated as started versus finished, and as to why all but 1 scaffold finished?

Thanks.

-Valerie

Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/


From d.ence at ufl.edu  Thu Mar 15 07:15:00 2018
From: d.ence at ufl.edu (Ence,daniel)
Date: Thu, 15 Mar 2018 13:15:00 +0000
Subject: [maker-devel] Fwd:  Some transcripts have no AED?
References: <8E389D32-28DE-4FE7-A455-15786C1B2EAF@mail.ufl.edu>
Message-ID: <8AE0B9E6-2496-43AE-BC79-9FDDD2A0CFD3@mail.ufl.edu>


Begin forwarded message:

From: "Ence,daniel" <d.ence at ufl.edu>
Subject: Re: [maker-devel] Some transcripts have no AED?
Date: March 15, 2018 at 9:06:45 AM EDT
To: wangzhennan at ioz.ac.cn
Cc: "Ence,daniel" <d.ence at ufl.edu>

Hi, I really can't help you without more information to answer either of your questions (number of genes or the transcripts without AEDs). If you ran maker with some evidence to annotate those 23000 genes, then some of those genes were probably not supported by the evidence. Did you give those 23000 genes as predictions or as models? Those are two different options in maker and would give different results.

~Daniel


On Mar 15, 2018, at 4:11 AM, wangzhennan at ioz.ac.cn wrote:


Hi,

   I am sorry that I did not explain my question clearly. I ran MAKER with an annotation file (GFF) containing 23000 genes, but the result has only 21553 genes. Why are there fewer genes than in the original file? Thank you very much!

   Best wishes.


   Wang



From carsonhh at gmail.com  Thu Mar 15 08:57:37 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 15 Mar 2018 08:57:37 -0600
Subject: [maker-devel] Problems with failed contigs
In-Reply-To: <DB6PR0102MB27095AC27BB34FF5B0D7C1DED1D90@DB6PR0102MB2709.eurprd01.prod.exchangelabs.com>
References: <DB6PR0102MB27095AC27BB34FF5B0D7C1DED1D90@DB6PR0102MB2709.eurprd01.prod.exchangelabs.com>
Message-ID: <2F06777F-4BED-4CC8-B4F6-0B78D3EA9405@gmail.com>

First make sure you are using the most current version (3.01.02).

If you are providing a GFF3 file, the problem may lie with one or more of the features you are giving. Finally, there is a possibility this is a BLAST report issue, as there is a BLAST truncation bug that appears to be fixed in NCBI BLAST 2.7.1+.

-Carson


From carsonhh at gmail.com  Thu Mar 15 09:15:09 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 15 Mar 2018 09:15:09 -0600
Subject: [maker-devel] Using PASA assemblies with MAKER
In-Reply-To: <CAEW5o_M2KdOKnkB0KpVhDSTvMTBoqdGiimPpVHBoUCFWhbU+JA@mail.gmail.com>
References: <CAEW5o_M2KdOKnkB0KpVhDSTvMTBoqdGiimPpVHBoUCFWhbU+JA@mail.gmail.com>
Message-ID: <8542B83E-159A-47CA-A801-364386E0E2D2@gmail.com>

MAKER requires the two-level match/match_part format, an example of which can be found in the GFF3 specification --> https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md

I haven't used PASA, but the file you are providing may be in GFF2 or GTF format, which is not compatible with GFF3.
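
As a rough aid for tracking down the "Non-unique top level ID" error, the sketch below lists IDs that occur more than once among features with no Parent attribute. Treating "top level" as "no Parent attribute" is an assumption about what MAKER checks, and the function name is hypothetical.

```shell
# Report IDs that occur more than once among features with no Parent attribute
# (assumed here to be what MAKER means by "top level").
duplicate_top_level_ids() {
    awk -F '\t' '
        /^#/ || NF < 9       { next }   # skip comments and malformed lines
        $9 ~ /(^|;)Parent=/  { next }   # skip child features such as match_part
        {
            n = split($9, attrs, ";")
            for (i = 1; i <= n; i++)
                if (attrs[i] ~ /^ID=/) print substr(attrs[i], 4)
        }
    ' "$1" | sort | uniq -d
}
```

Running `duplicate_top_level_ids pasa_assemblies.gff3` (file name made up for illustration) prints each repeated ID once; an empty result means no duplicates were found under this definition.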

-Carson



From carsonhh at gmail.com  Thu Mar 15 09:20:08 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 15 Mar 2018 09:20:08 -0600
Subject: [maker-devel] Some transcripts have no AED?
In-Reply-To: <59a852e2.418b6.16222a46d59.Coremail.wangzhennan@ioz.ac.cn>
References: <59a852e2.418b6.16222a46d59.Coremail.wangzhennan@ioz.ac.cn>
Message-ID: <B9C2C216-B995-4478-91EE-3DBDF7A7F112@gmail.com>

Do they have no AED, or an AED value of 0? A value of 0 means a perfect match to the evidence.

Also make sure you are not looking for AED in the predictions (i.e. GFF3 source column snap/augustus that are of type match/match_part). Those are rejected models and will not have quality statistics calculated for them (they are only there as alignments for reference purposes).

Only models with the source column of 'maker' and a type of 'gene/mRNA/exon/CDS' represent the gene models and will have AED and QI values.

You can force MAKER to also calculate quality statistics for rejected models by setting pred_stats to 1 in the maker_opts.ctl file.

pred_stats=1 #report AED and QI statistics for all predictions as well as models
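
To check which transcripts actually carry an AED, something along these lines may help. It is only a sketch: it assumes the AED is stored as an `_AED=` attribute in column 9 of the mRNA features, and the function name is hypothetical.

```shell
# Print "transcript ID <TAB> AED" for every mRNA feature whose source is maker.
# Assumption: the AED is stored as an _AED= attribute in GFF3 column 9.
maker_mrna_aed() {
    awk -F '\t' '$2 == "maker" && $3 == "mRNA" {
        id = "NA"; aed = "NA"               # NA marks a missing attribute
        n = split($9, attrs, ";")
        for (i = 1; i <= n; i++) {
            if (attrs[i] ~ /^ID=/)   id  = substr(attrs[i], 4)
            if (attrs[i] ~ /^_AED=/) aed = substr(attrs[i], 6)
        }
        print id "\t" aed
    }' "$1"
}
```

Piping the result through `awk -F '\t' '$2 == "NA"'` would show only the maker transcripts with no AED attribute; features from sources such as snap or augustus are ignored on purpose.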

-Carson


From carsonhh at gmail.com  Thu Mar 15 09:26:26 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 15 Mar 2018 09:26:26 -0600
Subject: [maker-devel] scaffolds missing from master_datastore_index.log
 and all.gff files
In-Reply-To: <6F2917B6-12D5-41D6-BD12-CDD7FDF0F5C8@uw.edu>
References: <6F2917B6-12D5-41D6-BD12-CDD7FDF0F5C8@uw.edu>
Message-ID: <A63D1D3B-F5D7-4F19-B5CD-B8D941978586@gmail.com>

If you run multiple MAKER jobs at the same time, you can hit a race condition where one MAKER run keeps another from making the correct entry in the datastore log.

You can delete the log and then run 'maker -dsindex' to rebuild it with a single maker process (takes less than 5 minutes).

-Carson




From carsonhh at gmail.com  Thu Mar 15 09:31:31 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 15 Mar 2018 09:31:31 -0600
Subject: [maker-devel] how to output masked genome from MAKER
In-Reply-To: <FC69DF53-A058-4B3E-A612-BE8FC4652857@uw.edu>
References: <FC69DF53-A058-4B3E-A612-BE8FC4652857@uw.edu>
Message-ID: <15BF15D1-835F-4E8E-9B52-A958EF19BCD9@gmail.com>

You will just have to find and concatenate the files yourself.

Something like --> find assembly.maker.output -name 'query.masked.fasta' | xargs cat > assembly_masked.fasta
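
A slightly more defensive variant of the same idea, sketched here with hypothetical function names: it lets find call cat directly, so path names with unusual characters are not split, and adds a record count as a sanity check. The directory name is the one from the command above.

```shell
# Concatenate every per-scaffold masked FASTA found under a MAKER output directory.
collate_masked() {
    find "$1" -type f -name 'query.masked.fasta' -exec cat {} +
}

# Count the FASTA records (header lines) in a file.
count_records() {
    grep -c '^>' "$1"
}
```

Usage would be `collate_masked assembly.maker.output > assembly_masked.fasta`, followed by `count_records assembly_masked.fasta`; the count can then be compared with the number of scaffolds in the assembly.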

-Carson




From vsoza at uw.edu  Thu Mar 15 12:18:46 2018
From: vsoza at uw.edu (Valerie Soza)
Date: Thu, 15 Mar 2018 11:18:46 -0700
Subject: [maker-devel] scaffolds missing from master_datastore_index.log
 and all.gff files
In-Reply-To: <A63D1D3B-F5D7-4F19-B5CD-B8D941978586@gmail.com>
References: <6F2917B6-12D5-41D6-BD12-CDD7FDF0F5C8@uw.edu>
	<A63D1D3B-F5D7-4F19-B5CD-B8D941978586@gmail.com>
Message-ID: <53B85802-8AB9-4DB4-AA8A-137DD72AF4D7@uw.edu>

Thanks, Carson, that worked. I am now getting all scaffolds in the assembly indicated as finished in the master_datastore_index.log and when I redo gff3_merge with this updated log, my all.gff file is complete too. Glad there is a quick fix for this in MAKER. Thank you, MAKER developers.

-Valerie

Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/


From seoanezonjic at hotmail.com  Fri Mar 16 03:33:28 2018
From: seoanezonjic at hotmail.com (p sz)
Date: Fri, 16 Mar 2018 09:33:28 +0000
Subject: [maker-devel] Problems with failed contigs
In-Reply-To: <2F06777F-4BED-4CC8-B4F6-0B78D3EA9405@gmail.com>
References: <DB6PR0102MB27095AC27BB34FF5B0D7C1DED1D90@DB6PR0102MB2709.eurprd01.prod.exchangelabs.com>,
	<2F06777F-4BED-4CC8-B4F6-0B78D3EA9405@gmail.com>
Message-ID: <DB6PR0102MB270989C66E53B92567ED2226D1D00@DB6PR0102MB2709.eurprd01.prod.exchangelabs.com>

I checked the MAKER version and it is 3.01.02, as you suggest. I am currently running MAKER to train SNAP (first round), so I am not using GFF files as input. My input files are all in FASTA format, set in the genome, est and protein variables. In fact, one error line says that the transcript analysis breaks the execution. I use BLAST version 2.2.30+; maybe I have to use a newer version. Which version do you suggest?
Thank you in advance

From vsoza at uw.edu  Tue Mar 20 18:48:09 2018
From: vsoza at uw.edu (Valerie Soza)
Date: Tue, 20 Mar 2018 17:48:09 -0700
Subject: [maker-devel] clarification on creating a standard build
Message-ID: <ADACE3F2-BD4C-43B9-B887-D25FC51F562B@uw.edu>

Hi MAKER community

I am trying to create a standard build as indicated in the Campbell et al. 2014 papers in Plant Physiology and Current Protocols in Bioinformatics. I was following the protocol as outlined in Current Protocols in Bioinformatics, but then came across this thread in the MAKER google forum: https://groups.google.com/forum/#!searchin/maker-devel/quality_filter%7Csort:date/maker-devel/97aNJkT3bgk/mpL7V5QWAAAJ.

I can't reply to this original thread, but I am trying to follow Carson's suggestion for a standard build using this protocol instead now:
"One note I'd like to make, is that doing a second round with keep_preds=1 is the wrong procedure (only do that if you really want to keep everything - i.e. in some fungi or oomycetes). Rather you should use InterProScan to evaluate the rejected models in the non-overlapping.abinit.proteins.fasta file, then grep the ones that have an IPR domain out of the GFF3 (will be match/match_part features) and then pass them to pred_gff in a separate run (just updates the format to gene/mRNA/exon/CDS with proper reading frame). You can then merge the resulting GFF3's and fasta files."

Instead of doing a second round of annotations with keep_preds=1, I am using my original annotations with keep_preds=0. I have used InterProScan on the non-overlapping.abinit.proteins.fasta. I am unclear as to what gff3 file to use to grep for genes with IPR domains from the non-overlapping.abinit.proteins.fasta file. Genes from the non-overlapping.abinit.proteins.fasta file are not in my .all.gff file created by the gff3_merge script. 

What gff3 file should I be using to resurrect proteins with IPR domains from the non-overlapping.abinit.proteins.fasta? Should I be doing an annotation with keep_preds=1 as well, and resurrecting genes with IPR domains from this gff3?

Thanks.

-Valerie

Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/


From urmi208 at gmail.com  Wed Mar 21 03:05:42 2018
From: urmi208 at gmail.com (Urmi)
Date: Wed, 21 Mar 2018 09:05:42 +0000
Subject: [maker-devel] Gene loss in subsequent round of maker for fungal
 genome annotation
Message-ID: <CAGe_+EuV7r9wcw76B9STOFwzU3d5aifDP-Zp=KOZd7vZskZNfw@mail.gmail.com>

Hello maker community,

I am trying to run maker 3.01.02-beta on a fungal genome. I am using
available EST and protein sequences from a different strain of the same
species using parameters "est" and "protein" in the maker_opts.ctl file.
Here is the protocol I am using:

   1. Run maker with repeat masking and providing transcript and protein
   sequences from related species (Run A)
   2. Create SNAP model with CEGMA
   3. Train Augustus with BUSCO
   4. Run maker (run B) with the new SNAP model (from step 2) and the
   Augustus species, with est2genome=0 and protein2genome=0, and provide
   the run A alignments as GFF (altest_gff=runA_cdna2genome.gff,
   protein_gff=runA_protein2genome.gff3)
   5. Create SNAP model from run B.
   6. Train Augustus with transcripts from run B and BUSCO
   7. Run maker (run C) with the new SNAP model (from step 5) and the
   Augustus species, with est2genome=0 and protein2genome=0, provide
   the run A alignments as GFF (altest_gff=runA_cdna2genome.gff,
   protein_gff=runA_protein2genome.gff3), and set keep_preds=1

As a result of this, I get the following gene numbers:

   - run A: 12796 total genes, of which 12771 have AED < 0.5
   - run B: 10713 total genes, of which 10701 have AED < 0.5
   - run C: 12651 total genes, of which 12582 have AED < 0.5
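
For anyone wanting to reproduce counts like these, here is a minimal sketch in Python. It assumes a merged MAKER GFF3 with tab-separated columns, that final models carry the source `maker`, and that AED is stored in the `_AED` attribute of each mRNA (as in the example records in this message). The function name, and the choice to score a gene by its lowest mRNA AED, are my own assumptions rather than anything MAKER defines.

```python
def count_genes_by_aed(gff_lines, threshold=0.5):
    """Return (total genes, genes whose best mRNA has _AED < threshold)."""
    genes = set()
    best_aed = {}  # gene ID -> lowest _AED seen among its mRNAs
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # sequence section follows; no more features
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[1] != "maker":
            continue  # skip evidence alignments and malformed lines
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        if cols[2] == "gene":
            genes.add(attrs["ID"])
        elif cols[2] == "mRNA" and "_AED" in attrs:
            gene_id = attrs["Parent"]
            aed = float(attrs["_AED"])
            best_aed[gene_id] = min(aed, best_aed.get(gene_id, aed))
    passing = sum(1 for g in genes if g in best_aed and best_aed[g] < threshold)
    return len(genes), passing


# Hypothetical records modelled on the Run A gene shown in this message.
sample = [
    "##gff-version 3\n",
    "contig1\tmaker\tgene\t20468\t21193\t.\t+\t.\tID=g1;Name=g1\n",
    "contig1\tmaker\tmRNA\t20468\t21193\t100\t+\t.\tID=g1-mRNA-1;Parent=g1;_AED=0.30\n",
    "contig1\tblastn\tmatch_part\t20468\t21193\t726\t+\t.\tID=x;Parent=y\n",
]
print(count_genes_by_aed(sample))
```

On a real file the same function can be fed an open file handle, e.g. `count_genes_by_aed(open("runA.all.gff"))`, where the file name is a placeholder.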

Looking at the gff files in detail, I observe that some gene models
present in run A are lost in run B and regained in run C. I don't
understand why there is gene loss in run B. Here is an example:

*RunA*

contig1 maker   gene    20468   21193   .       +       .       ID=maker-contig1-exonerate_protein2genome-gene-0.34;Name=maker-contig1-exonerate_protein2genome-gene-0.34
contig1 maker   mRNA    20468   21193   100     +       .       ID=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1;Parent=maker-contig1-exonerate_protein2genome-gene-0.34;Name=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1;_AED=0.30;_eAED=0.30;_QI=0|-1|0|1|-1|0|1|0|241
contig1 maker   exon    20468   21193   .       +       .       ID=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1:1;Parent=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1
contig1 maker   CDS     20468   21193   .       +       0       ID=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1:cds;Parent=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1
contig1 blastn  expressed_sequence_match        20468   21193   726     +       .       ID=contig1:hit:983:3.2.0.0;Name=jgi|test_1|140804|est target_length=726
contig1 blastn  match_part      20468   21193   726     +       .       ID=contig1:hsp:998:3.2.0.0;Parent=contig1:hit:983:3.2.0.0;Target=jgi|test_1|140804|est 1 726 +;Gap=M726
contig1 est2genome      expressed_sequence_match        20468   21193   3630    +       .       ID=contig1:hit:1022:3.2.0.0;Name=jgi|test_1|140804|est;target_length=726;aligned_coverage=100;aligned_identity=100
contig1 est2genome      match_part      20468   21193   3630    +       .       ID=contig1:hsp:1110:3.2.0.0;Parent=contig1:hit:1022:3.2.0.0;Target=jgi|test_1|140804|est 1 726 +;Gap=M726

*RunB:*

contig1 est_gff:est2genome      expressed_sequence_match        20468   21193   3630    +       .       ID=contig1:hit:1051:3.12.0.0;Name=jgi|test_1|140804|est;target_length=726;aligned_coverage=100;aligned_identity=100;aligned_coverage=100;aligned_identity=100;score=3630;target_length=726
contig1 est_gff:est2genome      match_part      20468   21193   3630    +       .       ID=contig1:hsp:1166:3.12.0.0;Parent=contig1:hit:1051:3.12.0.0;Target=jgi|test_1|140804|est 1 726 +;Gap=M726

*RunC: *

contig1 maker   gene    20468   21193   .       +       .       ID=snap_masked-contig1-processed-gene-0.5;Name=snap_masked-contig1-processed-gene-0.5
contig1 maker   mRNA    20468   21193   .       +       .       ID=snap_masked-contig1-processed-gene-0.5-mRNA-1;Parent=snap_masked-contig1-processed-gene-0.5;Name=snap_masked-contig1-processed-gene-0.5-mRNA-1;_AED=0.30;_eAED=0.30;_QI=0|-1|0|1|-1|1|1|0|241;_merge_warning=1
contig1 maker   exon    20468   21193   .       +       .       ID=snap_masked-contig1-processed-gene-0.5-mRNA-1:1;Parent=snap_masked-contig1-processed-gene-0.5-mRNA-1
contig1 maker   CDS     20468   21193   .       +       0       ID=snap_masked-contig1-processed-gene-0.5-mRNA-1:cds;Parent=snap_masked-contig1-processed-gene-0.5-mRNA-1
contig1 snap_masked     match   20468   21193   42.956  +       .       ID=contig1:hit:5240:4.5.0.0;Name=snap_masked-contig1-abinit-gene-0.5-mRNA-1;target_length=4075195
contig1 snap_masked     match_part      20468   21193   42.956  +       .       ID=contig1:hsp:12911:4.5.0.0;Parent=contig1:hit:5240:4.5.0.0;Target=snap_masked-contig1-abinit-gene-0.5-mRNA-1 1 726 +;Gap=M726
contig1 est_gff:est2genome      expressed_sequence_match        20468   21193   3630    +       .       ID=contig1:hit:1051:3.12.0.0;Name=jgi|test_1|140804|est;target_length=726;aligned_coverage=100;aligned_identity=100;aligned_coverage=100;aligned_identity=100;score=3630;target_length=726
contig1 est_gff:est2genome      match_part      20468   21193   3630    +       .       ID=contig1:hsp:1166:3.12.0.0;Parent=contig1:hit:1051:3.12.0.0;Target=jgi|test_1|140804|est 1 726 +;Gap=M726

Could anyone please shed some light on this?


Many thanks in advance.

Urmi
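
One way to list the loci behind the drop from run A to run B is to compare gene coordinates between the two merged GFF3 files. The following is a rough sketch in Python, assuming tab-separated GFF3, final models with source `maker`, and exact coordinate matches. A model whose boundaries shifted between runs would be reported as lost, so the output is a starting point for browsing rather than a verdict. The function names and sample records are illustrative only.

```python
def gene_loci(gff_lines):
    """Map (seqid, start, end, strand) -> gene ID for MAKER gene features."""
    loci = {}
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # sequence section follows; no more features
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9 and cols[1] == "maker" and cols[2] == "gene":
            attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
            loci[(cols[0], int(cols[3]), int(cols[4]), cols[6])] = attrs.get("ID", "")
    return loci


def lost_loci(earlier_run, later_run):
    """Loci annotated as genes in earlier_run but absent from later_run."""
    earlier = gene_loci(earlier_run)
    later = gene_loci(later_run)
    return sorted((locus, gid) for locus, gid in earlier.items() if locus not in later)


# Hypothetical records modelled on the contig1 example above.
run_a = ["contig1\tmaker\tgene\t20468\t21193\t.\t+\t.\tID=geneA;Name=geneA\n"]
run_b = ["contig1\test_gff:est2genome\tmatch_part\t20468\t21193\t3630\t+\t.\tID=hsp1;Parent=hit1\n"]
for locus, gene_id in lost_loci(run_a, run_b):
    print(locus, gene_id)
```

Each reported locus can then be opened in a genome browser alongside the evidence tracks for both runs.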

From urmi208 at gmail.com  Wed Mar 21 03:24:32 2018
From: urmi208 at gmail.com (Urmi)
Date: Wed, 21 Mar 2018 09:24:32 +0000
Subject: [maker-devel] Gene loss in subsequent round of maker for fungal
 genome annotation
In-Reply-To: <CAGe_+EuV7r9wcw76B9STOFwzU3d5aifDP-Zp=KOZd7vZskZNfw@mail.gmail.com>
References: <CAGe_+EuV7r9wcw76B9STOFwzU3d5aifDP-Zp=KOZd7vZskZNfw@mail.gmail.com>
Message-ID: <CAGe_+EsxV0KWrbYt01RMOzBDSvUBGMmaMu=c4t4ubfWPmsuEyQ@mail.gmail.com>

Further to this, I ran InterProScan on all three runs, and 100% of the
genes in each of them have protein domains. I am unsure which run I
should consider the best annotation. I am sorry for so many questions,
but I am very new to maker.

Thanks again for any help you could provide.



-- 
"The only way of finding the limits of the possible is by going beyond them
into the impossible." - Arthur C. Clarke

From carsonhh at gmail.com  Fri Mar 23 11:20:22 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 23 Mar 2018 11:20:22 -0600
Subject: [maker-devel] clarification on creating a standard build
In-Reply-To: <ADACE3F2-BD4C-43B9-B887-D25FC51F562B@uw.edu>
References: <ADACE3F2-BD4C-43B9-B887-D25FC51F562B@uw.edu>
Message-ID: <4D9CD374-D142-4387-A4D7-1FB7A8A74F48@gmail.com>

You will get the ID from the InterProScan report. You can then take that ID and grep for it in the MAKER GFF3.

All ab initio predictions have match/match_part features created for them in the GFF3. You can then take those non-gene match/match_part features and provide them to pred_gff.

You then have two alternative ways to get those models into your dataset.

1. Do a second run with only that pred_gff (i.e. turn off all other MAKER options and blank out all evidence, including repeat masking options) and set keep_preds=1.

That will simply take the pred_gff match/match_part features and turn them into nicely formatted gene/mRNA/exon/CDS features together with associated fasta files. Those can then simply be merged into your current result using gff3_merge.

2. Provide maker_gff (set pred_pass=0 and all other pass options to 1), provide pred_gff, and set keep_preds=1.

This is the same as the previous option, but MAKER will do the merging for you. It will take longer, though, since it will use the maker_gff to rebuild all models and evidence in memory and rescore everything.

--Carson
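
The grep step described above could be sketched in Python roughly as follows. This is not part of MAKER; the function names and sample records are hypothetical. It assumes the InterProScan TSV was produced with -iprlookup, so that column 12 holds the InterPro accession, and that the GFF3 is tab-separated. One further assumption is flagged loudly: in the Run C excerpt elsewhere in this archive, the ab initio match feature is named "...-abinit-gene-..." while the promoted gene model is named "...-processed-gene-...", so the sketch also tries the "abinit" spelling of each ID. Whether that mapping holds for every ID is something to verify on your own data.

```python
def ids_with_ipr(tsv_lines):
    """Protein IDs with an InterPro accession in column 12 of an InterProScan TSV."""
    ids = set()
    for line in tsv_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 12 and cols[11].startswith("IPR"):
            ids.add(cols[0])
    return ids


def candidate_names(protein_id):
    """The ID as reported, plus an 'abinit' spelling (an assumption, see above)."""
    return {protein_id, protein_id.replace("-processed-gene-", "-abinit-gene-")}


def select_pred_features(gff_lines, protein_ids):
    """Return match/match_part lines whose Name or Target is one of the IDs."""
    wanted = set()
    for pid in protein_ids:
        wanted |= candidate_names(pid)
    kept = []
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] not in ("match", "match_part"):
            continue  # genes, mRNAs, comments and other evidence are ignored
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        name = attrs.get("Name", "")
        target = attrs.get("Target", "").split(" ")[0]
        if name in wanted or target in wanted:
            kept.append(line)
    return kept


# Hypothetical inputs: one protein with an IPR hit, one without.
tsv = [
    "snap_masked-contig1-processed-gene-0.5-mRNA-1\tmd5\t241\tPfam\tPF00001\tdesc\t1\t100\t1e-10\tT\t20-03-2018\tIPR000001\tdesc\n",
    "snap_masked-contig1-processed-gene-0.9-mRNA-1\tmd5\t120\tPfam\tPF00002\tdesc\t1\t50\t1e-05\tT\t20-03-2018\t-\t-\n",
]
gff = [
    "contig1\tsnap_masked\tmatch\t20468\t21193\t42.956\t+\t.\tID=hit1;Name=snap_masked-contig1-abinit-gene-0.5-mRNA-1\n",
    "contig1\tsnap_masked\tmatch_part\t20468\t21193\t42.956\t+\t.\tID=hsp1;Parent=hit1;Target=snap_masked-contig1-abinit-gene-0.5-mRNA-1 1 726 +;Gap=M726\n",
    "contig1\tsnap_masked\tmatch\t30000\t30500\t10.0\t+\t.\tID=hit2;Name=snap_masked-contig1-abinit-gene-0.9-mRNA-1\n",
]
selected = select_pred_features(gff, ids_with_ipr(tsv))
print("".join(selected), end="")
```

The selected lines, written to a file, are what would be handed to pred_gff in either of the two options above.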




From carsonhh at gmail.com  Fri Mar 23 11:28:50 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 23 Mar 2018 11:28:50 -0600
Subject: [maker-devel] Gene loss in subsequent round of maker for fungal
 genome annotation
In-Reply-To: <CAGe_+EuV7r9wcw76B9STOFwzU3d5aifDP-Zp=KOZd7vZskZNfw@mail.gmail.com>
References: <CAGe_+EuV7r9wcw76B9STOFwzU3d5aifDP-Zp=KOZd7vZskZNfw@mail.gmail.com>
Message-ID: <7BA23B5D-01FE-4766-B7B8-756E0385453D@gmail.com>

Run A -> no gene prediction, just cut and paste of transcript/protein alignments to generate rough models.
Run B -> Gene predictions based on training using only a highly conserved subset of genes (you will have low sensitivity).
Run C -> Gene predictions based on training using a broader gene set. Higher sensitivity but potentially lower specificity (sensitivity gains should outweigh any specificity loss).

Finally, make sure you look at models in a browser to see how well evidence and models overlap. If gene fusion is an issue (falsely merged mRNA-seq assembly results will generate hints that can cause gene predictors to fuse gene models), try deFusion -> https://wjidea.github.io/defusion/installation.html

--Carson



From urmi208 at gmail.com  Mon Mar 26 01:28:21 2018
From: urmi208 at gmail.com (Urmi)
Date: Mon, 26 Mar 2018 08:28:21 +0100
Subject: [maker-devel] Gene loss in subsequent round of maker for fungal
 genome annotation
In-Reply-To: <7BA23B5D-01FE-4766-B7B8-756E0385453D@gmail.com>
References: <CAGe_+EuV7r9wcw76B9STOFwzU3d5aifDP-Zp=KOZd7vZskZNfw@mail.gmail.com>
	<7BA23B5D-01FE-4766-B7B8-756E0385453D@gmail.com>
Message-ID: <CAGe_+EuGU0P4OdHR2cxvNSAKQN24FvW3-9YEFv70uNvDZYxVmQ@mail.gmail.com>

That's great! Thanks for the tips Carson.

Urmi


From vsoza at uw.edu  Mon Mar 26 12:49:24 2018
From: vsoza at uw.edu (Valerie Soza)
Date: Mon, 26 Mar 2018 11:49:24 -0700
Subject: [maker-devel] clarification on creating a standard build
In-Reply-To: <4D9CD374-D142-4387-A4D7-1FB7A8A74F48@gmail.com>
References: <ADACE3F2-BD4C-43B9-B887-D25FC51F562B@uw.edu>
	<4D9CD374-D142-4387-A4D7-1FB7A8A74F48@gmail.com>
Message-ID: <57B30565-1603-4723-AF74-FEB54F735899@uw.edu>

Hi Carson

Thanks for the clarification on steps, but I am having issues with the first step of grepping the ID from the InterProScan .tsv file in my .gff file. I am getting no results of these IDs from the .gff file. I think the IDs used in the maker.non_overlapping_ab_initio.proteins.fasta are different from what is actually used in the .gff file. Please see my commands/results below.

I created the .gff file by this command:
gff3_merge -d Rwill7_master_datastore_index.log

I created the .fasta files by this command:
fasta_merge -d Rwill7_master_datastore_index.log

I ran InterProScan with this command:
interproscan.sh -appl PfamA -iprlookup -goterms -f tsv -i Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta

When I try grepping the IDs in the Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv from Rwill7.all.gff, I get nothing. See an example for one ID from the .tsv file below:
 
$ more Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv

snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1	d146190e642a740520c97a782a74fe32	356	Pfam	PF13365	Trypsin-like peptidase domain	77	228	1.4E-17	T	20-03-2018

$ grep snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1 Rwill7.all.gff
#no results

There is no "processed-gene" with this ID in the Rwill7.all.gff file:

$ grep snap_masked-LG12_ordered_scaffold_85-processed-gene Rwill7.all.gff

LG12_ordered_scaffold_85	maker	gene	63727	67053	.	+	.	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8;Name=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8
LG12_ordered_scaffold_85	maker	mRNA	63727	67053	.	+	.	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8;Name=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1;_AED=0.81;_eAED=0.81;_QI=0|0|0|0|1|1|6|0|200
LG12_ordered_scaffold_85	maker	exon	63727	63768	.	+	.	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42245;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
LG12_ordered_scaffold_85	maker	exon	64269	64340	.	+	.	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42246;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
LG12_ordered_scaffold_85	maker	exon	64896	65000	.	+	.	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42247;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
LG12_ordered_scaffold_85	maker	exon	65268	65327	.	+	.	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42248;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
LG12_ordered_scaffold_85	maker	exon	65716	65915	.	+	.	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42249;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
LG12_ordered_scaffold_85	maker	exon	66930	67053	.	+	.	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42250;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
LG12_ordered_scaffold_85	maker	CDS	63727	63768	.	+	0	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
LG12_ordered_scaffold_85	maker	CDS	64269	64340	.	+	0	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
LG12_ordered_scaffold_85	maker	CDS	64896	65000	.	+	0	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
LG12_ordered_scaffold_85	maker	CDS	65268	65327	.	+	0	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
LG12_ordered_scaffold_85	maker	CDS	65716	65915	.	+	0	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
LG12_ordered_scaffold_85	maker	CDS	66930	67053	.	+	1	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1

However, there are Names and Targets called "abinit-gene", not "processed-gene", that appear to have this gene number in the .gff file:

$ grep snap_masked-LG12_ordered_scaffold_85 Rwill7.all.gff

#some results from command above that might be snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1?

LG12_ordered_scaffold_85	snap_masked	match	101798	108141	35.366	+	ID=LG12_ordered_scaffold_85:hit:469677:4.5.0.0;Name=snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1
LG12_ordered_scaffold_85	snap_masked	match_part	101798	102633	35.236	ID=LG12_ordered_scaffold_85:hsp:1369290:4.5.0.0;Parent=LG12_ordered_scaffold_85:hit:469677:4.5.0.0;Target=snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1 1 836 +;Gap=M836
LG12_ordered_scaffold_85	snap_masked	match_part	107907	108141	0.130	ID=LG12_ordered_scaffold_85:hsp:1369291:4.5.0.0;Parent=LG12_ordered_scaffold_85:hit:469677:4.5.0.0;Target=snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1 837 1071 +;Gap=M235

So I looked at the maker.non_overlapping_ab_initio.proteins.fasta file to see if this "abinit-gene" from the .gff file was present:

$ grep snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1 Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta
#no results using the "abinit-gene" Name from the .gff file

versus using the ID from the .tsv file to grep against the maker.non_overlapping_ab_initio.proteins.fasta file:

$ grep snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1 Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta
>snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1 protein AED:1.00 eAED:1.00 QI:0|0|0|0|1|1|2|0|356

I think the issue is that the "abinit-genes" in the .gff file are called "processed-genes" in Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta, and that name is then propagated into the .tsv file. Is my interpretation correct?

If so, all I would have to do is grep the .gff file after replacing "processed" with "abinit", correct?

Thanks for your help.

-Valerie

> On Mar 23, 2018, at 10:20 AM, Carson Holt <carsonhh at gmail.com> wrote:
> 
> You will get the ID from the InterProScan report. Then you can take that ID and grep for it in the MAKER gff3.
> 
> All ab initio predictions have match/match_part features created for them in the gff3. You can then take those non-gene match/match_part features and provide them to pred_gff.
> 
> You then have two alternative ways to get those models into your dataset.
> 
> 1. Do a second run with only that pred_gff (i.e. turn off all other MAKER options and blank out all evidence including repeat masking options) and set keep_preds=1.
> 
> That will simply take the pred_gff match/match_part values and turn them into nicely formatted gene/mRNA/exon/CDS features together with associated fasta files. Those can then simply be merged into your current result using GFF3 merge.
> 
> 2. Provide maker_gff (set pred_pass=0 and all other pass options to 1), provide pred_gff, and set keep_preds=1.
> 
> This is the same as the previous run option, but MAKER will do the merging for you. It will take longer, though, since it will use the maker_gff to rebuild all models and evidence in memory and rescore everything.
> 
> --Carson
> 
> 
> 
>> On Mar 20, 2018, at 6:48 PM, Valerie Soza <vsoza at uw.edu> wrote:
>> 
>> Hi MAKER community
>> 
>> I am trying to create a standard build as indicated in the Campbell et al. 2014 papers in Plant Physiology and Current Protocols in Bioinformatics. I was following the protocol as outlined in Current Protocols in Bioinformatics, but then came across this thread in the MAKER google forum: https://groups.google.com/forum/#!searchin/maker-devel/quality_filter%7Csort:date/maker-devel/97aNJkT3bgk/mpL7V5QWAAAJ.
>> 
>> I can't reply to this original thread, but I am trying to follow Carson's suggestion for a standard build using this protocol instead now:
>> "One note I'd like to make, is that doing a second round with keep_preds=1 is the wrong procedure (only do that if you really want to keep everything - i.e. in some fungi or oomycetes). Rather you should use InterProScan to evaluate the rejected models in the non-overlapping.abinit.proteins.fasta file, then grep the ones that have an IPR domain out of the GFF3 (will be match/match_part features) and then pass them to pred_gff in a separate run (just updates the format to gene/mRNA/exon/CDS with proper reading frame). You can then merge the resulting GFF3's and fasta files."
>> 
>> Instead of doing a second round of annotations with keep_preds=1, I am using my original annotations with keep_preds=0. I have used InterProScan on the non-overlapping.abinit.proteins.fasta. I am unclear as to what gff3 file to use to grep for genes with IPR domains from the non-overlapping.abinit.proteins.fasta file. Genes from the non-overlapping.abinit.proteins.fasta file are not in my .all.gff file created by the gff3_merge script. 
>> 
>> What gff3 file should I be using to resurrect proteins with IPR domains from the non-overlapping.abinit.proteins.fasta? Should I be doing an annotation with keep_preds=1 as well, and resurrecting genes with IPR domains from this gff3?
>> 
>> Thanks.
>> 
>> -Valerie
>> 
>> Valerie Soza, Ph.D.
>> c/o Hall Lab
>> Department of Biology
>> University of Washington
>> Johnson Hall 202A
>> Box 351800
>> Seattle, WA 98195-1800
>> 206-543-6740
>> http://staff.washington.edu/vsoza/
>> 
>> 
> 

Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/


From vsoza at uw.edu  Tue Mar 27 10:50:38 2018
From: vsoza at uw.edu (Valerie Soza)
Date: Tue, 27 Mar 2018 09:50:38 -0700
Subject: [maker-devel] how to output masked genome from MAKER
In-Reply-To: <15BF15D1-835F-4E8E-9B52-A958EF19BCD9@gmail.com>
References: <FC69DF53-A058-4B3E-A612-BE8FC4652857@uw.edu>
	<15BF15D1-835F-4E8E-9B52-A958EF19BCD9@gmail.com>
Message-ID: <7264C880-3806-443D-9CC2-E70D4366CD8A@uw.edu>

Hi Carson

Thanks, that is simple and it worked.

I did the following to sort and concatenate the query.masked.fasta files into one fasta:

$ find Rwill7.maker.output -name 'query.masked.fasta' | sort -t "/" -k 5 | xargs cat > Rwill7.maker.assembly_masked.sorted.fasta
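
A slightly more defensive variant of the same idea, as a sketch only. It assumes GNU find, sort and xargs (for the NUL-delimited options), and the function name collate_masked is just illustrative, not something that ships with MAKER. It sorts on the full path instead of a fixed field, so it does not depend on how deeply the datastore is nested, and it prints the number of files found and the number of FASTA headers written so the two can be compared.

```shell
# collate_masked is an illustrative helper name, not part of MAKER.
# Usage: collate_masked <maker_output_dir> <combined_fasta>
collate_masked() {
    local outdir=$1 combined=$2 nfiles nseqs
    # NUL-delimited names survive spaces or odd characters in paths
    # (GNU find/sort/xargs assumed for -print0, -z, -0 and -r).
    find "$outdir" -type f -name 'query.masked.fasta' -print0 \
        | sort -z \
        | xargs -0 -r cat > "$combined"
    # Count the pieces that were found and the records that were written.
    nfiles=$(find "$outdir" -type f -name 'query.masked.fasta' | wc -l | tr -d ' ')
    nseqs=$(grep -c '^>' "$combined")
    echo "files: $nfiles  sequences: $nseqs"
}
```

Called as `collate_masked Rwill7.maker.output Rwill7.maker.assembly_masked.sorted.fasta`. If each query.masked.fasta holds a single scaffold, the two counts should agree; a mismatch would be worth a closer look.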

-Valerie

> On Mar 15, 2018, at 8:31 AM, Carson Holt <carsonhh at gmail.com> wrote:
> 
> You will just have to find and concatenate the files yourself.
> 
> Something like: find assembly.maker.output -name 'query.masked.fasta' | xargs cat > assembly_masked.fasta
> 
> --Carson
> 
> 
>> On Mar 7, 2018, at 2:19 PM, Valerie Soza <vsoza at uw.edu> wrote:
>> 
>> Hi MAKER community
>> 
>> I am wondering whether it is possible to get the entire masked genome generated by MAKER as an output. I think I have found it in pieces in the query.masked.fasta files generated for each scaffold by MAKER in theVoid directories. Is there a script that anyone has used to collate these files?
>> 
>> I am aware of the fasta_merge script that comes bundled with MAKER, but it does not seem to collate the query.masked.fasta files. Could this perl script be modified to do this action? 
>> 
>> Thanks for any help or insights.
>> 
>> -Valerie
>> 
>> Valerie Soza, Ph.D.
>> c/o Hall Lab
>> Department of Biology
>> University of Washington
>> Johnson Hall 202A
>> Box 351800
>> Seattle, WA 98195-1800
>> 206-543-6740
>> http://staff.washington.edu/vsoza/
>> 
>> 
> 

Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/


From vsoza at uw.edu  Thu Mar 29 12:42:28 2018
From: vsoza at uw.edu (Valerie Soza)
Date: Thu, 29 Mar 2018 11:42:28 -0700
Subject: [maker-devel] clarification on creating a standard build
In-Reply-To: <57B30565-1603-4723-AF74-FEB54F735899@uw.edu>
References: <ADACE3F2-BD4C-43B9-B887-D25FC51F562B@uw.edu>
	<4D9CD374-D142-4387-A4D7-1FB7A8A74F48@gmail.com>
	<57B30565-1603-4723-AF74-FEB54F735899@uw.edu>
Message-ID: <624D4D7C-2BB3-4A1D-9088-116807492D2E@uw.edu>

Hi MAKER community,

I was having issues grepping IDs from the InterProScan .tsv file against my all.gff file because the abinit genes in the all.gff file are called processed genes in the .all.maker.non_overlapping_ab_initio.proteins.fasta file, which gets propagated into the .tsv file.

I solved this issue by replacing "processed" with "abinit" in the .tsv file and then grepping the all.gff file with these IDs to create pred_gff. 

sed 's/-processed-/-abinit-/g' Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv > Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed

Then I extracted only the IDs from the .tsv file to grep against the all.gff file.

cut -f 1 Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed > Rwill7.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs

I was getting non-unique top level ID errors when I tried to run maker with pred_gff, and I realized that IDs were duplicated in the .tsv file. So I went back to my list of IDs and extracted only the unique IDs to grep.

sort Rwill7.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs | uniq > Rwill7.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs  
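
For anyone repeating this, the ID clean-up and the grep against the all.gff file can be sketched as two small functions. This is only a sketch: the names ipr_ids and extract_preds are made up for illustration, and it assumes the all.gff file is tab-delimited with the attributes in column 9. Instead of a plain grep it compares the whole Name= or Target= value, so an ID cannot accidentally match a longer ID that merely starts with the same text.

```shell
# Illustrative helper names; neither function is part of MAKER.

# Usage: ipr_ids <interproscan.tsv>
# Prints the unique protein IDs with "processed" renamed to "abinit".
ipr_ids() {
    cut -f 1 "$1" | sed 's/-processed-/-abinit-/' | sort -u
}

# Usage: extract_preds <ids.txt> <all.gff>
# Prints match/match_part lines whose Name= or Target= is in the ID list.
extract_preds() {
    awk -F'\t' '
        NR == FNR { want[$1] = 1; next }      # first file: the ID list
        $3 == "match" || $3 == "match_part" {
            n = split($9, attr, ";")
            for (i = 1; i <= n; i++) {
                key = ""
                if (attr[i] ~ /^Name=/) key = substr(attr[i], 6)
                if (attr[i] ~ /^Target=/) {
                    key = substr(attr[i], 8)
                    sub(/ .*/, "", key)       # drop "start end strand"
                }
                if (key != "" && (key in want)) { print; break }
            }
        }' "$1" "$2"
}
```

Usage would be along the lines of `ipr_ids Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv > ids.txt` followed by `extract_preds ids.txt Rwill7.all.gff > pred.gff`, where ids.txt and pred.gff are placeholder file names. The ID list must not be empty, otherwise awk treats the GFF as the list.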

Then I redid my grepping and maker ran beautifully. I tried both the options recommended by Carson, but then wound up going with option #2 cause I am lazy and now I have a standard build :)
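
In case it helps, option #2 comes down to a handful of settings in maker_opts.ctl. The sketch below edits them with sed. The helper names set_opt and apply_option2 are made up for illustration, and the option names (maker_gff, the *_pass options, pred_gff, keep_preds) are what I would expect to find in the control file, so please check them against your own copy before relying on this.

```shell
# set_opt and apply_option2 are illustrative helpers, not MAKER tools.

# Usage: set_opt <key> <value> <maker_opts.ctl>
# Rewrites the text between "key=" and any trailing "#comment".
set_opt() {
    local tmp
    tmp=$(mktemp) || return 1
    sed -E "s|^($1)=[^#]*|\1=$2 |" "$3" > "$tmp" && cat "$tmp" > "$3"
    rm -f "$tmp"
}

# Usage: apply_option2 <maker_opts.ctl> <maker_gff> <pred_gff>
# pred_pass=0, every other pass option 1, keep_preds=1.
apply_option2() {
    local ctl=$1 maker_gff=$2 pred_gff=$3 opt
    set_opt maker_gff "$maker_gff" "$ctl"
    for opt in est_pass altest_pass protein_pass rm_pass model_pass other_pass; do
        set_opt "$opt" 1 "$ctl"
    done
    set_opt pred_pass 0 "$ctl"
    set_opt pred_gff "$pred_gff" "$ctl"
    set_opt keep_preds 1 "$ctl"
}
```

For example `apply_option2 maker_opts.ctl Rwill7.all.gff pred.gff`, with pred.gff standing in for whatever the extracted match/match_part file is called. Values containing a `|` character would break the sed expression, and it is worth keeping a copy of the original control file first.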

-Valerie



Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/


From seoanezonjic at hotmail.com  Tue Mar  6 02:30:24 2018
From: seoanezonjic at hotmail.com (p sz)
Date: Tue, 6 Mar 2018 09:30:24 +0000
Subject: [maker-devel] Problems with failed contigs
Message-ID: <DB6PR0102MB27095AC27BB34FF5B0D7C1DED1D90@DB6PR0102MB2709.eurprd01.prod.exchangelabs.com>

Hi
Currently I'm annotating a fish genome using MAKER 3, and the run does not finish properly. I run MAKER on a shared cluster through MPI, asking for 96 cores (in order not to block the MAKER manager or the file system) on 6 nodes of 16 cores each. The job runs for 3 or 4 days and then the output suddenly stops. The job stays alive for 3 more days, but the datastore index is not modified and the only files whose attributes change are the lock files. When the job exceeds the time limit (7 days) it is cancelled. I then inspected the datastore index and counted the unique STARTED and FINISHED tags for each scaffold. The results are:
STARTED:3890
FINISHED:3378
So the analysis fails for about 500 scaffolds. When I inspect the MAKER STDERR output I see this error line repeated ~2000 times, in different places:
substr outside of string at /mnt/home/users/pab_001_uma/pedro/software/maker/bin/../lib/PhatHit_utils.pm line 850
and near this line, the following:
ERROR: Failed while annotating transcripts
My questions are: what can I do to annotate the failed scaffolds? And is the large number of failed analyses the reason the job freezes?
Thanks in advance
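
The STARTED/FINISHED comparison can be sketched like this. It assumes the master datastore index is tab-delimited with the scaffold name in column 1 and the status tag in column 3, and the function name unfinished_scaffolds is only illustrative.

```shell
# unfinished_scaffolds is an illustrative helper name, not a MAKER tool.
# Usage: unfinished_scaffolds <master_datastore_index.log>
# Prints scaffolds that have a STARTED entry but no FINISHED entry.
unfinished_scaffolds() {
    awk -F'\t' '
        $3 == "STARTED"  { started[$1]  = 1 }
        $3 == "FINISHED" { finished[$1] = 1 }
        END {
            for (s in started)
                if (!(s in finished)) print s
        }' "$1" | sort
}
```

The resulting names can then be compared with the FAILED CONTIG lines in STDERR to see whether the same scaffolds are involved.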



From vsoza at uw.edu  Wed Mar  7 14:19:15 2018
From: vsoza at uw.edu (Valerie Soza)
Date: Wed, 7 Mar 2018 13:19:15 -0800
Subject: [maker-devel] how to output masked genome from MAKER
Message-ID: <FC69DF53-A058-4B3E-A612-BE8FC4652857@uw.edu>

Hi MAKER community

I am wondering whether it is possible to get the entire masked genome generated by MAKER as an output. I think I have found it in pieces in the query.masked.fasta files generated for each scaffold by MAKER in theVoid directories. Is there a script that anyone has used to collate these files?

I am aware of the fasta_merge script that comes bundled with MAKER, but it does not seem to collate the query.masked.fasta files. Could this perl script be modified to do this action? 

Thanks for any help or insights.

-Valerie

Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/


From flopezo84 at gmail.com  Fri Mar  9 09:15:39 2018
From: flopezo84 at gmail.com (=?UTF-8?Q?Federico_L=C3=B3pez?=)
Date: Fri, 9 Mar 2018 11:15:39 -0500
Subject: [maker-devel] Using PASA assemblies with MAKER
Message-ID: <CAEW5o_M2KdOKnkB0KpVhDSTvMTBoqdGiimPpVHBoUCFWhbU+JA@mail.gmail.com>

Hello,

I was wondering what might be the recommended option for using PASA2
alignment assemblies with MAKER3:

1. PASA assemblies in FASTA format (est)
2. PASA assembly structures (est_gff)
3. ORFs from PASA assemblies (protein)

And related to this question, when I use the PASA2 assembly structures in
GFF3 format, MAKER reports the error below.

"ERROR: Non-unique top level ID for..."

I suppose all the non-unique IDs need to be renamed for MAKER?

Any help is greatly appreciated.
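
To see which IDs are affected, something like the following sketch could be used. It assumes a tab-delimited GFF3 with attributes in column 9 and treats any feature without a Parent= attribute as top level, which may not be exactly how MAKER decides it; the function name duplicate_top_level_ids is made up for illustration.

```shell
# duplicate_top_level_ids is an illustrative helper name.
# Usage: duplicate_top_level_ids <file.gff3>
# Prints "ID<TAB>count" for IDs used by more than one Parent-less feature.
duplicate_top_level_ids() {
    awk -F'\t' '
        /^#/ || NF < 9 { next }               # skip comments and short lines
        $9 !~ /(^|;)Parent=/ {
            n = split($9, attr, ";")
            for (i = 1; i <= n; i++)
                if (attr[i] ~ /^ID=/) count[substr(attr[i], 4)]++
        }
        END {
            for (id in count)
                if (count[id] > 1) print id "\t" count[id]
        }' "$1" | sort
}
```

If, for example, every segment of an alignment carries the same ID, each such ID would show up here with a count above 1.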

From wangzhennan at ioz.ac.cn  Tue Mar 13 21:53:44 2018
From: wangzhennan at ioz.ac.cn (wangzhennan at ioz.ac.cn)
Date: Wed, 14 Mar 2018 11:53:44 +0800 (GMT+08:00)
Subject: [maker-devel] Some transcripts have no AED?
Message-ID: <59a852e2.418b6.16222a46d59.Coremail.wangzhennan@ioz.ac.cn>

Hi,

   When I ran MAKER to get AED values, some transcripts had no AED value in the result. Why? I used RNA-Seq data and proteins to run MAKER. Thank you.

   Best wishes.


   Wang

T

From d.ence at ufl.edu  Wed Mar 14 05:33:01 2018
From: d.ence at ufl.edu (Ence,daniel)
Date: Wed, 14 Mar 2018 11:33:01 +0000
Subject: [maker-devel] Some transcripts have no AED?
In-Reply-To: <59a852e2.418b6.16222a46d59.Coremail.wangzhennan@ioz.ac.cn>
References: <59a852e2.418b6.16222a46d59.Coremail.wangzhennan@ioz.ac.cn>
Message-ID: <00243144-1906-485F-B2CB-977FA2BBD161@ufl.edu>

Hi, can you send a few example lines? Do some transcripts have AEDs?
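
One way to pull out such lines, as a sketch: it assumes a tab-delimited MAKER GFF3 in which the AED is stored as an _AED= attribute on the mRNA features, and the function name mrna_without_aed is only illustrative.

```shell
# mrna_without_aed is an illustrative helper name.
# Usage: mrna_without_aed <all.gff>
# Prints mRNA features whose attribute column has no _AED= tag.
mrna_without_aed() {
    awk -F'\t' '$3 == "mRNA" && $9 !~ /(^|;)_AED=/' "$1"
}
```

Piping the output through `head` gives a few lines that can be pasted into a reply.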

~Daniel


From seoanezonjic at hotmail.com  Wed Mar 14 07:52:12 2018
From: seoanezonjic at hotmail.com (p sz)
Date: Wed, 14 Mar 2018 13:52:12 +0000
Subject: [maker-devel] Problems with failed contigs
In-Reply-To: <DB6PR0102MB27095AC27BB34FF5B0D7C1DED1D90@DB6PR0102MB2709.eurprd01.prod.exchangelabs.com>
References: <DB6PR0102MB27095AC27BB34FF5B0D7C1DED1D90@DB6PR0102MB2709.eurprd01.prod.exchangelabs.com>
Message-ID: <DB6PR0102MB2709EA5CAB5E8F5B46FDA541D1D10@DB6PR0102MB2709.eurprd01.prod.exchangelabs.com>

Hi
I have tried splitting the fasta file into small chunks and running MAKER separately on each chunk. Most of the runs fail as I described previously. I inspected the results of a random chunk with 287 contigs, and these error lines are repeated 47 times:

substr outside of string at /mnt/home/users/pab_001_uma/pedro/software/maker/bin/../lib/PhatHit_utils.pm line 850.
--> rank=15, hostname=dx095
ERROR: Failed while annotating transcripts
ERROR: Chunk failed at level:1, tier_type:4
FAILED CONTIG:Sosen1_s1284
ERROR: Chunk failed at level:6, tier_type:0
FAILED CONTIG:Sosen1_s1284

Can you help me to fix this problem?
Thank you in advance
Pedro Seoane

________________________________
From: p sz <seoanezonjic at hotmail.com>
Sent: Tuesday, March 6, 2018 9:30
To: maker-devel at yandell-lab.org
Subject: Problems with failed contigs

Hi
I am currently annotating a fish genome with MAKER 3, and the run does not finish properly. I run MAKER on a shared cluster through MPI, requesting 96 cores (so as not to block the MAKER manager or the file system) on 6 nodes of 16 cores each. The job runs for 3 or 4 days and then the output suddenly stops. The job stays alive for 3 more days, but the datastore index is not modified and the only files whose attributes change are the lock files. When the job exceeds the time limit (7 days), it is cancelled. I then inspect the datastore index and count the unique STARTED and FINISHED tags for each scaffold. The results are:
STARTED:3890
FINISHED:3378
So the analysis fails for about 500 scaffolds. When I inspect MAKER's STDERR output, I see this error line repeated ~2000 times, in different places:
substr outside of string at /mnt/home/users/pab_001_uma/pedro/software/maker/bin/../lib/PhatHit_utils.pm line 850
and near this line, the following:
ERROR: Failed while annotating transcripts
My questions are: what can I do to annotate the failed scaffolds? And is the large number of failed analyses the reason the job freezes?
Thanks in advance


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20180314/43255060/attachment-0002.html>

From vsoza at uw.edu  Wed Mar 14 18:21:26 2018
From: vsoza at uw.edu (Valerie Soza)
Date: Wed, 14 Mar 2018 17:21:26 -0700
Subject: [maker-devel] scaffolds missing from master_datastore_index.log and
 all.gff files
Message-ID: <6F2917B6-12D5-41D6-BD12-CDD7FDF0F5C8@uw.edu>

Hi MAKER community

I have done several rounds of training and annotations on an assembly that consists of 12027 scaffolds using the MAKER2 pipeline. I am running multiple instances of MAKER to speed up the process. I have noticed that the number of contigs (aka scaffolds) differs among the different rounds of annotations I have done in MAKER, ranging from 12024 to 12027 scaffolds. These counts were obtained with the SOBAcl tool to count the number of contigs from each all.gff file generated by the gff3_merge script included in MAKER.

In my latest round of annotations within MAKER, I have only obtained 12026 scaffolds using the SOBAcl tool on the all.gff file, indicating that I am missing 1 scaffold, even though there was no indication of any scaffolds as FAILED, RETRY, or SKIPPED.

To figure out what might be going on, I searched for STARTED and FINISHED scaffolds in the master_datastore_index.log and found that I had a different number of started vs. finished scaffolds, and none of these were equal to the total in the assembly of 12027.

$ grep STARTED Rwill7_master_datastore_index.log | sort | uniq | cut -f 1 | wc 
12024   12024  313247

3 started scaffolds missing from this file are LG01_ordered_scaffold_2, LG01_ordered_scaffold_3, and LG07_unordered_scaffold_86.

$ grep FINISHED Rwill7_master_datastore_index.log | sort | uniq | cut -f 1 | wc
12026   12026  313295

1 finished scaffold missing from this file is LG08_unordered_scaffold_90.

I then searched for these scaffolds in the all.gff file and found that the 3 missing started scaffolds were present, but the one missing finished scaffold (LG08_unordered_scaffold_90) was not. This scaffold (LG08_unordered_scaffold_90) should be in the gff3 as it had some repeat masking done on it as indicated by the query.masked.fasta file for this scaffold in its theVoid directory. 

After looking at the gff3_merge and fasta_merge scripts, it seems that only finished scaffolds are used to generate gff3 and fasta files so this explains why I am missing one scaffold (LG08_unordered_scaffold_90) for a total of only 12026 scaffolds in the all.gff file.

I am concerned that because the started and finished scaffolds are different in the master_datastore_index.log, that not all scaffolds are being output to the gff3 and fasta files generated by the MAKER scripts.

Any insights as to why I am getting different numbers of scaffolds indicated as started versus finished, and as to why all but 1 scaffold finished?

Thanks.

-Valerie

Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/


From d.ence at ufl.edu  Thu Mar 15 07:15:00 2018
From: d.ence at ufl.edu (Ence,daniel)
Date: Thu, 15 Mar 2018 13:15:00 +0000
Subject: [maker-devel] Fwd:  Some transcripts have no AED?
References: <8E389D32-28DE-4FE7-A455-15786C1B2EAF@mail.ufl.edu>
Message-ID: <8AE0B9E6-2496-43AE-BC79-9FDDD2A0CFD3@mail.ufl.edu>


Begin forwarded message:

From: "Ence,daniel" <d.ence at ufl.edu<mailto:d.ence at ufl.edu>>
Subject: Re: [maker-devel] Some transcripts have no AED?
Date: March 15, 2018 at 9:06:45 AM EDT
To: "wangzhennan at ioz.ac.cn<mailto:wangzhennan at ioz.ac.cn>" <wangzhennan at ioz.ac.cn<mailto:wangzhennan at ioz.ac.cn>>
Cc: "Ence,daniel" <d.ence at ufl.edu<mailto:d.ence at ufl.edu>>

Hi, I really can't help you without more information to answer either of your questions (number of genes or the transcripts without AEDs). If you ran MAKER with some evidence to annotate those 23000 genes, then some of those genes were probably not supported by the evidence. Did you give those 23000 genes as predictions or as models? Those are two different options in MAKER and would give different results.

~Daniel


On Mar 15, 2018, at 4:11 AM, wangzhennan at ioz.ac.cn<mailto:wangzhennan at ioz.ac.cn> wrote:


Hi,

   I am sorry that I did not explain my question clearly. I ran MAKER with an annotation file (GFF) containing 23000 genes, but the result has only 21553 genes. Why are there fewer genes than in the original file? Thank you very much!

   Best wishes.

   Wang


-----Original Messages-----
From:"Ence,daniel" <d.ence at ufl.edu<mailto:d.ence at ufl.edu>>
Sent Time:2018-03-14 19:33:01 (Wednesday)
To: "wangzhennan at ioz.ac.cn<mailto:wangzhennan at ioz.ac.cn>" <wangzhennan at ioz.ac.cn<mailto:wangzhennan at ioz.ac.cn>>
Cc: "maker-devel at yandell-lab.org<mailto:maker-devel at yandell-lab.org>" <maker-devel at yandell-lab.org<mailto:maker-devel at yandell-lab.org>>
Subject: Re: [maker-devel] Some transcripts have no AED?

Hi, can you send a few example lines? Do some transcripts have AEDs?

~Daniel

Sent from my iPhone

On Mar 13, 2018, at 23:54, "wangzhennan at ioz.ac.cn<mailto:wangzhennan at ioz.ac.cn>" <wangzhennan at ioz.ac.cn<mailto:wangzhennan at ioz.ac.cn>> wrote:


Hi,

   When I used MAKER to get AED values, some transcripts have no AED value in the result. Why? I ran MAKER with RNA-Seq and protein data. Thank you.

   Best wishes.

   Wang

T

_______________________________________________
maker-devel mailing list
maker-devel at box290.bluehost.com<mailto:maker-devel at box290.bluehost.com>
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20180315/fd2e8a08/attachment-0002.html>

From carsonhh at gmail.com  Thu Mar 15 08:57:37 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 15 Mar 2018 08:57:37 -0600
Subject: [maker-devel] Problems with failed contigs
In-Reply-To: <DB6PR0102MB27095AC27BB34FF5B0D7C1DED1D90@DB6PR0102MB2709.eurprd01.prod.exchangelabs.com>
References: <DB6PR0102MB27095AC27BB34FF5B0D7C1DED1D90@DB6PR0102MB2709.eurprd01.prod.exchangelabs.com>
Message-ID: <2F06777F-4BED-4CC8-B4F6-0B78D3EA9405@gmail.com>

First make sure you are using the most current version (3.01.02).

If you are providing a GFF3 file, it may be an issue with one or more of the features you are giving. Finally, there is a possibility this is a BLAST report issue, as there is a BLAST truncation bug that appears to be fixed in NCBI BLAST 2.7.1+.

--Carson
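
To check whether an installed BLAST is at least the 2.7.1+ release mentioned above, the version string can be compared numerically. A minimal sketch, assuming the version is reported in the usual `blastn: 2.2.30+` form (as printed by `blastn -version`); the helper names are hypothetical:

```python
import re


def parse_blast_version(text):
    """Extract (major, minor, patch) from text such as 'blastn: 2.2.30+'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def is_at_least(text, minimum=(2, 7, 1)):
    """True if the reported version is the minimum or newer."""
    # Tuples compare element by element, so (2, 2, 30) < (2, 7, 1).
    return parse_blast_version(text) >= minimum
```

Comparing integer tuples instead of strings avoids the trap where "2.10.0" sorts before "2.7.1" alphabetically.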

> On Mar 6, 2018, at 2:30 AM, p sz <seoanezonjic at hotmail.com> wrote:
> 
> Hi
> I am currently annotating a fish genome with MAKER 3, and the run does not finish properly. I run MAKER on a shared cluster through MPI, requesting 96 cores (so as not to block the MAKER manager or the file system) on 6 nodes of 16 cores each. The job runs for 3 or 4 days and then the output suddenly stops. The job stays alive for 3 more days, but the datastore index is not modified and the only files whose attributes change are the lock files. When the job exceeds the time limit (7 days), it is cancelled. I then inspect the datastore index and count the unique STARTED and FINISHED tags for each scaffold. The results are:
> STARTED:3890
> FINISHED:3378
> So the analysis fails for about 500 scaffolds. When I inspect MAKER's STDERR output, I see this error line repeated ~2000 times, in different places:
> substr outside of string at /mnt/home/users/pab_001_uma/pedro/software/maker/bin/../lib/PhatHit_utils.pm line 850
> and near this line, the following:
> ERROR: Failed while annotating transcripts
> My questions are: what can I do to annotate the failed scaffolds? And is the large number of failed analyses the reason the job freezes?
> Thanks in advance
> 
> 
> 
> 
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com <mailto:maker-devel at box290.bluehost.com>
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org <http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20180315/cfdb76f6/attachment-0002.html>

From carsonhh at gmail.com  Thu Mar 15 09:15:09 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 15 Mar 2018 09:15:09 -0600
Subject: [maker-devel] Using PASA assemblies with MAKER
In-Reply-To: <CAEW5o_M2KdOKnkB0KpVhDSTvMTBoqdGiimPpVHBoUCFWhbU+JA@mail.gmail.com>
References: <CAEW5o_M2KdOKnkB0KpVhDSTvMTBoqdGiimPpVHBoUCFWhbU+JA@mail.gmail.com>
Message-ID: <8542B83E-159A-47CA-A801-364386E0E2D2@gmail.com>

MAKER requires the two-level match/match_part format, an example of which can be found in the GFF3 specification --> https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md

I haven't used PASA, but the file you are providing may be in GFF2 or GTF format, which is not compatible with GFF3.

--Carson
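
Before renaming anything, it can help to see which top-level IDs are actually repeated in the file. The sketch below treats any feature that has an `ID` but no `Parent` attribute as top level and reports IDs that occur more than once. That definition is an assumption made for illustration; it is not taken from MAKER's own check, and the function names are hypothetical.

```python
from collections import Counter


def parse_attributes(column9):
    """Turn 'ID=x;Parent=y' into {'ID': 'x', 'Parent': 'y'}."""
    pairs = (field.split("=", 1) for field in column9.strip().split(";") if "=" in field)
    return {key.strip(): value.strip() for key, value in pairs}


def duplicate_top_level_ids(lines):
    """Return the sorted IDs of parentless features that appear more than once."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a feature line (for example, embedded FASTA)
        attrs = parse_attributes(cols[8])
        if "ID" in attrs and "Parent" not in attrs:
            counts[attrs["ID"]] += 1
    return sorted(feature_id for feature_id, n in counts.items() if n > 1)
```

An empty result suggests the IDs are not the problem under this definition, in which case the file format itself (GFF2/GTF versus GFF3) is the next thing to check.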


> On Mar 9, 2018, at 9:15 AM, Federico López <flopezo84 at gmail.com> wrote:
> 
> Hello,
> 
> I was wondering what might be the recommended option for using PASA2 alignment assemblies with MAKER3:
> 
> 1. PASA assemblies in FASTA format (est)
> 2. PASA assembly structures (est_gff)
> 3. ORFs from PASA assemblies (protein)
> 
> And related to this question, when I use the PASA2 assembly structures in GFF3 format, MAKER reports the error below.
> 
> "ERROR: Non-unique top level ID for..."
> 
> I suppose all the non-unique IDs need to be renamed for MAKER?
> 
> Any help is greatly appreciated.
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20180315/6bd04db1/attachment-0002.html>

From carsonhh at gmail.com  Thu Mar 15 09:20:08 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 15 Mar 2018 09:20:08 -0600
Subject: [maker-devel] Some transcripts have no AED?
In-Reply-To: <59a852e2.418b6.16222a46d59.Coremail.wangzhennan@ioz.ac.cn>
References: <59a852e2.418b6.16222a46d59.Coremail.wangzhennan@ioz.ac.cn>
Message-ID: <B9C2C216-B995-4478-91EE-3DBDF7A7F112@gmail.com>

Do they have no AED, or an AED value of 0? A value of 0 means a perfect match to the evidence.

Also make sure you are not looking for AED in the predictions (i.e. GFF3 source column snap/augustus that are of type match/match_part). Those are rejected models and will not have quality statistics calculated for them (they are only there as alignments for reference purposes).

Only models with the source column of "maker" and a type of "gene/mRNA/exon/CDS" represent the gene models and will have AED and QI values.

You can force MAKER to also calculate quality statistics for rejected models by setting pred_stats to 1 in the maker_opts.ctl file. The entry to alter, shown with its default value, is:

pred_stats=0 #report AED and QI statistics for all predictions as well as models

--Carson
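
To check only the records that are expected to carry an AED, the GFF3 can be filtered on the source and type columns described above. A minimal sketch, assuming a tab-separated MAKER GFF3 in which the value is stored in an `_AED` attribute on `mRNA` lines; the function names are hypothetical:

```python
def parse_attributes(column9):
    """Turn 'ID=x;_AED=0.30' into {'ID': 'x', '_AED': '0.30'}."""
    pairs = (field.split("=", 1) for field in column9.strip().split(";") if "=" in field)
    return {key.strip(): value.strip() for key, value in pairs}


def maker_mrnas_without_aed(lines):
    """List the IDs of source 'maker', type 'mRNA' features lacking an _AED attribute."""
    missing = []
    for line in lines:
        if line.startswith("##FASTA"):
            break  # sequence section follows; no more features
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[1] != "maker" or cols[2] != "mRNA":
            continue
        attrs = parse_attributes(cols[8])
        if "_AED" not in attrs:
            missing.append(attrs.get("ID", "<no ID>"))
    return missing
```

If this returns an empty list, the transcripts without AED are most likely match/match_part prediction records rather than gene models.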

> On Mar 13, 2018, at 9:53 PM, wangzhennan at ioz.ac.cn wrote:
> 
> Hi,
> 
>    When I used MAKER to get AED values, some transcripts have no AED value in the result. Why? I ran MAKER with RNA-Seq and protein data. Thank you.
> 
>    Best wishes.
> 
>    Wang
> 
> T
> 
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20180315/01a5c3ac/attachment-0002.html>

From carsonhh at gmail.com  Thu Mar 15 09:26:26 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 15 Mar 2018 09:26:26 -0600
Subject: [maker-devel] scaffolds missing from master_datastore_index.log
 and all.gff files
In-Reply-To: <6F2917B6-12D5-41D6-BD12-CDD7FDF0F5C8@uw.edu>
References: <6F2917B6-12D5-41D6-BD12-CDD7FDF0F5C8@uw.edu>
Message-ID: <A63D1D3B-F5D7-4F19-B5CD-B8D941978586@gmail.com>

If running multiple jobs of MAKER at the same time, you can hit a race condition where one MAKER run keeps another from making the correct entry into the datastore log.

You can delete the log and then run "maker -dsindex" to rebuild it with a single MAKER process (takes less than 5 minutes).

--Carson
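
After rebuilding the index, the log can be compared against the assembly to confirm that every scaffold is marked FINISHED. The sketch below assumes the log is tab-separated with the scaffold ID in the first column and the status (STARTED, FINISHED, and so on) in the third, and that the scaffold ID is the first word of each FASTA header. The function names are hypothetical.

```python
from collections import defaultdict


def ids_by_status(log_lines):
    """Map each status (e.g. 'FINISHED') to the set of scaffold IDs carrying it."""
    statuses = defaultdict(set)
    for line in log_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 3:
            statuses[cols[2]].add(cols[0])
    return statuses


def fasta_ids(fasta_lines):
    """Collect the first word of every FASTA header line."""
    ids = set()
    for line in fasta_lines:
        if line.startswith(">"):
            words = line[1:].split()
            if words:
                ids.add(words[0])
    return ids


def unfinished(fasta_lines, log_lines):
    """Scaffolds present in the assembly but never marked FINISHED in the log."""
    finished = ids_by_status(log_lines)["FINISHED"]
    return sorted(fasta_ids(fasta_lines) - finished)
```

Working with sets of IDs, rather than counting unique log lines, keeps the result independent of duplicate entries written by concurrent runs.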


> On Mar 14, 2018, at 6:21 PM, Valerie Soza <vsoza at uw.edu> wrote:
> 
> Hi MAKER community
> 
> I have done several rounds of training and annotations on an assembly that consists of 12027 scaffolds using the MAKER2 pipeline. I am running multiple instances of MAKER to speed up the process. I have noticed that the number of contigs (aka scaffolds) differs among the different rounds of annotations I have done in MAKER, ranging from 12024 to 12027 scaffolds. These counts were obtained with the SOBAcl tool to count the number of contigs from each all.gff file generated by the gff3_merge script included in MAKER.
> 
> In my latest round of annotations within MAKER, I have only obtained 12026 scaffolds using the SOBAcl tool on the all.gff file, indicating that I am missing 1 scaffold, even though there was no indication of any scaffolds as FAILED, RETRY, or SKIPPED.
> 
> To figure out what might be going on, I searched for STARTED and FINISHED scaffolds in the master_datastore_index.log and found that I had a different number of started vs. finished scaffolds, and none of these were equal to the total in the assembly of 12027.
> 
> $ grep STARTED Rwill7_master_datastore_index.log | sort | uniq | cut -f 1 | wc 
> 12024   12024  313247
> 
> 3 started scaffolds missing from this file are LG01_ordered_scaffold_2, LG01_ordered_scaffold_3, and LG07_unordered_scaffold_86.
> 
> $ grep FINISHED Rwill7_master_datastore_index.log | sort | uniq | cut -f 1 | wc
> 12026   12026  313295
> 
> 1 finished scaffold missing from this file is LG08_unordered_scaffold_90.
> 
> I then searched for these scaffolds in the all.gff file and found that the 3 missing started scaffolds were present, but the one missing finished scaffold (LG08_unordered_scaffold_90) was not. This scaffold (LG08_unordered_scaffold_90) should be in the gff3 as it had some repeat masking done on it as indicated by the query.masked.fasta file for this scaffold in its theVoid directory. 
> 
> After looking at the gff3_merge and fasta_merge scripts, it seems that only finished scaffolds are used to generate gff3 and fasta files so this explains why I am missing one scaffold (LG08_unordered_scaffold_90) for a total of only 12026 scaffolds in the all.gff file.
> 
> I am concerned that because the started and finished scaffolds are different in the master_datastore_index.log, that not all scaffolds are being output to the gff3 and fasta files generated by the MAKER scripts.
> 
> Any insights as to why I am getting different numbers of scaffolds indicated as started versus finished, and as to why all but 1 scaffold finished?
> 
> Thanks.
> 
> -Valerie
> 
> Valerie Soza, Ph.D.
> c/o Hall Lab
> Department of Biology
> University of Washington
> Johnson Hall 202A
> Box 351800
> Seattle, WA 98195-1800
> 206-543-6740
> http://staff.washington.edu/vsoza/
> 
> 
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org


From carsonhh at gmail.com  Thu Mar 15 09:31:31 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 15 Mar 2018 09:31:31 -0600
Subject: [maker-devel] how to output masked genome from MAKER
In-Reply-To: <FC69DF53-A058-4B3E-A612-BE8FC4652857@uw.edu>
References: <FC69DF53-A058-4B3E-A612-BE8FC4652857@uw.edu>
Message-ID: <15BF15D1-835F-4E8E-9B52-A958EF19BCD9@gmail.com>

You will just have to find and concatenate the files yourself.

Something like --> find assembly.maker.output -name 'query.masked.fasta' | xargs cat > assembly_masked.fasta

--Carson
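
The same find-and-concatenate step can be written as a small script, which is convenient when the result feeds into a longer pipeline. This is a sketch only: the directory and output names are placeholders, the function name is hypothetical, and the output file should be written outside the directory being searched.

```python
import shutil
from pathlib import Path


def concatenate_masked(maker_output_dir, out_path, name="query.masked.fasta"):
    """Append every file called `name` under maker_output_dir to out_path.

    Returns the number of files that were concatenated.
    """
    files = sorted(Path(maker_output_dir).rglob(name))
    with open(out_path, "wb") as out:
        for path in files:
            with open(path, "rb") as handle:
                # Stream the bytes across so large scaffolds are not held in memory.
                shutil.copyfileobj(handle, out)
    return len(files)
```

For example, `concatenate_masked("assembly.maker.output", "assembly_masked.fasta")` mirrors the shell command above and also reports how many scaffold files were found, which can be compared against the number of scaffolds in the assembly.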


> On Mar 7, 2018, at 2:19 PM, Valerie Soza <vsoza at uw.edu> wrote:
> 
> Hi MAKER community
> 
> I am wondering whether it is possible to get the entire masked genome generated by MAKER as an output. I think I have found it in pieces in the query.masked.fasta files generated for each scaffold by MAKER in theVoid directories. Is there a script that anyone has used to collate these files?
> 
> I am aware of the fasta_merge script that comes bundled with MAKER, but it does not seem to collate the query.masked.fasta files. Could this perl script be modified to do this action? 
> 
> Thanks for any help or insights.
> 
> -Valerie
> 
> Valerie Soza, Ph.D.
> c/o Hall Lab
> Department of Biology
> University of Washington
> Johnson Hall 202A
> Box 351800
> Seattle, WA 98195-1800
> 206-543-6740
> http://staff.washington.edu/vsoza/
> 
> 
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org


From vsoza at uw.edu  Thu Mar 15 12:18:46 2018
From: vsoza at uw.edu (Valerie Soza)
Date: Thu, 15 Mar 2018 11:18:46 -0700
Subject: [maker-devel] scaffolds missing from master_datastore_index.log
 and all.gff files
In-Reply-To: <A63D1D3B-F5D7-4F19-B5CD-B8D941978586@gmail.com>
References: <6F2917B6-12D5-41D6-BD12-CDD7FDF0F5C8@uw.edu>
	<A63D1D3B-F5D7-4F19-B5CD-B8D941978586@gmail.com>
Message-ID: <53B85802-8AB9-4DB4-AA8A-137DD72AF4D7@uw.edu>

Thanks, Carson, that worked. I am now getting all scaffolds in the assembly indicated as finished in the master_datastore_index.log and when I redo gff3_merge with this updated log, my all.gff file is complete too. Glad there is a quick fix for this in MAKER. Thank you, MAKER developers.

-Valerie

> On Mar 15, 2018, at 8:26 AM, Carson Holt <carsonhh at gmail.com> wrote:
> 
> If running multiple jobs of MAKER at the same time, you can hit a race condition where one MAKER run keeps another from making the correct entry into the datastore log.
> 
> You can delete the log and then run "maker -dsindex" to rebuild it with a single MAKER process (takes less than 5 minutes).
> 
> --Carson
> 
> 
> 
>> On Mar 14, 2018, at 6:21 PM, Valerie Soza <vsoza at uw.edu> wrote:
>> 
>> Hi MAKER community
>> 
>> I have done several rounds of training and annotations on an assembly that consists of 12027 scaffolds using the MAKER2 pipeline. I am running multiple instances of MAKER to speed up the process. I have noticed that the number of contigs (aka scaffolds) differs among the different rounds of annotations I have done in MAKER, ranging from 12024 to 12027 scaffolds. These counts were obtained with the SOBAcl tool to count the number of contigs from each all.gff file generated by the gff3_merge script included in MAKER.
>> 
>> In my latest round of annotations within MAKER, I have only obtained 12026 scaffolds using the SOBAcl tool on the all.gff file, indicating that I am missing 1 scaffold, even though there was no indication of any scaffolds as FAILED, RETRY, or SKIPPED.
>> 
>> To figure out what might be going on, I searched for STARTED and FINISHED scaffolds in the master_datastore_index.log and found that I had a different number of started vs. finished scaffolds, and none of these were equal to the total in the assembly of 12027.
>> 
>> $ grep STARTED Rwill7_master_datastore_index.log | sort | uniq | cut -f 1 | wc 
>> 12024   12024  313247
>> 
>> 3 started scaffolds missing from this file are LG01_ordered_scaffold_2, LG01_ordered_scaffold_3, and LG07_unordered_scaffold_86.
>> 
>> $ grep FINISHED Rwill7_master_datastore_index.log | sort | uniq | cut -f 1 | wc
>> 12026   12026  313295
>> 
>> 1 finished scaffold missing from this file is LG08_unordered_scaffold_90.
>> 
>> I then searched for these scaffolds in the all.gff file and found that the 3 missing started scaffolds were present, but the one missing finished scaffold (LG08_unordered_scaffold_90) was not. This scaffold (LG08_unordered_scaffold_90) should be in the gff3 as it had some repeat masking done on it as indicated by the query.masked.fasta file for this scaffold in its theVoid directory. 
>> 
>> After looking at the gff3_merge and fasta_merge scripts, it seems that only finished scaffolds are used to generate gff3 and fasta files so this explains why I am missing one scaffold (LG08_unordered_scaffold_90) for a total of only 12026 scaffolds in the all.gff file.
>> 
>> I am concerned that because the started and finished scaffolds are different in the master_datastore_index.log, that not all scaffolds are being output to the gff3 and fasta files generated by the MAKER scripts.
>> 
>> Any insights as to why I am getting different numbers of scaffolds indicated as started versus finished, and as to why all but 1 scaffold finished?
>> 
>> Thanks.
>> 
>> -Valerie
>> 
>> Valerie Soza, Ph.D.
>> c/o Hall Lab
>> Department of Biology
>> University of Washington
>> Johnson Hall 202A
>> Box 351800
>> Seattle, WA 98195-1800
>> 206-543-6740
>> http://staff.washington.edu/vsoza/
>> 
>> 
>> _______________________________________________
>> maker-devel mailing list
>> maker-devel at box290.bluehost.com
>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
> 

Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/


From seoanezonjic at hotmail.com  Fri Mar 16 03:33:28 2018
From: seoanezonjic at hotmail.com (p sz)
Date: Fri, 16 Mar 2018 09:33:28 +0000
Subject: [maker-devel] Problems with failed contigs
In-Reply-To: <2F06777F-4BED-4CC8-B4F6-0B78D3EA9405@gmail.com>
References: <DB6PR0102MB27095AC27BB34FF5B0D7C1DED1D90@DB6PR0102MB2709.eurprd01.prod.exchangelabs.com>,
	<2F06777F-4BED-4CC8-B4F6-0B78D3EA9405@gmail.com>
Message-ID: <DB6PR0102MB270989C66E53B92567ED2226D1D00@DB6PR0102MB2709.eurprd01.prod.exchangelabs.com>

I checked the MAKER version and it is 3.01.02, as you suggest. I am currently running MAKER to train SNAP (first round), so I am not using GFF files as input. My input files are all in FASTA format, set in the genome, est and protein variables. In fact, one of the error lines says that the transcript analysis breaks the execution. I use BLAST version 2.2.30+; maybe I have to use a newer version. Which version do you suggest?
Thank you in advance
________________________________
From: Carson Holt <carsonhh at gmail.com>
Sent: Thursday, March 15, 2018 2:57:37 PM
To: p sz
Cc: maker-devel at yandell-lab.org
Subject: Re: [maker-devel] Problems with failed contigs

First make sure you are using the most current version (3.01.02).

If you are providing a GFF3 file, it may be an issue with one or more of the features you are giving. Finally, there is a possibility this is a BLAST report issue, as there is a BLAST truncation bug that appears to be fixed in NCBI BLAST 2.7.1+.

--Carson

On Mar 6, 2018, at 2:30 AM, p sz <seoanezonjic at hotmail.com<mailto:seoanezonjic at hotmail.com>> wrote:

Hi
I am currently annotating a fish genome with MAKER 3, and the run does not finish properly. I run MAKER on a shared cluster through MPI, requesting 96 cores (so as not to block the MAKER manager or the file system) on 6 nodes of 16 cores each. The job runs for 3 or 4 days and then the output suddenly stops. The job stays alive for 3 more days, but the datastore index is not modified and the only files whose attributes change are the lock files. When the job exceeds the time limit (7 days), it is cancelled. I then inspect the datastore index and count the unique STARTED and FINISHED tags for each scaffold. The results are:
STARTED:3890
FINISHED:3378
So the analysis fails for about 500 scaffolds. When I inspect MAKER's STDERR output, I see this error line repeated ~2000 times, in different places:
substr outside of string at /mnt/home/users/pab_001_uma/pedro/software/maker/bin/../lib/PhatHit_utils.pm line 850
and near this line, the following:
ERROR: Failed while annotating transcripts
My questions are: what can I do to annotate the failed scaffolds? And is the large number of failed analyses the reason the job freezes?
Thanks in advance


_______________________________________________
maker-devel mailing list
maker-devel at box290.bluehost.com<mailto:maker-devel at box290.bluehost.com>
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20180316/5f91aa0a/attachment-0002.html>

From vsoza at uw.edu  Tue Mar 20 18:48:09 2018
From: vsoza at uw.edu (Valerie Soza)
Date: Tue, 20 Mar 2018 17:48:09 -0700
Subject: [maker-devel] clarification on creating a standard build
Message-ID: <ADACE3F2-BD4C-43B9-B887-D25FC51F562B@uw.edu>

Hi MAKER community

I am trying to create a standard build as indicated in the Campbell et al. 2014 papers in Plant Physiology and Current Protocols in Bioinformatics. I was following the protocol as outlined in Current Protocols in Bioinformatics, but then came across this thread in the MAKER google forum: https://groups.google.com/forum/#!searchin/maker-devel/quality_filter%7Csort:date/maker-devel/97aNJkT3bgk/mpL7V5QWAAAJ.

I can't reply to this original thread, but I am trying to follow Carson's suggestion for a standard build using this protocol instead now:
"One note I'd like to make, is that doing a second round with keep_preds=1 is the wrong procedure (only do that if you really want to keep everything - i.e. in some fungi or oomycetes). Rather you should use InterProScan to evaluate the rejected models in the non-overlapping.abinit.proteins.fasta file, then grep the ones that have an IPR domain out of the GFF3 (will be match/match_part features) and then pass them to pred_gff in a separate run (just updates the format to gene/mRNA/exon/CDS with proper reading frame). You can then merge the resulting GFF3's and fasta files."

Instead of doing a second round of annotations with keep_preds=1, I am using my original annotations with keep_preds=0. I have used InterProScan on the non-overlapping.abinit.proteins.fasta. I am unclear as to what gff3 file to use to grep for genes with IPR domains from the non-overlapping.abinit.proteins.fasta file. Genes from the non-overlapping.abinit.proteins.fasta file are not in my .all.gff file created by the gff3_merge script. 

What gff3 file should I be using to resurrect proteins with IPR domains from the non-overlapping.abinit.proteins.fasta? Should I be doing an annotation with keep_preds=1 as well, and resurrecting genes with IPR domains from this gff3?
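
The "grep the ones that have an IPR domain" step in the quoted suggestion could be sketched as below, once the right GFF3 has been identified. This is only an illustration under several assumptions: the InterProScan results are in its tab-separated output with the InterPro accession in column 12, the rejected predictions appear as match/match_part features, each match precedes its match_part lines, and the protein FASTA IDs correspond to the `Name` attribute of the match features (the attribute is a parameter precisely because that mapping is not confirmed here). All function names are hypothetical.

```python
def ids_with_ipr(tsv_lines):
    """Protein IDs that have at least one InterPro (IPR) accession in column 12."""
    ids = set()
    for line in tsv_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 12 and cols[11].startswith("IPR"):
            ids.add(cols[0])
    return ids


def parse_attributes(column9):
    """Turn 'ID=x;Name=y' into {'ID': 'x', 'Name': 'y'}."""
    pairs = (field.split("=", 1) for field in column9.strip().split(";") if "=" in field)
    return {key.strip(): value.strip() for key, value in pairs}


def select_features(gff_lines, wanted, key="Name"):
    """Keep match features whose `key` attribute is wanted, plus their match_parts."""
    kept_ids = set()
    selected = []
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] not in ("match", "match_part"):
            continue
        attrs = parse_attributes(cols[8])
        if cols[2] == "match" and attrs.get(key) in wanted:
            kept_ids.add(attrs.get("ID"))
            selected.append(line)
        elif cols[2] == "match_part" and attrs.get("Parent") in kept_ids:
            selected.append(line)
    return selected
```

The selected lines would then be written to a file and supplied through pred_gff in a separate run, as the quoted text describes.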

Thanks.

-Valerie

Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/


From urmi208 at gmail.com  Wed Mar 21 03:05:42 2018
From: urmi208 at gmail.com (Urmi)
Date: Wed, 21 Mar 2018 09:05:42 +0000
Subject: [maker-devel] Gene loss in subsequent round of maker for fungal
 genome annotation
Message-ID: <CAGe_+EuV7r9wcw76B9STOFwzU3d5aifDP-Zp=KOZd7vZskZNfw@mail.gmail.com>

Hello maker community,

I am trying to run maker 3.01.02-beta on a fungal genome. I am using
available EST and protein sequences from a different strain of the same
species using parameters "est" and "protein" in the maker_opts.ctl file.
Here is the protocol I am using:

   1. Run maker with repeat masking and providing transcript and protein
   sequences from related species (Run A)
   2. Create SNAP model with CEGMA
   3. Train Augustus with BUSCO
   4. Run (run B ) with the new SNAP (done at step 2) and augustus species
   with options turned off (est2genome=0) and (protein2genome=0) data, provide
   gff file (altest_gff=runA_cdna2genome.gff,
   protein_gff=runA_protein2genome.gff3)
   5. Create SNAP model from run B.
   6. Train Augustus with transcripts from run B and BUSCO
   7. Run (run C ) with the new SNAP (done at step 5) and augustus species
   with options turned off (est2genome=0) and (protein2genome=0) data, provide
   gff file (altest_gff=runA_cdna2genome.gff,
   protein_gff=runA_protein2genome.gff3), keep_preds=1

As a result of this, I get following gene numbers:

   - run A: 12796 total genes out of which 12771 have AED < 0.5
   - run B:10713 total genes out of which 10701 have AED < 0.5
   - run C: 12651 total genes out of which 12582 have AED < 0.5

Looking at the gff files in detail, it is observed that there are some
gene models in run A which are lost in run B and regained in run C. I don't
understand why there is gene loss in run B. Here is an example:

*RunA*

contig1 maker   gene    20468   21193   .       +       .       ID=maker-contig1-exonerate_protein2genome-gene-0.34;Name=maker-contig1-exonerate_protein2genome-gene-0.34
contig1 maker   mRNA    20468   21193   100     +       .       ID=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1;Parent=maker-contig1-exonerate_protein2genome-gene-0.34;Name=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1;_AED=0.30;_eAED=0.30;_QI=0|-1|0|1|-1|0|1|0|241
contig1 maker   exon    20468   21193   .       +       .       ID=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1:1;Parent=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1
contig1 maker   CDS     20468   21193   .       +       0       ID=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1:cds;Parent=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1
contig1 blastn  expressed_sequence_match        20468   21193   726     +       .       ID=contig1:hit:983:3.2.0.0;Name=jgi|test_1|140804|est target_length=726
contig1 blastn  match_part      20468   21193   726     +       .       ID=contig1:hsp:998:3.2.0.0;Parent=contig1:hit:983:3.2.0.0;Target=jgi|test_1|140804|est 1 726 +;Gap=M726
contig1 est2genome      expressed_sequence_match        20468   21193   3630    +       .       ID=contig1:hit:1022:3.2.0.0;Name=jgi|test_1|140804|est;target_length=726;aligned_coverage=100;aligned_identity=100
contig1 est2genome      match_part      20468   21193   3630    +       .       ID=contig1:hsp:1110:3.2.0.0;Parent=contig1:hit:1022:3.2.0.0;Target=jgi|test_1|140804|est 1 726 +;Gap=M726

*RunB:*

> contig1 est_gff:est2genome      expressed_sequence_match        20468
>>  21193   3630    +       .
>>  ID=contig1:hit:1051:3.12.0.0;Name=jgi|test_1|140804|est;target_length=726;aligned_coverage=100;aligned_identity=100;aligned_coverage=100;aligned_identity=100;score=3630;target_length=726
>
> contig1 est_gff:est2genome      match_part      20468   21193   3630    +
>>      .
>>  ID=contig1:hsp:1166:3.12.0.0;Parent=contig1:hit:1051:3.12.0.0;Target=jgi|test_1|140804|est
>> 1 726 +;Gap=M726
>
>
*RunC: *

> contig1 maker   gene    20468   21193   .       +       .
>>  ID=snap_masked-contig1-processed-gene-0.5;Name=snap_masked-contig1-processed-gene-0.5
>
> contig1 maker   mRNA    20468   21193   .       +       .
>>  ID=snap_masked-contig1-processed-gene-0.5-mRNA-1;Parent=snap_masked-contig1-processed-gene-0.5;Name=snap_masked-contig1-processed-gene-0.5-mRNA-1;_AED=0.30;_eAED=0.30;_QI=0|-1|0|1|-1|1|1|0|241;_merge_warning=1
>
> contig1 maker   exon    20468   21193   .       +       .
>>  ID=snap_masked-contig1-processed-gene-0.5-mRNA-1:1;Parent=snap_masked-contig1-processed-gene-0.5-mRNA-1
>
> contig1 maker   CDS     20468   21193   .       +       0
>>  ID=snap_masked-contig1-processed-gene-0.5-mRNA-1:cds;Parent=snap_masked-contig1-processed-gene-0.5-mRNA-1
>
> contig1 snap_masked     match   20468   21193   42.956  +       .
>>  ID=contig1:hit:5240:4.5.0.0;Name=snap_masked-contig1-abinit-gene-0.5-mRNA-1;target_length=4075195
>
> contig1 snap_masked     match_part      20468   21193   42.956  +       .
>>
>>  ID=contig1:hsp:12911:4.5.0.0;Parent=contig1:hit:5240:4.5.0.0;Target=snap_masked-contig1-abinit-gene-0.5-mRNA-1
>> 1 726 +;Gap=M726
>
> contig1 est_gff:est2genome      expressed_sequence_match        20468
>>  21193   3630    +       .
>>  ID=contig1:hit:1051:3.12.0.0;Name=jgi|test_1|140804|est;target_length=726;aligned_coverage=100;aligned_identity=100;aligned_coverage=100;aligned_identity=100;score=3630;target_length=726
>
> contig1 est_gff:est2genome      match_part      20468   21193   3630    +
>>      .
>>  ID=contig1:hsp:1166:3.12.0.0;Parent=contig1:hit:1051:3.12.0.0;Target=jgi|test_1|140804|est
>> 1 726 +;Gap=M726
>
>
Please could anyone shed come light on this?


Many thanks in advance.

Urmi

From urmi208 at gmail.com  Wed Mar 21 03:24:32 2018
From: urmi208 at gmail.com (Urmi)
Date: Wed, 21 Mar 2018 09:24:32 +0000
Subject: [maker-devel] Gene loss in subsequent round of maker for fungal
 genome annotation
In-Reply-To: <CAGe_+EuV7r9wcw76B9STOFwzU3d5aifDP-Zp=KOZd7vZskZNfw@mail.gmail.com>
References: <CAGe_+EuV7r9wcw76B9STOFwzU3d5aifDP-Zp=KOZd7vZskZNfw@mail.gmail.com>
Message-ID: <CAGe_+EsxV0KWrbYt01RMOzBDSvUBGMmaMu=c4t4ubfWPmsuEyQ@mail.gmail.com>

Further to this, I ran InterProScan on all three runs, and 100% of the
genes in each of them have protein domains. I am unsure which one
I should consider the best annotation. I am sorry for so many questions,
but I am very new to maker.

Thanks again for any help you could provide.



-- 
"The only way of finding the limits of the possible is by going beyond them
into the impossible." - Arthur C. Clarke

From carsonhh at gmail.com  Fri Mar 23 11:20:22 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 23 Mar 2018 11:20:22 -0600
Subject: [maker-devel] clarification on creating a standard build
In-Reply-To: <ADACE3F2-BD4C-43B9-B887-D25FC51F562B@uw.edu>
References: <ADACE3F2-BD4C-43B9-B887-D25FC51F562B@uw.edu>
Message-ID: <4D9CD374-D142-4387-A4D7-1FB7A8A74F48@gmail.com>

You will get the ID from the InterProScan report. Then you can take that ID and grep for it in the MAKER GFF3.

All ab initio predictions have match/match_part features created for them in the GFF3. You can then take those non-gene match/match_part features and provide them to pred_gff.

You then have two alternative ways to get those models into your dataset.

1. Do a second run with only that pred_gff (i.e. turn off all other MAKER options and blank out all evidence including repeat masking options) and set keep_preds=1.

That will simply take the pred_gff match/match_part values and turn them into nicely formatted gene/mRNA/exon/CDS features together with associated fasta files. Those can then simply be merged into your current result using GFF3 merge.

2. Provide maker_gff (set pred_pass=0 and all other pass options to 1), provide pred_gff, and set keep_preds=1.

This is the same as the previous run option, but MAKER will do the merging for you. But it will take longer since it will use the maker_gff to rebuild all models and evidence in memory and rescore everything.
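
The extraction step (pulling the match/match_part features for the chosen predictions into a file for pred_gff) could be scripted roughly as follows. This is a sketch in Python, not part of MAKER: it assumes each ab initio prediction appears as a match feature with a Name attribute and that its match_part children refer to it through Parent. The function name select_pred_features is hypothetical.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column (key=value;key=value) into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def feature_rows(gff_lines):
    """Yield (raw_line, columns) for 9-column feature lines before ##FASTA."""
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9:
            yield line.rstrip("\n"), cols


def select_pred_features(gff_lines, wanted_names):
    """Return match lines whose Name is wanted, plus their match_part lines."""
    rows = list(feature_rows(gff_lines))
    # First pass: remember the IDs of the matches we want to keep.
    keep_ids = set()
    for _, cols in rows:
        if cols[2] == "match":
            attrs = parse_attributes(cols[8])
            if attrs.get("Name") in wanted_names and "ID" in attrs:
                keep_ids.add(attrs["ID"])
    # Second pass: emit kept matches and their children in file order.
    selected = []
    for raw, cols in rows:
        attrs = parse_attributes(cols[8])
        if cols[2] == "match" and attrs.get("ID") in keep_ids:
            selected.append(raw)
        elif cols[2] == "match_part" and attrs.get("Parent") in keep_ids:
            selected.append(raw)
    return selected
```

The returned lines can be written to a file (for example resurrect.gff, a placeholder name) and that file given to pred_gff in the control file.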

--Carson


> On Mar 20, 2018, at 6:48 PM, Valerie Soza <vsoza at uw.edu> wrote:
> 
> Hi MAKER community
> 
> I am trying to create a standard build as indicated in the Campbell et al. 2014 papers in Plant Physiology and Current Protocols in Bioinformatics. I was following the protocol as outlined in Current Protocols in Bioinformatics, but then came across this thread in the MAKER google forum: https://groups.google.com/forum/#!searchin/maker-devel/quality_filter%7Csort:date/maker-devel/97aNJkT3bgk/mpL7V5QWAAAJ.
> 
> I can't reply to this original thread, but I am trying to follow Carson's suggestion for a standard build using this protocol instead now:
> "One note I'd like to make is that doing a second round with keep_preds=1 is the wrong procedure (only do that if you really want to keep everything - i.e. in some fungi or oomycetes). Rather you should use InterProScan to evaluate the rejected models in the non-overlapping.abinit.proteins.fasta file, then grep the ones that have an IPR domain out of the GFF3 (will be match/match_part features) and then pass them to pred_gff in a separate run (just updates the format to gene/mRNA/exon/CDS with proper reading frame). You can then merge the resulting GFF3's and fasta files."
> 
> Instead of doing a second round of annotations with keep_preds=1, I am using my original annotations with keep_preds=0. I have used InterProScan on the non-overlapping.abinit.proteins.fasta. I am unclear as to what gff3 file to use to grep for genes with IPR domains from the non-overlapping.abinit.proteins.fasta file. Genes from the non-overlapping.abinit.proteins.fasta file are not in my .all.gff file created by the gff3_merge script. 
> 
> What gff3 file should I be using to resurrect proteins with IPR domains from the non-overlapping.abinit.proteins.fasta? Should I be doing an annotation with keep_preds=1 as well, and resurrecting genes with IPR domains from this gff3?
> 
> Thanks.
> 
> -Valerie
> 
> Valerie Soza, Ph.D.
> c/o Hall Lab
> Department of Biology
> University of Washington
> Johnson Hall 202A
> Box 351800
> Seattle, WA 98195-1800
> 206-543-6740
> http://staff.washington.edu/vsoza/
> 
> 
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org


From carsonhh at gmail.com  Fri Mar 23 11:28:50 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 23 Mar 2018 11:28:50 -0600
Subject: [maker-devel] Gene loss in subsequent round of maker for fungal
 genome annotation
In-Reply-To: <CAGe_+EuV7r9wcw76B9STOFwzU3d5aifDP-Zp=KOZd7vZskZNfw@mail.gmail.com>
References: <CAGe_+EuV7r9wcw76B9STOFwzU3d5aifDP-Zp=KOZd7vZskZNfw@mail.gmail.com>
Message-ID: <7BA23B5D-01FE-4766-B7B8-756E0385453D@gmail.com>

Run A -> no gene prediction, just cut and paste of transcript/protein alignments to generate rough models.
Run B -> Gene predictions based on training using only a highly conserved subset of genes (you will have low sensitivity).
Run C -> Gene predictions based on training using a broader gene set. Higher sensitivity but potentially lower specificity (sensitivity gains should outweigh any specificity loss).
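
To list which loci drop out between two runs, the gene features of the two GFF3 files can be compared by position. The Python sketch below treats a gene from the first run as lost when no gene on the same sequence and strand overlaps it in the second run. That overlap rule and the function names are assumptions made for illustration, not something MAKER reports itself.

```python
from collections import defaultdict


def gene_intervals(gff_lines):
    """Map (seqid, strand) -> list of (start, end, gene_id) for gene features."""
    genes = defaultdict(list)
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue
        gene_id = ""
        for field in cols[8].split(";"):
            if field.startswith("ID="):
                gene_id = field[3:]
        genes[(cols[0], cols[6])].append((int(cols[3]), int(cols[4]), gene_id))
    return genes


def lost_genes(first_run_lines, second_run_lines):
    """Return IDs of first-run genes with no overlapping second-run gene."""
    second = gene_intervals(second_run_lines)
    lost = []
    for key, items in gene_intervals(first_run_lines).items():
        others = second.get(key, [])
        for start, end, gene_id in items:
            # Two closed intervals overlap when each starts before the other ends.
            if not any(s <= end and start <= e for s, e, _ in others):
                lost.append(gene_id)
    return lost
```

A pairwise scan like this is quadratic per sequence, which is acceptable for a one-off check; an interval index would be a better fit for large genomes.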

Finally, make sure you look at models in a browser to see how well evidence and models overlap. If gene fusion is an issue (falsely merged mRNA-seq assembly results will generate hints that can cause gene predictors to fuse gene models), try deFusion -> https://wjidea.github.io/defusion/installation.html

--Carson



From urmi208 at gmail.com  Mon Mar 26 01:28:21 2018
From: urmi208 at gmail.com (Urmi)
Date: Mon, 26 Mar 2018 08:28:21 +0100
Subject: [maker-devel] Gene loss in subsequent round of maker for fungal
 genome annotation
In-Reply-To: <7BA23B5D-01FE-4766-B7B8-756E0385453D@gmail.com>
References: <CAGe_+EuV7r9wcw76B9STOFwzU3d5aifDP-Zp=KOZd7vZskZNfw@mail.gmail.com>
	<7BA23B5D-01FE-4766-B7B8-756E0385453D@gmail.com>
Message-ID: <CAGe_+EuGU0P4OdHR2cxvNSAKQN24FvW3-9YEFv70uNvDZYxVmQ@mail.gmail.com>

That's great! Thanks for the tips Carson.

Urmi


From vsoza at uw.edu  Mon Mar 26 12:49:24 2018
From: vsoza at uw.edu (Valerie Soza)
Date: Mon, 26 Mar 2018 11:49:24 -0700
Subject: [maker-devel] clarification on creating a standard build
In-Reply-To: <4D9CD374-D142-4387-A4D7-1FB7A8A74F48@gmail.com>
References: <ADACE3F2-BD4C-43B9-B887-D25FC51F562B@uw.edu>
	<4D9CD374-D142-4387-A4D7-1FB7A8A74F48@gmail.com>
Message-ID: <57B30565-1603-4723-AF74-FEB54F735899@uw.edu>

Hi Carson

Thanks for the clarification on the steps, but I am having trouble with the first one: grepping the IDs from the InterProScan .tsv file in my .gff file returns nothing. I think the IDs used in maker.non_overlapping_ab_initio.proteins.fasta are different from the ones actually used in the .gff file. Please see my commands/results below.

I created the .gff file by this command:
gff3_merge -d Rwill7_master_datastore_index.log

I created the .fasta files by this command:
fasta_merge -d Rwill7_master_datastore_index.log

I ran InterProScan with this command:
interproscan.sh -appl PfamA -iprlookup -goterms -f tsv -i Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta

When I try grepping the IDs in the Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv from Rwill7.all.gff, I get nothing. See an example for one ID from the .tsv file below:
 
$ more Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv

snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1	d146190e642a740520c97a782a74fe32	356	Pfam	PF13365	Trypsin-like peptidase domain	77	228	1.4E-17	T	20-03-2018

$ grep snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1 Rwill7.all.gff
#no results

There is no "processed-gene" with this ID in the Rwill7.all.gff file:

$ grep snap_masked-LG12_ordered_scaffold_85-processed-gene Rwill7.all.gff

LG12_ordered_scaffold_85	maker	gene	63727	67053	.	+	.	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8;Name=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8
LG12_ordered_scaffold_85	maker	mRNA	63727	67053	.	+	.	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8;Name=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1;_AED=0.81;_eAED=0.81;_QI=0|0|0|0|1|1|6|0|200
LG12_ordered_scaffold_85	maker	exon	63727	63768	.	+	.	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42245;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
LG12_ordered_scaffold_85	maker	exon	64269	64340	.	+	.	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42246;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
LG12_ordered_scaffold_85	maker	exon	64896	65000	.	+	.	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42247;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
LG12_ordered_scaffold_85	maker	exon	65268	65327	.	+	.	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42248;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
LG12_ordered_scaffold_85	maker	exon	65716	65915	.	+	.	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42249;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
LG12_ordered_scaffold_85	maker	exon	66930	67053	.	+	.	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42250;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
LG12_ordered_scaffold_85	maker	CDS	63727	63768	.	+	0	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
LG12_ordered_scaffold_85	maker	CDS	64269	64340	.	+	0	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
LG12_ordered_scaffold_85	maker	CDS	64896	65000	.	+	0	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
LG12_ordered_scaffold_85	maker	CDS	65268	65327	.	+	0	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
LG12_ordered_scaffold_85	maker	CDS	65716	65915	.	+	0	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
LG12_ordered_scaffold_85	maker	CDS	66930	67053	.	+	1	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1

However, there are Names and Targets called "abinit-gene", not "processed-gene", that appear to have this gene number in the .gff file:

$ grep snap_masked-LG12_ordered_scaffold_85 Rwill7.all.gff

#some results from command above that might be snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1?

LG12_ordered_scaffold_85	snap_masked	match	101798	108141	35.366	+	ID=LG12_ordered_scaffold_85:hit:469677:4.5.0.0;Name=snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1
LG12_ordered_scaffold_85	snap_masked	match_part	101798	102633	35.236	ID=LG12_ordered_scaffold_85:hsp:1369290:4.5.0.0;Parent=LG12_ordered_scaffold_85:hit:469677:4.5.0.0;Target=snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1 1 836 +;Gap=M836
LG12_ordered_scaffold_85	snap_masked	match_part	107907	108141	0.130	ID=LG12_ordered_scaffold_85:hsp:1369291:4.5.0.0;Parent=LG12_ordered_scaffold_85:hit:469677:4.5.0.0;Target=snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1 837 1071 +;Gap=M235

So I looked at the maker.non_overlapping_ab_initio.proteins.fasta file to see if this "abinit-gene" from the .gff file was present:

$ grep snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1 Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta
#no results using the "abinit-gene" Name from the .gff file

versus using the ID from the .tsv file to grep against the maker.non_overlapping_ab_initio.proteins.fasta file:

$ grep snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1 Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta
>snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1 protein AED:1.00 eAED:1.00 QI:0|0|0|0|1|1|2|0|356

I think the issue is that the models named "abinit-gene" in the .gff file are named "processed-gene" in Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta, and that name is then propagated into the .tsv file. Is my interpretation correct?

If so, all I would have to do is replace "processed" with "abinit" in the IDs before grepping the .gff file, correct?

Thanks for your help.

-Valerie

> On Mar 23, 2018, at 10:20 AM, Carson Holt <carsonhh at gmail.com> wrote:
> 
> You will get the ID from the InterProScan report. Then you can take that ID and grep for it in the MAKER gff3.
> 
> All ab initio predictions have match/match_part features created for them in the gff3. You can then take those non-gene match/match_part features and provide them to pred_gff.
> 
> You then have two alternate ways to get those models into your dataset.
> 
> 1. Do a second run with only that pred_gff (i.e. turn off all other MAKER options and blank out all evidence including repeat masking options) and set keep_preds=1.
> 
> That will simply take the pred_gff match/match_part values and turn them into nicely formatted gene/mRNA/exon/CDS features together with associated fasta files. Those can then simply be merged into your current result using gff3_merge.
> 
> 2. Provide maker_gff (set pred_pass=0 and all other pass options to 1), provide pred_gff, and set keep_preds=1.
> 
> This is the same as the previous run option, but MAKER will do the merging for you. But it will take longer since it will use the maker_gff to rebuild all models and evidence in memory and rescore everything.
> 
> -Carson
> 
> 
> 
>> On Mar 20, 2018, at 6:48 PM, Valerie Soza <vsoza at uw.edu> wrote:
>> 
>> Hi MAKER community
>> 
>> I am trying to create a standard build as indicated in the Campbell et al. 2014 papers in Plant Physiology and Current Protocols in Bioinformatics. I was following the protocol as outlined in Current Protocols in Bioinformatics, but then came across this thread in the MAKER google forum: https://groups.google.com/forum/#!searchin/maker-devel/quality_filter%7Csort:date/maker-devel/97aNJkT3bgk/mpL7V5QWAAAJ.
>> 
>> I can't reply to this original thread, but I am trying to follow Carson's suggestion for a standard build using this protocol instead now:
>> "One note I'd like to make, is that doing a second round with keep_preds=1 is the wrong procedure (only do that if you really want to keep everything - i.e. in some fungi or oomycetes). Rather you should use InterProScan to evaluate the rejected models in the non-overlapping.abinit.proteins.fasta file, then grep the ones that have an IPR domain out of the GFF3 (will be match/match_part features) and then pass them to pred_gff in a separate run (just updates the format to gene/mRNA/exon/CDS with proper reading frame). You can then merge the resulting GFF3's and fasta files."
>> 
>> Instead of doing a second round of annotations with keep_preds=1, I am using my original annotations with keep_preds=0. I have used InterProScan on the non-overlapping.abinit.proteins.fasta. I am unclear as to what gff3 file to use to grep for genes with IPR domains from the non-overlapping.abinit.proteins.fasta file. Genes from the non-overlapping.abinit.proteins.fasta file are not in my .all.gff file created by the gff3_merge script. 
>> 
>> What gff3 file should I be using to resurrect proteins with IPR domains from the non-overlapping.abinit.proteins.fasta? Should I be doing an annotation with keep_preds=1 as well, and resurrecting genes with IPR domains from this gff3?
>> 
>> Thanks.
>> 
>> -Valerie
>> 
>> Valerie Soza, Ph.D.
>> c/o Hall Lab
>> Department of Biology
>> University of Washington
>> Johnson Hall 202A
>> Box 351800
>> Seattle, WA 98195-1800
>> 206-543-6740
>> http://staff.washington.edu/vsoza/
>> 
>> 
>> _______________________________________________
>> maker-devel mailing list
>> maker-devel at box290.bluehost.com
>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
> 

Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/


From vsoza at uw.edu  Tue Mar 27 10:50:38 2018
From: vsoza at uw.edu (Valerie Soza)
Date: Tue, 27 Mar 2018 09:50:38 -0700
Subject: [maker-devel] how to output masked genome from MAKER
In-Reply-To: <15BF15D1-835F-4E8E-9B52-A958EF19BCD9@gmail.com>
References: <FC69DF53-A058-4B3E-A612-BE8FC4652857@uw.edu>
	<15BF15D1-835F-4E8E-9B52-A958EF19BCD9@gmail.com>
Message-ID: <7264C880-3806-443D-9CC2-E70D4366CD8A@uw.edu>

Hi Carson

Thanks, that is simple and it worked.

I did the following to sort and concatenate the query.masked.fasta files into one fasta:

$ find Rwill7.maker.output -name 'query.masked.fasta' | sort -t "/" -k 5 | xargs cat > Rwill7.maker.assembly_masked.sorted.fasta
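
A variant of the command above, as a minimal sketch only. It assumes GNU find, sort and xargs (for -print0, -z and -0) so that paths containing spaces survive, and it orders the files by full path rather than by the fifth path component. The function name merge_masked is just illustrative.

```shell
#!/usr/bin/env bash
# Sketch: concatenate every per-scaffold query.masked.fasta found under a
# MAKER output directory into one FASTA file, in path-sorted order.
# Assumes GNU find/sort/xargs; merge_masked is a hypothetical helper name.
merge_masked() {
    local outdir=$1 merged=$2
    # NUL-delimited paths keep unusual file names intact through the pipe.
    find "$outdir" -name 'query.masked.fasta' -print0 \
        | sort -z \
        | xargs -0 cat > "$merged"
}
```

It would be called as: merge_masked Rwill7.maker.output Rwill7.maker.assembly_masked.sorted.fasta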

-Valerie

> On Mar 15, 2018, at 8:31 AM, Carson Holt <carsonhh at gmail.com> wrote:
> 
> You will just have to find and concatenate the files yourself.
> 
> Something like: find assembly.maker.output -name 'query.masked.fasta' | xargs cat > assembly_masked.fasta
> 
> -Carson
> 
> 
>> On Mar 7, 2018, at 2:19 PM, Valerie Soza <vsoza at uw.edu> wrote:
>> 
>> Hi MAKER community
>> 
>> I am wondering whether it is possible to get the entire masked genome generated by MAKER as an output. I think I have found it in pieces in the query.masked.fasta files generated for each scaffold by MAKER in theVoid directories. Is there a script that anyone has used to collate these files?
>> 
>> I am aware of the fasta_merge script that comes bundled with MAKER, but it does not seem to collate the query.masked.fasta files. Could this perl script be modified to do this action?
>> 
>> Thanks for any help or insights.
>> 
>> -Valerie
>> 
>> Valerie Soza, Ph.D.
>> c/o Hall Lab
>> Department of Biology
>> University of Washington
>> Johnson Hall 202A
>> Box 351800
>> Seattle, WA 98195-1800
>> 206-543-6740
>> http://staff.washington.edu/vsoza/
>> 
>> 
>> _______________________________________________
>> maker-devel mailing list
>> maker-devel at box290.bluehost.com
>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
> 

Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/


From vsoza at uw.edu  Thu Mar 29 12:42:28 2018
From: vsoza at uw.edu (Valerie Soza)
Date: Thu, 29 Mar 2018 11:42:28 -0700
Subject: [maker-devel] clarification on creating a standard build
In-Reply-To: <57B30565-1603-4723-AF74-FEB54F735899@uw.edu>
References: <ADACE3F2-BD4C-43B9-B887-D25FC51F562B@uw.edu>
	<4D9CD374-D142-4387-A4D7-1FB7A8A74F48@gmail.com>
	<57B30565-1603-4723-AF74-FEB54F735899@uw.edu>
Message-ID: <624D4D7C-2BB3-4A1D-9088-116807492D2E@uw.edu>

Hi MAKER community,

I was having issues grepping IDs from the InterProScan .tsv file against my all.gff file because the abinit genes in the all.gff file are called processed genes in the .all.maker.non_overlapping_ab_initio.proteins.fasta file, which gets propagated into the .tsv file.

I solved this issue by replacing "processed" with "abinit" in the .tsv file and then grepping the all.gff file with these IDs to create pred_gff. 

sed 's/-processed-/-abinit-/g' Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv > Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed

Then I extracted only the IDs from the .tsv file to grep against the all.gff file.

cut -f 1 Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed > Rwill7.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs

I was getting non-unique top level ID errors when I tried to run MAKER with pred_gff, and I realized that IDs were duplicated in the .tsv file. So I went back to my list of IDs and extracted only the unique IDs to grep.

sort Rwill7.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs | uniq > Rwill7.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs  

Then I redid my grepping and maker ran beautifully. I tried both the options recommended by Carson, but then wound up going with option #2 cause I am lazy and now I have a standard build :)
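
The steps above can be sketched as a single function. This is only an illustration, not a MAKER tool: make_pred_gff and its file arguments are hypothetical names. It assumes that the IDs differ only by "-processed-" versus "-abinit-", and that the ab initio models appear as match/match_part lines whose Name= or Target= attribute carries the ID, as in the examples earlier in this thread. IDs are compared exactly rather than with grep, so that an ID ending in -mRNA-1 cannot also pick up -mRNA-10 and the "." in an ID is not treated as a wildcard.

```shell
#!/usr/bin/env bash
# Sketch: build a pred_gff file from InterProScan hits on rejected ab initio
# models. make_pred_gff is a hypothetical helper; arguments are
#   $1 InterProScan TSV, $2 merged MAKER GFF3, $3 output GFF3.
make_pred_gff() {
    local tsv=$1 gff=$2 out=$3 ids
    ids=$(mktemp)
    # Column 1 of the TSV is the protein ID; rename it and drop duplicates.
    cut -f 1 "$tsv" | sed 's/-processed-/-abinit-/' | sort -u > "$ids"
    # Keep match/match_part lines whose Name= or Target= ID is in the list.
    awk -F '\t' '
        NR == FNR { want[$1] = 1; next }
        $3 == "match" || $3 == "match_part" {
            id = ""
            if (match($0, /(Name|Target)=[^; ]+/)) {
                id = substr($0, RSTART, RLENGTH)
                sub(/^[^=]+=/, "", id)      # strip the attribute key
            }
            if (id in want) print
        }
    ' "$ids" "$gff" > "$out"
    rm -f "$ids"
}
```

It would be called as: make_pred_gff Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv Rwill7.all.gff pred.gff (the output name pred.gff is a placeholder).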

-Valerie


> On Mar 26, 2018, at 11:49 AM, Valerie Soza <vsoza at uw.edu> wrote:
> 
> Hi Carson
> 
> Thanks for the clarification on steps, but I am having issues with the first step of grepping the ID from the InterProScan .tsv file in my .gff file. I am getting no results of these IDs from the .gff file. I think the IDs used in the maker.non_overlapping_ab_initio.proteins.fasta are different from what is actually used in the .gff file. Please see my commands/results below.
> 
> I created the .gff file by this command:
> gff3_merge -d Rwill7_master_datastore_index.log
> 
> I created the .fasta files by this command:
> fasta_merge -d Rwill7_master_datastore_index.log
> 
> I ran InterProScan with this command:
> interproscan.sh -appl PfamA -iprlookup -goterms -f tsv -i Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta
> 
> When I try grepping the IDs in the Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv from Rwill7.all.gff, I get nothing. See an example for one ID from the .tsv file below:
> 
> $ more Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv
> 
> snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1	d146190e642a740520c97a782a74fe32	356	Pfam	PF13365	Trypsin-like peptidase domain	77	228	1.4E-17	T	20-03-2018
> 
> $ grep snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1 Rwill7.all.gff
> #no results
> 
> There is no "processed-gene" with this ID in the Rwill7.all.gff file:
> 
> $ grep snap_masked-LG12_ordered_scaffold_85-processed-gene Rwill7.all.gff
> 
> LG12_ordered_scaffold_85	maker	gene	63727	67053	.	+	.	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8;Name=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8
> LG12_ordered_scaffold_85	maker	mRNA	63727	67053	.	+	.	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8;Name=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1;_AED=0.81;_eAED=0.81;_QI=0|0|0|0|1|1|6|0|200
> LG12_ordered_scaffold_85	maker	exon	63727	63768	.	+	.	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42245;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
> LG12_ordered_scaffold_85	maker	exon	64269	64340	.	+	.	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42246;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
> LG12_ordered_scaffold_85	maker	exon	64896	65000	.	+	.	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42247;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
> LG12_ordered_scaffold_85	maker	exon	65268	65327	.	+	.	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42248;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
> LG12_ordered_scaffold_85	maker	exon	65716	65915	.	+	.	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42249;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
> LG12_ordered_scaffold_85	maker	exon	66930	67053	.	+	.	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42250;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
> LG12_ordered_scaffold_85	maker	CDS	63727	63768	.	+	0	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
> LG12_ordered_scaffold_85	maker	CDS	64269	64340	.	+	0	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
> LG12_ordered_scaffold_85	maker	CDS	64896	65000	.	+	0	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
> LG12_ordered_scaffold_85	maker	CDS	65268	65327	.	+	0	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
> LG12_ordered_scaffold_85	maker	CDS	65716	65915	.	+	0	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
> LG12_ordered_scaffold_85	maker	CDS	66930	67053	.	+	1	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
> 
> However, there are Names and Targets called "abinit-gene", not "processed-gene", that appear to have this gene number in the .gff file:
> 
> $ grep snap_masked-LG12_ordered_scaffold_85 Rwill7.all.gff
> 
> #some results from command above that might be snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1?
> 
> LG12_ordered_scaffold_85	snap_masked	match	101798	108141	35.366	+	ID=LG12_ordered_scaffold_85:hit:469677:4.5.0.0;Name=snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1
> LG12_ordered_scaffold_85	snap_masked	match_part	101798	102633	35.236	ID=LG12_ordered_scaffold_85:hsp:1369290:4.5.0.0;Parent=LG12_ordered_scaffold_85:hit:469677:4.5.0.0;Target=snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1 1 836 +;Gap=M836
> LG12_ordered_scaffold_85	snap_masked	match_part	107907	108141	0.130	ID=LG12_ordered_scaffold_85:hsp:1369291:4.5.0.0;Parent=LG12_ordered_scaffold_85:hit:469677:4.5.0.0;Target=snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1 837 1071 +;Gap=M235
> 
> So I looked at the maker.non_overlapping_ab_initio.proteins.fasta file to see if this "abinit-gene" from the .gff file was present:
> 
> $ grep snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1 Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta
> #no results using the "abinit-gene" Name from the .gff file
> 
> versus using the ID from the .tsv file to grep against the maker.non_overlapping_ab_initio.proteins.fasta file:
> 
> $ grep snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1 Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta
>> snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1 protein AED:1.00 eAED:1.00 QI:0|0|0|0|1|1|2|0|356
> 
> I think the issue is that the models named "abinit-gene" in the .gff file are named "processed-gene" in Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta, and that name is then propagated into the .tsv file. Is my interpretation correct?
> 
> If so, all I would have to do is replace "processed" with "abinit" in the IDs before grepping the .gff file, correct?
> 
> Thanks for your help.
> 
> -Valerie
> 
>> On Mar 23, 2018, at 10:20 AM, Carson Holt <carsonhh at gmail.com> wrote:
>> 
>> You will get the ID from the InterProScan report. Then you can take that ID and grep for it in the MAKER gff3.
>> 
>> All ab initio predictions have match/match_part features created for them in the gff3. You can then take those non-gene match/match_part features and provide them to pred_gff.
>> 
>> You then have two alternate ways to get those models into your dataset.
>> 
>> 1. Do a second run with only that pred_gff (i.e. turn off all other MAKER options and blank out all evidence including repeat masking options) and set keep_preds=1.
>> 
>> That will simply take the pred_gff match/match_part values and turn them into nicely formatted gene/mRNA/exon/CDS features together with associated fasta files. Those can then simply be merged into your current result using gff3_merge.
>> 
>> 2. Provide maker_gff (set pred_pass=0 and all other pass options to 1), provide pred_gff, and set keep_preds=1.
>> 
>> This is the same as the previous run option, but MAKER will do the merging for you. But it will take longer since it will use the maker_gff to rebuild all models and evidence in memory and rescore everything.
>> 
>> -Carson
>> 
>> 
>> 
>>> On Mar 20, 2018, at 6:48 PM, Valerie Soza <vsoza at uw.edu> wrote:
>>> 
>>> Hi MAKER community
>>> 
>>> I am trying to create a standard build as indicated in the Campbell et al. 2014 papers in Plant Physiology and Current Protocols in Bioinformatics. I was following the protocol as outlined in Current Protocols in Bioinformatics, but then came across this thread in the MAKER google forum: https://groups.google.com/forum/#!searchin/maker-devel/quality_filter%7Csort:date/maker-devel/97aNJkT3bgk/mpL7V5QWAAAJ.
>>> 
>>> I can't reply to this original thread, but I am trying to follow Carson's suggestion for a standard build using this protocol instead now:
>>> "One note I'd like to make, is that doing a second round with keep_preds=1 is the wrong procedure (only do that if you really want to keep everything - i.e. in some fungi or oomycetes). Rather you should use InterProScan to evaluate the rejected models in the non-overlapping.abinit.proteins.fasta file, then grep the ones that have an IPR domain out of the GFF3 (will be match/match_part features) and then pass them to pred_gff in a separate run (just updates the format to gene/mRNA/exon/CDS with proper reading frame). You can then merge the resulting GFF3's and fasta files."
>>> 
>>> Instead of doing a second round of annotations with keep_preds=1, I am using my original annotations with keep_preds=0. I have used InterProScan on the non-overlapping.abinit.proteins.fasta. I am unclear as to what gff3 file to use to grep for genes with IPR domains from the non-overlapping.abinit.proteins.fasta file. Genes from the non-overlapping.abinit.proteins.fasta file are not in my .all.gff file created by the gff3_merge script. 
>>> 
>>> What gff3 file should I be using to resurrect proteins with IPR domains from the non-overlapping.abinit.proteins.fasta? Should I be doing an annotation with keep_preds=1 as well, and resurrecting genes with IPR domains from this gff3?
>>> 
>>> Thanks.
>>> 
>>> -Valerie
>>> 
>>> Valerie Soza, Ph.D.
>>> c/o Hall Lab
>>> Department of Biology
>>> University of Washington
>>> Johnson Hall 202A
>>> Box 351800
>>> Seattle, WA 98195-1800
>>> 206-543-6740
>>> http://staff.washington.edu/vsoza/
>>> 
>>> 
>>> _______________________________________________
>>> maker-devel mailing list
>>> maker-devel at box290.bluehost.com
>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>> 
> 
> Valerie Soza, Ph.D.
> c/o Hall Lab
> Department of Biology
> University of Washington
> Johnson Hall 202A
> Box 351800
> Seattle, WA 98195-1800
> 206-543-6740
> http://staff.washington.edu/vsoza/
> 

Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/


From seoanezonjic at hotmail.com  Tue Mar  6 02:30:24 2018
From: seoanezonjic at hotmail.com (p sz)
Date: Tue, 6 Mar 2018 09:30:24 +0000
Subject: [maker-devel] Problems with failed contigs
Message-ID: <DB6PR0102MB27095AC27BB34FF5B0D7C1DED1D90@DB6PR0102MB2709.eurprd01.prod.exchangelabs.com>

Hi
I'm currently annotating a fish genome with MAKER 3 and the run does not finish properly. I run MAKER on a shared cluster through MPI, requesting 96 cores (so as not to block the MAKER manager or the file system) on 6 nodes of 16 cores each. The job runs for 3 or 4 days and then the output suddenly stops. The job stays alive for 3 more days, but the datastore index is not modified and the only files whose attributes change are the lock files. When the job exceeds the time limit (7 days) it is cancelled. I then inspected the datastore index and counted the unique STARTED and FINISHED tags for each scaffold. The results are:
STARTED:3890
FINISHED:3378
So the analysis fails for about 500 scaffolds. When I inspect MAKER's STDERR output I see this error line repeated ~2000 times, at different places:
substr outside of string at /mnt/home/users/pab_001_uma/pedro/software/maker/bin/../lib/PhatHit_utils.pm line 850
and near this line, the following:
ERROR: Failed while annotating transcripts
My questions are: what can I do to annotate the failed scaffolds? And is the large number of failed analyses the reason the job freezes?
Thanks in advance


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20180306/c3737e92/attachment-0003.html>

From vsoza at uw.edu  Wed Mar  7 14:19:15 2018
From: vsoza at uw.edu (Valerie Soza)
Date: Wed, 7 Mar 2018 13:19:15 -0800
Subject: [maker-devel] how to output masked genome from MAKER
Message-ID: <FC69DF53-A058-4B3E-A612-BE8FC4652857@uw.edu>

Hi MAKER community

I am wondering whether it is possible to get the entire masked genome generated by MAKER as an output. I think I have found it in pieces in the query.masked.fasta files generated for each scaffold by MAKER in theVoid directories. Is there a script that anyone has used to collate these files?

I am aware of the fasta_merge script that comes bundled with MAKER, but it does not seem to collate the query.masked.fasta files. Could this perl script be modified to do this action?

Thanks for any help or insights.

-Valerie

Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/


From flopezo84 at gmail.com  Fri Mar  9 09:15:39 2018
From: flopezo84 at gmail.com (=?UTF-8?Q?Federico_L=C3=B3pez?=)
Date: Fri, 9 Mar 2018 11:15:39 -0500
Subject: [maker-devel] Using PASA assemblies with MAKER
Message-ID: <CAEW5o_M2KdOKnkB0KpVhDSTvMTBoqdGiimPpVHBoUCFWhbU+JA@mail.gmail.com>

Hello,

I was wondering what might be the recommended option for using PASA2
alignment assemblies with MAKER3:

1. PASA assemblies in FASTA format (est)
2. PASA assembly structures (est_gff)
3. ORFs from PASA assemblies (protein)

And related to this question, when I use the PASA2 assembly structures in
GFF3 format, MAKER reports the error below.

"ERROR: Non-unique top level ID for..."

I suppose all the non-unique IDs need to be renamed for MAKER?
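
One way to see which IDs would need renaming, as a rough sketch. It assumes that "top level" means a GFF3 feature with no Parent attribute, which is a guess about how MAKER applies the check, and dup_top_ids is a made-up helper name.

```shell
#!/usr/bin/env bash
# Sketch: list ID values shared by more than one parentless GFF3 feature.
# Assumes tab-separated GFF3 with attributes in column 9; dup_top_ids is a
# hypothetical helper, not part of MAKER or PASA.
dup_top_ids() {
    awk -F '\t' '
        /^#/ || NF < 9 { next }                 # skip comments and short lines
        $9 !~ /(^|;)Parent=/ && match($9, /(^|;)ID=[^;]+/) {
            id = substr($9, RSTART, RLENGTH)
            sub(/^;?ID=/, "", id)               # keep only the ID value
            print id
        }
    ' "$1" | sort | uniq -d
}
```

Running it on the GFF3 passed to est_gff prints each repeated ID once; an empty result means no duplicates were found under this assumption.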

Any help is greatly appreciated.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20180309/6f987aaf/attachment-0003.html>

From wangzhennan at ioz.ac.cn  Tue Mar 13 21:53:44 2018
From: wangzhennan at ioz.ac.cn (wangzhennan at ioz.ac.cn)
Date: Wed, 14 Mar 2018 11:53:44 +0800 (GMT+08:00)
Subject: [maker-devel] Some transcripts have no AED?
Message-ID: <59a852e2.418b6.16222a46d59.Coremail.wangzhennan@ioz.ac.cn>

Hi,

   When I used MAKER to get AED values, some transcripts have no AED value in the result. Why? I ran MAKER with RNA-Seq data and protein evidence. Thank you.

   Best wishes.


   Wang

T
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20180314/9bd868f4/attachment-0003.html>

From d.ence at ufl.edu  Wed Mar 14 05:33:01 2018
From: d.ence at ufl.edu (Ence,daniel)
Date: Wed, 14 Mar 2018 11:33:01 +0000
Subject: [maker-devel] Some transcripts have no AED?
In-Reply-To: <59a852e2.418b6.16222a46d59.Coremail.wangzhennan@ioz.ac.cn>
References: <59a852e2.418b6.16222a46d59.Coremail.wangzhennan@ioz.ac.cn>
Message-ID: <00243144-1906-485F-B2CB-977FA2BBD161@ufl.edu>

Hi, can you send a few example lines? Do some transcripts have AEDs?

~Daniel

Sent from my iPhone

On Mar 13, 2018, at 23:54, wangzhennan at ioz.ac.cn wrote:


Hi,

   When I used MAKER to get AED values, some transcripts have no AED value in the result. Why? I ran MAKER with RNA-Seq data and protein evidence. Thank you.

   Best wishes.


   Wang

T

_______________________________________________
maker-devel mailing list
maker-devel at box290.bluehost.com
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20180314/74e08034/attachment-0003.html>

From seoanezonjic at hotmail.com  Wed Mar 14 07:52:12 2018
From: seoanezonjic at hotmail.com (p sz)
Date: Wed, 14 Mar 2018 13:52:12 +0000
Subject: [maker-devel] Problems with failed contigs
In-Reply-To: <DB6PR0102MB27095AC27BB34FF5B0D7C1DED1D90@DB6PR0102MB2709.eurprd01.prod.exchangelabs.com>
References: <DB6PR0102MB27095AC27BB34FF5B0D7C1DED1D90@DB6PR0102MB2709.eurprd01.prod.exchangelabs.com>
Message-ID: <DB6PR0102MB2709EA5CAB5E8F5B46FDA541D1D10@DB6PR0102MB2709.eurprd01.prod.exchangelabs.com>

Hi
I tried splitting the FASTA file into small chunks and running MAKER separately on each chunk. Most of the runs fail as I described previously. I inspected the results of a random chunk with 287 contigs, and these error lines are repeated 47 times:

substr outside of string at /mnt/home/users/pab_001_uma/pedro/software/maker/bin/../lib/PhatHit_utils.pm line 850.
--> rank=15, hostname=dx095
ERROR: Failed while annotating transcripts
ERROR: Chunk failed at level:1, tier_type:4
FAILED CONTIG:Sosen1_s1284
ERROR: Chunk failed at level:6, tier_type:0
FAILED CONTIG:Sosen1_s1284

Can you help me to fix this problem?
Thank you in advance
Pedro Seoane

________________________________
From: p sz <seoanezonjic at hotmail.com>
Sent: Tuesday, March 6, 2018 9:30
To: maker-devel at yandell-lab.org
Subject: Problems with failed contigs

Hi
I'm currently annotating a fish genome with MAKER 3 and the run does not finish properly. I run MAKER on a shared cluster through MPI, requesting 96 cores (so as not to block the MAKER manager or the file system) on 6 nodes of 16 cores each. The job runs for 3 or 4 days and then the output suddenly stops. The job stays alive for 3 more days, but the datastore index is not modified and the only files whose attributes change are the lock files. When the job exceeds the time limit (7 days) it is cancelled. I then inspected the datastore index and counted the unique STARTED and FINISHED tags for each scaffold. The results are:
STARTED:3890
FINISHED:3378
So the analysis fails for about 500 scaffolds. When I inspect MAKER's STDERR output I see this error line repeated ~2000 times, at different places:
substr outside of string at /mnt/home/users/pab_001_uma/pedro/software/maker/bin/../lib/PhatHit_utils.pm line 850
and near this line, the following:
ERROR: Failed while annotating transcripts
My questions are: what can I do to annotate the failed scaffolds? And is the large number of failed analyses the reason the job freezes?
Thanks in advance


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20180314/43255060/attachment-0003.html>

From vsoza at uw.edu  Wed Mar 14 18:21:26 2018
From: vsoza at uw.edu (Valerie Soza)
Date: Wed, 14 Mar 2018 17:21:26 -0700
Subject: [maker-devel] scaffolds missing from master_datastore_index.log and
 all.gff files
Message-ID: <6F2917B6-12D5-41D6-BD12-CDD7FDF0F5C8@uw.edu>

Hi MAKER community

I have done several rounds of training and annotations on an assembly that consists of 12027 scaffolds using the MAKER2 pipeline. I am running multiple instances of MAKER to speed up the process. I have noticed that the number of contigs (aka scaffolds) differs among the different rounds of annotations I have done in MAKER, ranging from 12024 to 12027 scaffolds. These counts were obtained with the SOBAcl tool to count the number of contigs from each all.gff file generated by the gff3_merge script included in MAKER.

In my latest round of annotations within MAKER, I have only obtained 12026 scaffolds using the SOBAcl tool on the all.gff file, indicating that I am missing 1 scaffold, even though there was no indication of any scaffolds as FAILED, RETRY, or SKIPPED.

To figure out what might be going on, I searched for STARTED and FINISHED scaffolds in the master_datastore_index.log and found that I had a different number of started vs. finished scaffolds, and none of these were equal to the total in the assembly of 12027.

$ grep STARTED Rwill7_master_datastore_index.log | sort | uniq | cut -f 1 | wc 
12024   12024  313247

3 started scaffolds missing from this file are LG01_ordered_scaffold_2, LG01_ordered_scaffold_3, and LG07_unordered_scaffold_86.

$ grep FINISHED Rwill7_master_datastore_index.log | sort | uniq | cut -f 1 | wc
12026   12026  313295

1 finished scaffold missing from this file is LG08_unordered_scaffold_90.
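
The comparison above can also be done directly, as a small sketch. It assumes the index log is tab-separated with the scaffold name in column 1 (as the cut -f 1 above implies) and the status in the last column; names_with_status and status_diff are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch: list scaffolds that carry one status in a MAKER datastore index log
# but never another. Assumes tab-separated lines, scaffold name in column 1
# and status (STARTED, FINISHED, ...) in the last column.
names_with_status() {
    # $1 = index log, $2 = status to select
    awk -F '\t' -v s="$2" '$NF == s { print $1 }' "$1" | sort -u
}

status_diff() {
    # $1 = index log; prints names that have status $2 but not status $3
    local a b
    a=$(mktemp)
    b=$(mktemp)
    names_with_status "$1" "$2" > "$a"
    names_with_status "$1" "$3" > "$b"
    comm -23 "$a" "$b"
    rm -f "$a" "$b"
}
```

For example, status_diff Rwill7_master_datastore_index.log STARTED FINISHED would list scaffolds that started but never finished, and swapping the last two arguments would list finished scaffolds with no STARTED record.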

I then searched for these scaffolds in the all.gff file and found that the 3 missing started scaffolds were present, but the one missing finished scaffold (LG08_unordered_scaffold_90) was not. This scaffold (LG08_unordered_scaffold_90) should be in the gff3 as it had some repeat masking done on it as indicated by the query.masked.fasta file for this scaffold in its theVoid directory. 

After looking at the gff3_merge and fasta_merge scripts, it seems that only finished scaffolds are used to generate gff3 and fasta files so this explains why I am missing one scaffold (LG08_unordered_scaffold_90) for a total of only 12026 scaffolds in the all.gff file.

I am concerned that, because the numbers of started and finished scaffolds differ in the master_datastore_index.log, not all scaffolds are being output to the gff3 and fasta files generated by the MAKER scripts.

Any insights as to why I am getting different numbers of scaffolds indicated as started versus finished, and why all but 1 scaffold finished?

Thanks.

-Valerie

Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/


From d.ence at ufl.edu  Thu Mar 15 07:15:00 2018
From: d.ence at ufl.edu (Ence,daniel)
Date: Thu, 15 Mar 2018 13:15:00 +0000
Subject: [maker-devel] Fwd:  Some transcripts have no AED?
References: <8E389D32-28DE-4FE7-A455-15786C1B2EAF@mail.ufl.edu>
Message-ID: <8AE0B9E6-2496-43AE-BC79-9FDDD2A0CFD3@mail.ufl.edu>


Begin forwarded message:

From: "Ence,daniel" <d.ence at ufl.edu>
Subject: Re: [maker-devel] Some transcripts have no AED?
Date: March 15, 2018 at 9:06:45 AM EDT
To: wangzhennan at ioz.ac.cn
Cc: "Ence,daniel" <d.ence at ufl.edu>

Hi, I really can't help you without more information to answer either of your questions (number of genes or the transcripts without AEDs). If you ran maker with some evidence to annotate those 23000 genes, then some of those genes were probably not supported by the evidence. Did you give those 23000 genes as predictions or as models? Those are two different options in maker and would give different results.

~Daniel


On Mar 15, 2018, at 4:11 AM, wangzhennan at ioz.ac.cn wrote:


Hi,

   I am sorry that I did not explain my question clearly. I ran maker with an annotation file (gff) that has 23000 genes, but I got a result with only 21553 genes. Why are there fewer genes than in the original file? Thank you very much!

   Best wishes.

   Wang


-----Original Messages-----
From: "Ence,daniel" <d.ence at ufl.edu>
Sent Time: 2018-03-14 19:33:01 (Wednesday)
To: wangzhennan at ioz.ac.cn
Cc: maker-devel at yandell-lab.org
Subject: Re: [maker-devel] Some transcripts have no AED?

Hi, can you send a few lines of examples? Do some transcripts have AEDs?

~Daniel

Sent from my iPhone

On Mar 13, 2018, at 23:54, wangzhennan at ioz.ac.cn wrote:


Hi,

   When I used maker to get the AED value, some transcripts had no AED values in the result. Why? I used RNA-Seq data and protein to run maker. Thank you.

   Best wishes.

   Wang


From carsonhh at gmail.com  Thu Mar 15 08:57:37 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 15 Mar 2018 08:57:37 -0600
Subject: [maker-devel] Problems with failed contigs
In-Reply-To: <DB6PR0102MB27095AC27BB34FF5B0D7C1DED1D90@DB6PR0102MB2709.eurprd01.prod.exchangelabs.com>
References: <DB6PR0102MB27095AC27BB34FF5B0D7C1DED1D90@DB6PR0102MB2709.eurprd01.prod.exchangelabs.com>
Message-ID: <2F06777F-4BED-4CC8-B4F6-0B78D3EA9405@gmail.com>

First make sure you are using the most current version (3.01.02).

If you are providing a GFF3 file, the problem may lie with one or more of the features in it. Finally, there is a possibility this is a BLAST report issue, as there is a BLAST truncation bug that appears to be fixed in NCBI BLAST 2.7.1+.

--Carson


From carsonhh at gmail.com  Thu Mar 15 09:15:09 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 15 Mar 2018 09:15:09 -0600
Subject: [maker-devel] Using PASA assemblies with MAKER
In-Reply-To: <CAEW5o_M2KdOKnkB0KpVhDSTvMTBoqdGiimPpVHBoUCFWhbU+JA@mail.gmail.com>
References: <CAEW5o_M2KdOKnkB0KpVhDSTvMTBoqdGiimPpVHBoUCFWhbU+JA@mail.gmail.com>
Message-ID: <8542B83E-159A-47CA-A801-364386E0E2D2@gmail.com>

MAKER requires the two-level match/match_part format, an example of which can be found in the GFF3 specification: https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md

I haven't used PASA, but the file you are providing may be in GFF2 or GTF format, which is not compatible with GFF3.
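
For the "Non-unique top level ID" error quoted below, one way to see which IDs might be affected before handing a file to MAKER is sketched here. This rests on an assumption about what the check amounts to (an ID shared by more than one feature line that has no Parent attribute); it is not MAKER's actual code, and the function name is illustrative. Note that GFF3 itself allows one feature to span several lines with the same ID, so a hit is a hint rather than proof of a problem.

```shell
# Print IDs that occur on more than one parentless (top-level) GFF3 line.
duplicate_top_level_ids() {
    # $1 = GFF3 file
    awk -F'\t' '
        /^#/ || NF < 9 { next }            # skip comments and malformed lines
        $9 !~ /(^|;)Parent=/ {              # top-level feature: no Parent attribute
            n = split($9, attrs, ";")
            for (i = 1; i <= n; i++)
                if (attrs[i] ~ /^ID=/) count[substr(attrs[i], 4)]++
        }
        END { for (id in count) if (count[id] > 1) print id }
    ' "$1" | sort
}
```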

--Carson


> On Mar 9, 2018, at 9:15 AM, Federico López <flopezo84 at gmail.com> wrote:
> 
> Hello,
> 
> I was wondering what might be the recommended option for using PASA2 alignment assemblies with MAKER3:
> 
> 1. PASA assemblies in FASTA format (est)
> 2. PASA assembly structures (est_gff)
> 3. ORFs from PASA assemblies (protein)
> 
> And related to this question, when I use the PASA2 assembly structures in GFF3 format, MAKER reports the error below.
> 
> "ERROR: Non-unique top level ID for..."
> 
> I suppose all the non-unique IDs need to be renamed for MAKER?
> 
> Any help is greatly appreciated.

From carsonhh at gmail.com  Thu Mar 15 09:20:08 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 15 Mar 2018 09:20:08 -0600
Subject: [maker-devel] Some transcripts have no AED?
In-Reply-To: <59a852e2.418b6.16222a46d59.Coremail.wangzhennan@ioz.ac.cn>
References: <59a852e2.418b6.16222a46d59.Coremail.wangzhennan@ioz.ac.cn>
Message-ID: <B9C2C216-B995-4478-91EE-3DBDF7A7F112@gmail.com>

Do they have no AED, or an AED value of 0? A value of 0 means a perfect match to the evidence.

Also make sure you are not looking for AED in the predictions (i.e. GFF3 lines whose source column is snap/augustus and whose type is match/match_part). Those are rejected models and will not have quality statistics calculated for them (they are only there as alignments for reference purposes).

Only features with a source column of 'maker' and a type of gene/mRNA/exon/CDS represent the gene models and will have AED and QI values.

You can force MAKER to also calculate quality statistics for rejected models by setting pred_stats to 1 in the maker_opts.ctl file:

pred_stats=1 #report AED and QI statistics for all predictions as well as models
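
To check whether any actual gene models lack an AED, the GFF3 can be filtered on the source and type columns described above. This is a rough sketch, assuming the AED is stored as an _AED attribute on the mRNA lines (as in MAKER's GFF3 output); the function name is hypothetical.

```shell
# Print the IDs of MAKER mRNA features that carry no _AED attribute.
maker_mrna_without_aed() {
    # $1 = GFF3 file (e.g. the output of gff3_merge)
    awk -F'\t' '
        $2 == "maker" && $3 == "mRNA" && $9 !~ /(^|;)_AED=/ {
            n = split($9, attrs, ";")
            for (i = 1; i <= n; i++)
                if (attrs[i] ~ /^ID=/) print substr(attrs[i], 4)
        }
    ' "$1"
}
```

If this prints nothing, the features without AED are presumably the match/match_part prediction lines rather than gene models.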

--Carson

> On Mar 13, 2018, at 9:53 PM, wangzhennan at ioz.ac.cn wrote:
> 
> Hi,
> 
>    When I used maker to get the AED value, some transcripts had no AED values in the result. Why? I used RNA-Seq data and protein to run maker. Thank you.
> 
>    Best wishes.
> 
>    Wang
> 

From carsonhh at gmail.com  Thu Mar 15 09:26:26 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 15 Mar 2018 09:26:26 -0600
Subject: [maker-devel] scaffolds missing from master_datastore_index.log
 and all.gff files
In-Reply-To: <6F2917B6-12D5-41D6-BD12-CDD7FDF0F5C8@uw.edu>
References: <6F2917B6-12D5-41D6-BD12-CDD7FDF0F5C8@uw.edu>
Message-ID: <A63D1D3B-F5D7-4F19-B5CD-B8D941978586@gmail.com>

If you run multiple MAKER jobs at the same time, you can hit a race condition where one MAKER run keeps another from making the correct entry in the datastore log.

You can delete the log and then run 'maker -dsindex' to rebuild it with a single maker process (takes less than 5 minutes).
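
The rebuild could be wrapped up roughly as follows. This is a sketch under assumptions: the function name and the choice to move the old log aside (instead of deleting it) are illustrative, the log path follows the usual <base>.maker.output/<base>_master_datastore_index.log layout, and the command is expected to be run from the directory holding the control files of the original run while no other MAKER job is writing to the datastore.

```shell
# Rebuild the datastore index for one MAKER run with a single process.
rebuild_index() {
    # $1 = MAKER base name (the prefix of the .maker.output directory)
    base=$1
    log="$base.maker.output/${base}_master_datastore_index.log"
    # Keep the old log as a backup so the two versions can be compared later.
    if [ -f "$log" ]; then
        mv "$log" "$log.bak"
    fi
    # -dsindex regenerates the index from the existing datastore.
    maker -base "$base" -dsindex
}
```

After that, rerunning gff3_merge with -d pointing at the rebuilt log should pick up every finished scaffold.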

--Carson




From carsonhh at gmail.com  Thu Mar 15 09:31:31 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 15 Mar 2018 09:31:31 -0600
Subject: [maker-devel] how to output masked genome from MAKER
In-Reply-To: <FC69DF53-A058-4B3E-A612-BE8FC4652857@uw.edu>
References: <FC69DF53-A058-4B3E-A612-BE8FC4652857@uw.edu>
Message-ID: <15BF15D1-835F-4E8E-9B52-A958EF19BCD9@gmail.com>

You will just have to find and concatenate the files yourself.

Something like:

find assembly.maker.output -name 'query.masked.fasta' | xargs cat > assembly_masked.fasta
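
A slightly more defensive variant of the same idea is sketched below. It lets find run cat itself, so paths containing spaces are handled, and it prints the number of sequences collected so the result can be compared with the number of scaffolds in the assembly. The function name is made up; whether every scaffold has exactly one query.masked.fasta is an assumption worth checking against that count.

```shell
# Concatenate every query.masked.fasta under a MAKER output directory.
collect_masked() {
    # $1 = MAKER output directory, $2 = combined FASTA file to write
    find "$1" -type f -name 'query.masked.fasta' -exec cat {} + > "$2"
    # Report how many FASTA records ended up in the combined file.
    grep -c '^>' "$2"
}
```

Usage would be along the lines of `collect_masked assembly.maker.output assembly_masked.fasta`.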

--Carson




From vsoza at uw.edu  Thu Mar 15 12:18:46 2018
From: vsoza at uw.edu (Valerie Soza)
Date: Thu, 15 Mar 2018 11:18:46 -0700
Subject: [maker-devel] scaffolds missing from master_datastore_index.log
 and all.gff files
In-Reply-To: <A63D1D3B-F5D7-4F19-B5CD-B8D941978586@gmail.com>
References: <6F2917B6-12D5-41D6-BD12-CDD7FDF0F5C8@uw.edu>
	<A63D1D3B-F5D7-4F19-B5CD-B8D941978586@gmail.com>
Message-ID: <53B85802-8AB9-4DB4-AA8A-137DD72AF4D7@uw.edu>

Thanks, Carson, that worked. I am now getting all scaffolds in the assembly indicated as finished in the master_datastore_index.log and when I redo gff3_merge with this updated log, my all.gff file is complete too. Glad there is a quick fix for this in MAKER. Thank you, MAKER developers.

-Valerie


Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/


From seoanezonjic at hotmail.com  Fri Mar 16 03:33:28 2018
From: seoanezonjic at hotmail.com (p sz)
Date: Fri, 16 Mar 2018 09:33:28 +0000
Subject: [maker-devel] Problems with failed contigs
In-Reply-To: <2F06777F-4BED-4CC8-B4F6-0B78D3EA9405@gmail.com>
References: <DB6PR0102MB27095AC27BB34FF5B0D7C1DED1D90@DB6PR0102MB2709.eurprd01.prod.exchangelabs.com>,
	<2F06777F-4BED-4CC8-B4F6-0B78D3EA9405@gmail.com>
Message-ID: <DB6PR0102MB270989C66E53B92567ED2226D1D00@DB6PR0102MB2709.eurprd01.prod.exchangelabs.com>

I checked the maker version and it is 3.01.02, as you suggest. I am currently running maker to train SNAP (first round), so I'm not using GFF files as input. My input files are all in FASTA format, set in the genome, est and protein variables. In fact, one error line says that the transcript analysis breaks the execution. I use BLAST version 2.2.30+; maybe I have to use a newer version. Which version do you suggest?
Thank you in advance

From vsoza at uw.edu  Tue Mar 20 18:48:09 2018
From: vsoza at uw.edu (Valerie Soza)
Date: Tue, 20 Mar 2018 17:48:09 -0700
Subject: [maker-devel] clarification on creating a standard build
Message-ID: <ADACE3F2-BD4C-43B9-B887-D25FC51F562B@uw.edu>

Hi MAKER community

I am trying to create a standard build as indicated in the Campbell et al. 2014 papers in Plant Physiology and Current Protocols in Bioinformatics. I was following the protocol as outlined in Current Protocols in Bioinformatics, but then came across this thread in the MAKER google forum: https://groups.google.com/forum/#!searchin/maker-devel/quality_filter%7Csort:date/maker-devel/97aNJkT3bgk/mpL7V5QWAAAJ.

I can't reply to this original thread, but I am now trying to follow Carson's suggestion for a standard build using this protocol instead:
"One note I'd like to make, is that doing a second round with keep_preds=1 is the wrong procedure (only do that if you really want to keep everything - i.e. in some fungi or oomycetes). Rather you should use InterProScan to evaluate the rejected models in the non-overlapping.abinit.proteins.fasta file, then grep the ones that have an IPR domain out of the GFF3 (will be match/match_part features) and then pass them to pred_gff in a separate run (just updates the format to gene/mRNA/exon/CDS with proper reading frame). You can then merge the resulting GFF3's and fasta files."

Instead of doing a second round of annotations with keep_preds=1, I am using my original annotations with keep_preds=0. I have used InterProScan on the non-overlapping.abinit.proteins.fasta. I am unclear as to what gff3 file to use to grep for genes with IPR domains from the non-overlapping.abinit.proteins.fasta file. Genes from the non-overlapping.abinit.proteins.fasta file are not in my .all.gff file created by the gff3_merge script. 

What gff3 file should I be using to resurrect proteins with IPR domains from the non-overlapping.abinit.proteins.fasta? Should I be doing an annotation with keep_preds=1 as well, and resurrecting genes with IPR domains from this gff3?
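
The first half of the quoted suggestion (finding which rejected models have an IPR domain) might be sketched as below. It assumes InterProScan was run with TSV output and InterPro lookup enabled, so that column 1 holds the protein name and column 12 the InterPro accession; the function name is hypothetical. It does not settle which GFF3 file the resulting names should be pulled from.

```shell
# List proteins with at least one InterPro (IPR) accession in InterProScan TSV output.
proteins_with_ipr() {
    # $1 = InterProScan .tsv file
    awk -F'\t' '$12 ~ /^IPR[0-9]+$/ { print $1 }' "$1" | sort -u
}
```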

Thanks.

-Valerie

Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/


From urmi208 at gmail.com  Wed Mar 21 03:05:42 2018
From: urmi208 at gmail.com (Urmi)
Date: Wed, 21 Mar 2018 09:05:42 +0000
Subject: [maker-devel] Gene loss in subsequent round of maker for fungal
 genome annotation
Message-ID: <CAGe_+EuV7r9wcw76B9STOFwzU3d5aifDP-Zp=KOZd7vZskZNfw@mail.gmail.com>

Hello maker community,

I am trying to run maker 3.01.02-beta on a fungal genome. I am using
available EST and protein sequences from a different strain of the same
species using parameters "est" and "protein" in the maker_opts.ctl file.
Here is the protocol I am using:

   1. Run maker with repeat masking, providing transcript and protein
   sequences from related species (run A)
   2. Create a SNAP model with CEGMA
   3. Train Augustus with BUSCO
   4. Run maker (run B) with the new SNAP model (from step 2) and the
   Augustus species model, with est2genome=0 and protein2genome=0, and
   with the run A alignments provided as GFF files
   (altest_gff=runA_cdna2genome.gff,
   protein_gff=runA_protein2genome.gff3)
   5. Create a SNAP model from run B
   6. Train Augustus with transcripts from run B and BUSCO
   7. Run maker (run C) with the new SNAP model (from step 5) and the
   Augustus species model, with est2genome=0 and protein2genome=0, the
   same GFF files (altest_gff=runA_cdna2genome.gff,
   protein_gff=runA_protein2genome.gff3), and keep_preds=1

As a result of this, I get the following gene numbers:

   - run A: 12796 total genes out of which 12771 have AED < 0.5
   - run B:10713 total genes out of which 10701 have AED < 0.5
   - run C: 12651 total genes out of which 12582 have AED < 0.5
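
Counts like these can be reproduced from each run's merged GFF3 with something along the following lines. This is a sketch, not the poster's method: it counts mRNA features from the 'maker' source (which equals the gene count only when each gene has a single transcript), assumes the AED is stored in the _AED attribute, and uses a made-up function name.

```shell
# Print "<total maker mRNAs><TAB><mRNAs with _AED below the cutoff>".
count_mrna_below_aed() {
    # $1 = GFF3 file, $2 = AED cutoff (defaults to 0.5)
    awk -F'\t' -v cutoff="${2:-0.5}" '
        $2 == "maker" && $3 == "mRNA" {
            total++
            # _AED= does not match inside _eAED=, so only the plain AED is read.
            if (match($9, /_AED=[0-9.]+/)) {
                aed = substr($9, RSTART + 5, RLENGTH - 5) + 0
                if (aed < cutoff) below++
            }
        }
        END { printf "%d\t%d\n", total, below }
    ' "$1"
}
```

Running it on the GFF3 of runs A, B and C in turn makes the three lines above directly comparable.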

Looking at the gff files in detail, I observed that some gene models
from run A are lost in run B and regained in run C. I don't understand
why genes are lost in run B. Here is an example:

Run A:

contig1 maker   gene    20468   21193   .       +       .       ID=maker-contig1-exonerate_protein2genome-gene-0.34;Name=maker-contig1-exonerate_protein2genome-gene-0.34
contig1 maker   mRNA    20468   21193   100     +       .       ID=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1;Parent=maker-contig1-exonerate_protein2genome-gene-0.34;Name=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1;_AED=0.30;_eAED=0.30;_QI=0|-1|0|1|-1|0|1|0|241
contig1 maker   exon    20468   21193   .       +       .       ID=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1:1;Parent=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1
contig1 maker   CDS     20468   21193   .       +       0       ID=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1:cds;Parent=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1
contig1 blastn  expressed_sequence_match        20468   21193   726     +       .       ID=contig1:hit:983:3.2.0.0;Name=jgi|test_1|140804|est target_length=726
contig1 blastn  match_part      20468   21193   726     +       .       ID=contig1:hsp:998:3.2.0.0;Parent=contig1:hit:983:3.2.0.0;Target=jgi|test_1|140804|est 1 726 +;Gap=M726
contig1 est2genome      expressed_sequence_match        20468   21193   3630    +       .       ID=contig1:hit:1022:3.2.0.0;Name=jgi|test_1|140804|est;target_length=726;aligned_coverage=100;aligned_identity=100
contig1 est2genome      match_part      20468   21193   3630    +       .       ID=contig1:hsp:1110:3.2.0.0;Parent=contig1:hit:1022:3.2.0.0;Target=jgi|test_1|140804|est 1 726 +;Gap=M726

Run B:

contig1 est_gff:est2genome      expressed_sequence_match        20468   21193   3630    +       .       ID=contig1:hit:1051:3.12.0.0;Name=jgi|test_1|140804|est;target_length=726;aligned_coverage=100;aligned_identity=100;aligned_coverage=100;aligned_identity=100;score=3630;target_length=726
contig1 est_gff:est2genome      match_part      20468   21193   3630    +       .       ID=contig1:hsp:1166:3.12.0.0;Parent=contig1:hit:1051:3.12.0.0;Target=jgi|test_1|140804|est 1 726 +;Gap=M726

Run C:

contig1 maker   gene    20468   21193   .       +       .       ID=snap_masked-contig1-processed-gene-0.5;Name=snap_masked-contig1-processed-gene-0.5
contig1 maker   mRNA    20468   21193   .       +       .       ID=snap_masked-contig1-processed-gene-0.5-mRNA-1;Parent=snap_masked-contig1-processed-gene-0.5;Name=snap_masked-contig1-processed-gene-0.5-mRNA-1;_AED=0.30;_eAED=0.30;_QI=0|-1|0|1|-1|1|1|0|241;_merge_warning=1
contig1 maker   exon    20468   21193   .       +       .       ID=snap_masked-contig1-processed-gene-0.5-mRNA-1:1;Parent=snap_masked-contig1-processed-gene-0.5-mRNA-1
contig1 maker   CDS     20468   21193   .       +       0       ID=snap_masked-contig1-processed-gene-0.5-mRNA-1:cds;Parent=snap_masked-contig1-processed-gene-0.5-mRNA-1
contig1 snap_masked     match   20468   21193   42.956  +       .       ID=contig1:hit:5240:4.5.0.0;Name=snap_masked-contig1-abinit-gene-0.5-mRNA-1;target_length=4075195
contig1 snap_masked     match_part      20468   21193   42.956  +       .       ID=contig1:hsp:12911:4.5.0.0;Parent=contig1:hit:5240:4.5.0.0;Target=snap_masked-contig1-abinit-gene-0.5-mRNA-1 1 726 +;Gap=M726
contig1 est_gff:est2genome      expressed_sequence_match        20468   21193   3630    +       .       ID=contig1:hit:1051:3.12.0.0;Name=jgi|test_1|140804|est;target_length=726;aligned_coverage=100;aligned_identity=100;aligned_coverage=100;aligned_identity=100;score=3630;target_length=726
contig1 est_gff:est2genome      match_part      20468   21193   3630    +       .       ID=contig1:hsp:1166:3.12.0.0;Parent=contig1:hit:1051:3.12.0.0;Target=jgi|test_1|140804|est 1 726 +;Gap=M726

Could anyone please shed some light on this?


Many thanks in advance.

Urmi

From urmi208 at gmail.com  Wed Mar 21 03:24:32 2018
From: urmi208 at gmail.com (Urmi)
Date: Wed, 21 Mar 2018 09:24:32 +0000
Subject: [maker-devel] Gene loss in subsequent round of maker for fungal
 genome annotation
In-Reply-To: <CAGe_+EuV7r9wcw76B9STOFwzU3d5aifDP-Zp=KOZd7vZskZNfw@mail.gmail.com>
References: <CAGe_+EuV7r9wcw76B9STOFwzU3d5aifDP-Zp=KOZd7vZskZNfw@mail.gmail.com>
Message-ID: <CAGe_+EsxV0KWrbYt01RMOzBDSvUBGMmaMu=c4t4ubfWPmsuEyQ@mail.gmail.com>

Further to this, I ran InterProScan on all three runs, and 100% of the
genes in each run have protein domains. I am unsure which run I should
consider the best annotation. I am sorry for so many questions, but I am
very new to MAKER.

Thanks again for any help you could provide.



-- 
"The only way of finding the limits of the possible is by going beyond them
into the impossible." - Arthur C. Clarke

From carsonhh at gmail.com  Fri Mar 23 11:20:22 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 23 Mar 2018 11:20:22 -0600
Subject: [maker-devel] clarification on creating a standard build
In-Reply-To: <ADACE3F2-BD4C-43B9-B887-D25FC51F562B@uw.edu>
References: <ADACE3F2-BD4C-43B9-B887-D25FC51F562B@uw.edu>
Message-ID: <4D9CD374-D142-4387-A4D7-1FB7A8A74F48@gmail.com>

You will get the ID from the InterProScan report. Then you can take that ID and grep for it in the MAKER GFF3.

All ab initio predictions have match/match_part features created for them in the GFF3. You can then take those non-gene match/match_part features and provide them to pred_gff.
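
As a rough illustration of the step above, the following Python sketch pulls the match/match_part lines for a chosen set of predictions out of GFF3 text. The function names are hypothetical, and selecting by the Name attribute of each match feature is an assumption made for this example, not something MAKER provides. It also assumes tab-separated, 9-column GFF3 lines.

```python
def parse_attrs(field):
    """Split a GFF3 attribute column (key=value;key=value) into a dict."""
    attrs = {}
    for part in field.strip().split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            attrs[key] = value
    return attrs


def select_pred_features(gff_lines, wanted_names):
    """Return match/match_part lines whose match Name is in wanted_names.

    Hypothetical helper: pass 1 collects the IDs of the wanted match
    features, pass 2 keeps those matches and their match_part children.
    """
    rows = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9:
            rows.append((line, cols[2], parse_attrs(cols[8])))

    # Pass 1: IDs of the match features we want to keep.
    hit_ids = {
        attrs["ID"]
        for _, ftype, attrs in rows
        if ftype == "match" and attrs.get("Name") in wanted_names and "ID" in attrs
    }

    # Pass 2: the matches themselves plus every match_part pointing at them.
    kept = []
    for line, ftype, attrs in rows:
        if ftype == "match" and attrs.get("ID") in hit_ids:
            kept.append(line)
        elif ftype == "match_part" and attrs.get("Parent") in hit_ids:
            kept.append(line)
    return kept
```

The returned lines could be written to a file and given to pred_gff; whether the names in the protein FASTA match the Name attributes exactly should be checked on real output first.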

You then have two alternate ways to get those models into your dataset.

1. Do a second run with only that pred_gff (i.e. turn off all other MAKER options and blank out all evidence including repeat masking options) and set keep_preds=1.

That will simply take the pred_gff match/match_part values and turn them into nicely formatted gene/mRNA/exon/CDS features together with associated fasta files. Those can then simply be merged into your current result using gff3_merge.

2. Provide maker_gff (set pred_pass=0 and all other pass options to 1), provide pred_gff, and set keep_preds=1.

This is the same as the previous option, except that MAKER does the merging for you. It will take longer, since it uses the maker_gff to rebuild all models and evidence in memory and rescore everything.

--Carson


> On Mar 20, 2018, at 6:48 PM, Valerie Soza <vsoza at uw.edu> wrote:
> 
> Hi MAKER community
> 
> I am trying to create a standard build as indicated in the Campbell et al. 2014 papers in Plant Physiology and Current Protocols in Bioinformatics. I was following the protocol as outlined in Current Protocols in Bioinformatics, but then came across this thread in the MAKER google forum: https://groups.google.com/forum/#!searchin/maker-devel/quality_filter%7Csort:date/maker-devel/97aNJkT3bgk/mpL7V5QWAAAJ.
> 
> I can't reply to this original thread, but I am trying to follow Carson's suggestion for a standard build using this protocol instead now:
> "One note I'd like to make, is that doing a second round with keep_preds=1 is the wrong procedure (only do that if you really want to keep everything - i.e. in some fungi or oomycetes). Rather you should use InterProScan to evaluate the rejected models in the non-overlapping.abinit.proteins.fasta file, then grep the ones that have an IPR domain out of the GFF3 (will be match/match_part features) and then pass them to pred_gff in a separate run (just updates the format to gene/mRNA/exon/CDS with proper reading frame). You can then merge the resulting GFF3's and fasta files."
> 
> Instead of doing a second round of annotations with keep_preds=1, I am using my original annotations with keep_preds=0. I have used InterProScan on the non-overlapping.abinit.proteins.fasta. I am unclear as to what gff3 file to use to grep for genes with IPR domains from the non-overlapping.abinit.proteins.fasta file. Genes from the non-overlapping.abinit.proteins.fasta file are not in my .all.gff file created by the gff3_merge script. 
> 
> What gff3 file should I be using to resurrect proteins with IPR domains from the non-overlapping.abinit.proteins.fasta? Should I be doing an annotation with keep_preds=1 as well, and resurrecting genes with IPR domains from this gff3?
> 
> Thanks.
> 
> -Valerie
> 
> Valerie Soza, Ph.D.
> c/o Hall Lab
> Department of Biology
> University of Washington
> Johnson Hall 202A
> Box 351800
> Seattle, WA 98195-1800
> 206-543-6740
> http://staff.washington.edu/vsoza/
> 
> 
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org


From carsonhh at gmail.com  Fri Mar 23 11:28:50 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 23 Mar 2018 11:28:50 -0600
Subject: [maker-devel] Gene loss in subsequent round of maker for fungal
 genome annotation
In-Reply-To: <CAGe_+EuV7r9wcw76B9STOFwzU3d5aifDP-Zp=KOZd7vZskZNfw@mail.gmail.com>
References: <CAGe_+EuV7r9wcw76B9STOFwzU3d5aifDP-Zp=KOZd7vZskZNfw@mail.gmail.com>
Message-ID: <7BA23B5D-01FE-4766-B7B8-756E0385453D@gmail.com>

Run A -> no gene prediction, just cut and paste of transcript/protein alignments to generate rough models.
Run B -> Gene predictions based on training using only highly conserved subset of genes (you will have low sensitivity)
Run C -> Gene predictions based on training using broader gene set. Higher sensitivity but potentially lower specificity (sensitivity gains should outweigh any specificity loss).
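
To compare the three runs numerically, one option is to tally the _AED values that MAKER writes on mRNA features. The Python sketch below is only an illustration: the function name aed_summary is hypothetical, it assumes tab-separated GFF3 lines, and it counts mRNA features rather than genes, so its totals can differ from gene counts when a gene has several transcripts.

```python
def aed_summary(gff_lines, cutoff=0.5):
    """Count maker mRNA features and how many have _AED below cutoff.

    Returns (total_with_aed, below_cutoff). Hypothetical helper for
    illustration; lines without 9 tab-separated columns are skipped.
    """
    total = 0
    below = 0
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[1] != "maker" or cols[2] != "mRNA":
            continue
        for part in cols[8].split(";"):
            # "_eAED=" does not start with "_AED=", so only _AED is read here.
            if part.startswith("_AED="):
                total += 1
                if float(part[len("_AED="):]) < cutoff:
                    below += 1
                break
    return total, below
```

Running this on the merged GFF3 of each run gives a quick per-run summary to set beside the browser inspection suggested below.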

Finally, make sure you look at models in a browser to see how well evidence and models overlap. If gene fusion is an issue (falsely merged mRNA-seq assembly results will generate hints that can cause gene predictors to fuse gene models), try deFusion -> https://wjidea.github.io/defusion/installation.html

--Carson


> On Mar 21, 2018, at 3:05 AM, Urmi <urmi208 at gmail.com> wrote:
> 
> Hello maker community,
> 
> I am trying to run maker 3.01.02-beta on a fungal genome. I am using available EST and protein sequences from a different strain of the same species using parameters "est" and "protein" in the maker_opts.ctl file. Here is the protocol I am using:
> 
> 1. Run maker with repeat masking and providing transcript and protein sequences from related species (Run A)
> 2. Create SNAP model with CEGMA
> 3. Train Augustus with BUSCO
> 4. Run (run B) with the new SNAP (done at step 2) and augustus species with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (altest_gff=runA_cdna2genome.gff, protein_gff=runA_protein2genome.gff3)
> 5. Create SNAP model from run B.
> 6. Train Augustus with transcripts from run B and BUSCO
> 7. Run (run C) with the new SNAP (done at step 5) and augustus species with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (altest_gff=runA_cdna2genome.gff, protein_gff=runA_protein2genome.gff3), keep_preds=1
> 
> As a result of this, I get the following gene numbers:
> 
> - run A: 12796 total genes out of which 12771 have AED < 0.5
> - run B: 10713 total genes out of which 10701 have AED < 0.5
> - run C: 12651 total genes out of which 12582 have AED < 0.5
> 
> Looking at the gff files in detail, I observed that some gene models in run A are lost in run B and regained in run C. I don't understand why there is gene loss for run B. Here is an example:
> 
> RunA
> 
> contig1 maker   gene    20468   21193   .       +       .       ID=maker-contig1-exonerate_protein2genome-gene-0.34;Name=maker-contig1-exonerate_protein2genome-gene-0.34
> contig1 maker   mRNA    20468   21193   100     +       .       ID=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1;Parent=maker-contig1-exonerate_protein2genome-gene-0.34;Name=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1;_AED=0.30;_eAED=0.30;_QI=0|-1|0|1|-1|0|1|0|241
> contig1 maker   exon    20468   21193   .       +       .       ID=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1:1;Parent=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1
> contig1 maker   CDS     20468   21193   .       +       0       ID=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1:cds;Parent=maker-contig1-exonerate_protein2genome-gene-0.34-mRNA-1
> contig1 blastn  expressed_sequence_match        20468   21193   726     +       .       ID=contig1:hit:983:3.2.0.0;Name=jgi|test_1|140804|est target_length=726
> contig1 blastn  match_part      20468   21193   726     +       .       ID=contig1:hsp:998:3.2.0.0;Parent=contig1:hit:983:3.2.0.0;Target=jgi|test_1|140804|est 1 726 +;Gap=M726
> contig1 est2genome      expressed_sequence_match        20468   21193   3630    +       .       ID=contig1:hit:1022:3.2.0.0;Name=jgi|test_1|140804|est;target_length=726;aligned_coverage=100;aligned_identity=100
> contig1 est2genome      match_part      20468   21193   3630    +       .       ID=contig1:hsp:1110:3.2.0.0;Parent=contig1:hit:1022:3.2.0.0;Target=jgi|test_1|140804|est 1 726 +;Gap=M726
> 
> RunB:
> contig1 est_gff:est2genome      expressed_sequence_match        20468   21193   3630    +       .       ID=contig1:hit:1051:3.12.0.0;Name=jgi|test_1|140804|est;target_length=726;aligned_coverage=100;aligned_identity=100;aligned_coverage=100;aligned_identity=100;score=3630;target_length=726
> contig1 est_gff:est2genome      match_part      20468   21193   3630    +       .       ID=contig1:hsp:1166:3.12.0.0;Parent=contig1:hit:1051:3.12.0.0;Target=jgi|test_1|140804|est 1 726 +;Gap=M726
> 
> RunC: 
> contig1 maker   gene    20468   21193   .       +       .       ID=snap_masked-contig1-processed-gene-0.5;Name=snap_masked-contig1-processed-gene-0.5
> contig1 maker   mRNA    20468   21193   .       +       .       ID=snap_masked-contig1-processed-gene-0.5-mRNA-1;Parent=snap_masked-contig1-processed-gene-0.5;Name=snap_masked-contig1-processed-gene-0.5-mRNA-1;_AED=0.30;_eAED=0.30;_QI=0|-1|0|1|-1|1|1|0|241;_merge_warning=1
> contig1 maker   exon    20468   21193   .       +       .       ID=snap_masked-contig1-processed-gene-0.5-mRNA-1:1;Parent=snap_masked-contig1-processed-gene-0.5-mRNA-1
> contig1 maker   CDS     20468   21193   .       +       0       ID=snap_masked-contig1-processed-gene-0.5-mRNA-1:cds;Parent=snap_masked-contig1-processed-gene-0.5-mRNA-1
> contig1 snap_masked     match   20468   21193   42.956  +       .       ID=contig1:hit:5240:4.5.0.0;Name=snap_masked-contig1-abinit-gene-0.5-mRNA-1;target_length=4075195
> contig1 snap_masked     match_part      20468   21193   42.956  +       .       ID=contig1:hsp:12911:4.5.0.0;Parent=contig1:hit:5240:4.5.0.0;Target=snap_masked-contig1-abinit-gene-0.5-mRNA-1 1 726 +;Gap=M726
> contig1 est_gff:est2genome      expressed_sequence_match        20468   21193   3630    +       .       ID=contig1:hit:1051:3.12.0.0;Name=jgi|test_1|140804|est;target_length=726;aligned_coverage=100;aligned_identity=100;aligned_coverage=100;aligned_identity=100;score=3630;target_length=726
> contig1 est_gff:est2genome      match_part      20468   21193   3630    +       .       ID=contig1:hsp:1166:3.12.0.0;Parent=contig1:hit:1051:3.12.0.0;Target=jgi|test_1|140804|est 1 726 +;Gap=M726
> 
> Please could anyone shed some light on this?
> 
> 
> Many thanks in advance.
> 
> Urmi
> 


From urmi208 at gmail.com  Mon Mar 26 01:28:21 2018
From: urmi208 at gmail.com (Urmi)
Date: Mon, 26 Mar 2018 08:28:21 +0100
Subject: [maker-devel] Gene loss in subsequent round of maker for fungal
 genome annotation
In-Reply-To: <7BA23B5D-01FE-4766-B7B8-756E0385453D@gmail.com>
References: <CAGe_+EuV7r9wcw76B9STOFwzU3d5aifDP-Zp=KOZd7vZskZNfw@mail.gmail.com>
	<7BA23B5D-01FE-4766-B7B8-756E0385453D@gmail.com>
Message-ID: <CAGe_+EuGU0P4OdHR2cxvNSAKQN24FvW3-9YEFv70uNvDZYxVmQ@mail.gmail.com>

That's great! Thanks for the tips Carson.

Urmi


From vsoza at uw.edu  Mon Mar 26 12:49:24 2018
From: vsoza at uw.edu (Valerie Soza)
Date: Mon, 26 Mar 2018 11:49:24 -0700
Subject: [maker-devel] clarification on creating a standard build
In-Reply-To: <4D9CD374-D142-4387-A4D7-1FB7A8A74F48@gmail.com>
References: <ADACE3F2-BD4C-43B9-B887-D25FC51F562B@uw.edu>
	<4D9CD374-D142-4387-A4D7-1FB7A8A74F48@gmail.com>
Message-ID: <57B30565-1603-4723-AF74-FEB54F735899@uw.edu>

Hi Carson

Thanks for the clarification on the steps, but I am having trouble with the first step: grepping the IDs from the InterProScan .tsv file in my .gff file returns no results. I think the IDs used in maker.non_overlapping_ab_initio.proteins.fasta differ from those used in the .gff file. Please see my commands and results below.

I created the .gff file by this command:
gff3_merge -d Rwill7_master_datastore_index.log

I created the .fasta files by this command:
fasta_merge -d Rwill7_master_datastore_index.log

I ran InterProScan with this command:
interproscan.sh -appl PfamA -iprlookup -goterms -f tsv -i Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta

When I try grepping the IDs in the Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv from Rwill7.all.gff, I get nothing. See an example for one ID from the .tsv file below:
 
$ more Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv

snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1	d146190e642a740520c97a782a74fe32	356	Pfam	PF13365	Trypsin-like peptidase domain	77	228	1.4E-17	T	20-03-2018

$ grep snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1 Rwill7.all.gff
#no results

There is no "processed-gene" with this ID in the Rwill7.all.gff file:

$ grep snap_masked-LG12_ordered_scaffold_85-processed-gene Rwill7.all.gff

LG12_ordered_scaffold_85	maker	gene	63727	67053	.	+	.	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8;Name=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8
LG12_ordered_scaffold_85	maker	mRNA	63727	67053	.	+	.	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8;Name=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1;_AED=0.81;_eAED=0.81;_QI=0|0|0|0|1|1|6|0|200
LG12_ordered_scaffold_85	maker	exon	63727	63768	.	+	.	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42245;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
LG12_ordered_scaffold_85	maker	exon	64269	64340	.	+	.	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42246;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
LG12_ordered_scaffold_85	maker	exon	64896	65000	.	+	.	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42247;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
LG12_ordered_scaffold_85	maker	exon	65268	65327	.	+	.	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42248;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
LG12_ordered_scaffold_85	maker	exon	65716	65915	.	+	.	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42249;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
LG12_ordered_scaffold_85	maker	exon	66930	67053	.	+	.	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42250;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
LG12_ordered_scaffold_85	maker	CDS	63727	63768	.	+	0	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
LG12_ordered_scaffold_85	maker	CDS	64269	64340	.	+	0	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
LG12_ordered_scaffold_85	maker	CDS	64896	65000	.	+	0	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
LG12_ordered_scaffold_85	maker	CDS	65268	65327	.	+	0	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
LG12_ordered_scaffold_85	maker	CDS	65716	65915	.	+	0	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
LG12_ordered_scaffold_85	maker	CDS	66930	67053	.	+	1	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1

However, there are Names and Targets called "abinit-gene", not "processed-gene", that appear to have this gene number in the .gff file:

$ grep snap_masked-LG12_ordered_scaffold_85 Rwill7.all.gff

#some results from command above that might be snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1?

LG12_ordered_scaffold_85	snap_masked	match	101798	108141	35.366	+	ID=LG12_ordered_scaffold_85:hit:469677:4.5.0.0;Name=snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1
LG12_ordered_scaffold_85	snap_masked	match_part	101798	102633	35.236	ID=LG12_ordered_scaffold_85:hsp:1369290:4.5.0.0;Parent=LG12_ordered_scaffold_85:hit:469677:4.5.0.0;Target=snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1 1 836 +;Gap=M836
LG12_ordered_scaffold_85	snap_masked	match_part	107907	108141	0.130	ID=LG12_ordered_scaffold_85:hsp:1369291:4.5.0.0;Parent=LG12_ordered_scaffold_85:hit:469677:4.5.0.0;Target=snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1 837 1071 +;Gap=M235

So I looked at the maker.non_overlapping_ab_initio.proteins.fasta file to see if this "abinit-gene" from the .gff file was present:

$ grep snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1 Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta
#no results using the "abinit-gene" Name from the .gff file

versus using the ID from the .tsv file to grep against the maker.non_overlapping_ab_initio.proteins.fasta file:

$ grep snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1 Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta
>snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1 protein AED:1.00 eAED:1.00 QI:0|0|0|0|1|1|2|0|356

I think the issue is that the predictions named "abinit-gene" in the .gff file are named "processed-gene" in Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta, and that name is then propagated into the .tsv file. Is my interpretation correct?

If so, all I would have to do is grep the .gff file with "processed" replaced by "abinit", correct?
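
If that reading is right (it is not confirmed at this point in the thread), the renaming could be sketched in Python as below. The function name is hypothetical, and the substitution of "-processed-gene-" with "-abinit-gene-" is the assumption being tested here, not documented MAKER behaviour. The input is assumed to be tab-separated InterProScan TSV lines with the protein ID in the first column.

```python
def abinit_names_from_tsv(tsv_lines):
    """Map protein IDs in an InterProScan TSV to presumed GFF3 match Names.

    Hypothetical helper: takes column 1 of each non-empty row and swaps
    "-processed-gene-" for "-abinit-gene-" (an assumption, see above).
    Duplicate rows for the same protein collapse into one name.
    """
    names = set()
    for line in tsv_lines:
        if not line.strip():
            continue
        protein_id = line.split("\t", 1)[0].strip()
        names.add(protein_id.replace("-processed-gene-", "-abinit-gene-"))
    return names
```

The resulting set can then be searched for in the Name attribute of the match features in the .gff file, for example by grepping each name or by filtering the file in a script.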

Thanks for your help.

-Valerie

> On Mar 23, 2018, at 10:20 AM, Carson Holt <carsonhh at gmail.com> wrote:
> 
> You will get the ID from the InterProScan report. Then you can take that ID and grep for it in the MAKER GFF3.
> 
> All ab initio predictions have match/match_part features created for them in the GFF3. You can then take those non-gene match/match_part features and provide them to pred_gff.
> 
> You then have two alternate ways to get those models into your dataset.
> 
> 1. Do a second run with only that pred_gff (i.e. turn off all other MAKER options and blank out all evidence including repeat masking options) and set keep_preds=1.
> 
> That will simply take the pred_gff match/match_part values and turn them into nicely formatted gene/mRNA/exon/CDS features together with associated fasta files. Those can then simply be merged into your current result using gff3_merge.
> 
> 2. Provide maker_gff (set pred_pass=0 and all other pass options to 1), provide pred_gff, and set keep_preds=1.
> 
> This is the same as the previous run option, but MAKER will do the merging for you. But it will take longer since it will use the maker_gff to rebuild all models and evidence in memory and rescore everything.
> 
> -Carson
> 
> 
> 
>> On Mar 20, 2018, at 6:48 PM, Valerie Soza <vsoza at uw.edu> wrote:
>> 
>> Hi MAKER community
>> 
>> I am trying to create a standard build as indicated in the Campbell et al. 2014 papers in Plant Physiology and Current Protocols in Bioinformatics. I was following the protocol as outlined in Current Protocols in Bioinformatics, but then came across this thread in the MAKER google forum: https://groups.google.com/forum/#!searchin/maker-devel/quality_filter%7Csort:date/maker-devel/97aNJkT3bgk/mpL7V5QWAAAJ.
>> 
>> I can't reply to this original thread, but I am trying to follow Carson's suggestion for a standard build using this protocol instead now:
>> "One note I'd like to make, is that doing a second round with keep_preds=1 is the wrong procedure (only do that if you really want to keep everything - i.e. in some fungi or oomycetes). Rather you should use InterProScan to evaluate the rejected models in the non-overlapping.abinit.proteins.fasta file, then grep the ones that have an IPR domain out of the GFF3 (will be match/match_part features) and then pass them to pred_gff in a separate run (just updates the format to gene/mRNA/exon/CDS with proper reading frame). You can then merge the resulting GFF3's and fasta files."
>> 
>> Instead of doing a second round of annotations with keep_preds=1, I am using my original annotations with keep_preds=0. I have used InterProScan on the non-overlapping.abinit.proteins.fasta. I am unclear as to what gff3 file to use to grep for genes with IPR domains from the non-overlapping.abinit.proteins.fasta file. Genes from the non-overlapping.abinit.proteins.fasta file are not in my .all.gff file created by the gff3_merge script. 
>> 
>> What gff3 file should I be using to resurrect proteins with IPR domains from the non-overlapping.abinit.proteins.fasta? Should I be doing an annotation with keep_preds=1 as well, and resurrecting genes with IPR domains from this gff3?
>> 
>> Thanks.
>> 
>> -Valerie
>> 
>> Valerie Soza, Ph.D.
>> c/o Hall Lab
>> Department of Biology
>> University of Washington
>> Johnson Hall 202A
>> Box 351800
>> Seattle, WA 98195-1800
>> 206-543-6740
>> http://staff.washington.edu/vsoza/
>> 
>> 
>> _______________________________________________
>> maker-devel mailing list
>> maker-devel at box290.bluehost.com
>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
> 

Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/


From vsoza at uw.edu  Tue Mar 27 10:50:38 2018
From: vsoza at uw.edu (Valerie Soza)
Date: Tue, 27 Mar 2018 09:50:38 -0700
Subject: [maker-devel] how to output masked genome from MAKER
In-Reply-To: <15BF15D1-835F-4E8E-9B52-A958EF19BCD9@gmail.com>
References: <FC69DF53-A058-4B3E-A612-BE8FC4652857@uw.edu>
	<15BF15D1-835F-4E8E-9B52-A958EF19BCD9@gmail.com>
Message-ID: <7264C880-3806-443D-9CC2-E70D4366CD8A@uw.edu>

Hi Carson

Thanks, that is simple and it worked.

I did the following to sort and concatenate the query.masked.fasta files into one fasta:

$ find Rwill7.maker.output -name 'query.masked.fasta' | sort -t "/" -k 5 | xargs cat > Rwill7.maker.assembly_masked.sorted.fasta
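
For anyone reusing this, here is a minimal sketch of the same find/sort/cat idea wrapped in a function, with a quick record count as a sanity check. The names merge_masked and count_records are hypothetical helpers, not scripts shipped with MAKER, and the sketch assumes the paths under the output directory contain no whitespace. It sorts on the full path rather than on one path field.

```shell
#!/bin/sh
# Sketch only: merge_masked and count_records are hypothetical helper names.

# Concatenate every per-scaffold query.masked.fasta under a MAKER output
# directory into a single FASTA file.
merge_masked() {
    outdir=$1    # e.g. Rwill7.maker.output
    merged=$2    # e.g. Rwill7.maker.assembly_masked.sorted.fasta
    # Sorting the paths keeps the record order reproducible between runs.
    # Assumes paths contain no spaces or newlines.
    find "$outdir" -type f -name 'query.masked.fasta' \
        | LC_ALL=C sort \
        | xargs cat > "$merged"
}

# Count FASTA records (header lines starting with '>') in a file.
count_records() {
    grep -c '^>' "$1"
}
```

Running count_records on the merged file and on the original assembly should give the same number if every scaffold was written out; write the merged file outside the MAKER output directory so that a later run of find does not pick it up.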

-Valerie

> On Mar 15, 2018, at 8:31 AM, Carson Holt <carsonhh at gmail.com> wrote:
> 
> You will just have to find and concatenate the files yourself.
> 
> Something like: find assembly.maker.output -name 'query.masked.fasta' | xargs cat > assembly_masked.fasta
> 
> -Carson
> 
> 
>> On Mar 7, 2018, at 2:19 PM, Valerie Soza <vsoza at uw.edu> wrote:
>> 
>> Hi MAKER community
>> 
>> I am wondering whether it is possible to get the entire masked genome generated by MAKER as an output. I think I have found it in pieces in the query.masked.fasta files generated for each scaffold by MAKER in theVoid directories. Is there a script that anyone has used to collate these files?
>> 
>> I am aware of the fasta_merge script that comes bundled with MAKER, but it does not seem to collate the query.masked.fasta files. Could this perl script be modified to do this action?
>> 
>> Thanks for any help or insights.
>> 
>> -Valerie
>> 
> 

Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/


From vsoza at uw.edu  Thu Mar 29 12:42:28 2018
From: vsoza at uw.edu (Valerie Soza)
Date: Thu, 29 Mar 2018 11:42:28 -0700
Subject: [maker-devel] clarification on creating a standard build
In-Reply-To: <57B30565-1603-4723-AF74-FEB54F735899@uw.edu>
References: <ADACE3F2-BD4C-43B9-B887-D25FC51F562B@uw.edu>
	<4D9CD374-D142-4387-A4D7-1FB7A8A74F48@gmail.com>
	<57B30565-1603-4723-AF74-FEB54F735899@uw.edu>
Message-ID: <624D4D7C-2BB3-4A1D-9088-116807492D2E@uw.edu>

Hi MAKER community,

I was having issues grepping IDs from the InterProScan .tsv file against my all.gff file because the abinit genes in the all.gff file are called processed genes in the .all.maker.non_overlapping_ab_initio.proteins.fasta file, which gets propagated into the .tsv file.

I solved this issue by replacing "processed" with "abinit" in the .tsv file and then grepping the all.gff file with these IDs to create pred_gff. 

sed 's/-processed-/-abinit-/g' Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv > Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed

Then I extracted only the IDs from the .tsv file to grep against the all.gff file.

cut -f 1 Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed > Rwill7.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs

I was getting non-unique top-level ID errors when I tried to run maker with pred_gff, and I realized that IDs were duplicated in the .tsv file (a protein with more than one hit gets more than one row). So I went back to my list of IDs and extracted only the unique IDs to grep.

sort Rwill7.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs | uniq > Rwill7.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs  

Then I redid my grepping and maker ran beautifully. I tried both the options recommended by Carson, but then wound up going with option #2 cause I am lazy and now I have a standard build :)
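
In case it helps someone else, the steps above can be sketched as one small function. This is an illustration under assumptions, not the exact commands I ran: build_pred_gff is a hypothetical helper name (not a MAKER script), and the grep and awk filtering is my own guess at a safe way to pull only the match/match_part lines for the listed IDs.

```shell
#!/bin/sh
# Sketch only: build_pred_gff is a hypothetical helper, not part of MAKER.

# Build a pred_gff file from an InterProScan TSV and a merged MAKER GFF3.
build_pred_gff() {
    tsv=$1    # InterProScan .tsv run on the non_overlapping_ab_initio proteins
    gff=$2    # merged GFF3 from gff3_merge
    out=$3    # file to hand to pred_gff
    ids=$(mktemp)
    # Column 1 of the TSV is the protein ID. Rename "processed" to "abinit"
    # so it matches the GFF3 names, and keep each ID only once.
    cut -f 1 "$tsv" | sed 's/-processed-/-abinit-/' | sort -u > "$ids"
    # -F: treat IDs as fixed strings; -w: whole-word match, so an ID ending
    # in mRNA-1 does not also pull in mRNA-10. Then keep only the
    # match/match_part features (column 3 of the GFF3).
    grep -F -w -f "$ids" "$gff" \
        | awk -F '\t' '$3 == "match" || $3 == "match_part"' > "$out"
    rm -f "$ids"
}
```

Usage would look like: build_pred_gff Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv Rwill7.all.gff pred.gff (pred.gff is just an example file name). For option #2, the resulting file goes into pred_gff, with maker_gff pointing at the merged GFF3, pred_pass=0, the other pass options set to 1, and keep_preds=1, as Carson described.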

-Valerie


> On Mar 26, 2018, at 11:49 AM, Valerie Soza <vsoza at uw.edu> wrote:
> 
> Hi Carson
> 
> Thanks for the clarification on steps, but I am having issues with the first step of grepping the ID from the InterProScan .tsv file in my .gff file. I am getting no results of these IDs from the .gff file. I think the IDs used in the maker.non_overlapping_ab_initio.proteins.fasta are different from what is actually used in the .gff file. Please see my commands/results below.
> 
> I created the .gff file by this command:
> gff3_merge -d Rwill7_master_datastore_index.log
> 
> I created the .fasta files by this command:
> fasta_merge -d Rwill7_master_datastore_index.log
> 
> I ran InterProScan with this command:
> interproscan.sh -appl PfamA -iprlookup -goterms -f tsv -i Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta
> 
> When I try grepping the IDs in the Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv from Rwill7.all.gff, I get nothing. See an example for one ID from the .tsv file below:
> 
> $ more Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv
> 
> snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1	d146190e642a740520c97a782a74fe32	356	Pfam	PF13365	Trypsin-like peptidase domain	77	228	1.4E-17	T	20-03-2018
> 
> $ grep snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1 Rwill7.all.gff
> #no results
> 
> There is no "processed-gene" with this ID in the Rwill7.all.gff file:
> 
> $ grep snap_masked-LG12_ordered_scaffold_85-processed-gene Rwill7.all.gff
> 
> LG12_ordered_scaffold_85	maker	gene	63727	67053	.	+	.	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8;Name=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8
> LG12_ordered_scaffold_85	maker	mRNA	63727	67053	.	+	.	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8;Name=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1;_AED=0.81;_eAED=0.81;_QI=0|0|0|0|1|1|6|0|200
> LG12_ordered_scaffold_85	maker	exon	63727	63768	.	+	.	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42245;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
> LG12_ordered_scaffold_85	maker	exon	64269	64340	.	+	.	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42246;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
> LG12_ordered_scaffold_85	maker	exon	64896	65000	.	+	.	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42247;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
> LG12_ordered_scaffold_85	maker	exon	65268	65327	.	+	.	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42248;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
> LG12_ordered_scaffold_85	maker	exon	65716	65915	.	+	.	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42249;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
> LG12_ordered_scaffold_85	maker	exon	66930	67053	.	+	.	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:exon:42250;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
> LG12_ordered_scaffold_85	maker	CDS	63727	63768	.	+	0	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
> LG12_ordered_scaffold_85	maker	CDS	64269	64340	.	+	0	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
> LG12_ordered_scaffold_85	maker	CDS	64896	65000	.	+	0	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
> LG12_ordered_scaffold_85	maker	CDS	65268	65327	.	+	0	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
> LG12_ordered_scaffold_85	maker	CDS	65716	65915	.	+	0	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
> LG12_ordered_scaffold_85	maker	CDS	66930	67053	.	+	1	ID=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1:cds;Parent=snap_masked-LG12_ordered_scaffold_85-processed-gene-0.8-mRNA-1
> 
> However, there are Names and Targets called "abinit-gene", not "processed-gene", that appear to have this gene number in the .gff file:
> 
> $ grep snap_masked-LG12_ordered_scaffold_85 Rwill7.all.gff
> 
> #some results from command above that might be snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1?
> 
> LG12_ordered_scaffold_85	snap_masked	match	101798	108141	35.366	+	ID=LG12_ordered_scaffold_85:hit:469677:4.5.0.0;Name=snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1
> LG12_ordered_scaffold_85	snap_masked	match_part	101798	102633	35.236	ID=LG12_ordered_scaffold_85:hsp:1369290:4.5.0.0;Parent=LG12_ordered_scaffold_85:hit:469677:4.5.0.0;Target=snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1 1 836 +;Gap=M836
> LG12_ordered_scaffold_85	snap_masked	match_part	107907	108141	0.130	ID=LG12_ordered_scaffold_85:hsp:1369291:4.5.0.0;Parent=LG12_ordered_scaffold_85:hit:469677:4.5.0.0;Target=snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1 837 1071 +;Gap=M235
> 
> So I looked at the maker.non_overlapping_ab_initio.proteins.fasta file to see if this "abinit-gene" from the .gff file was present:
> 
> $ grep snap_masked-LG12_ordered_scaffold_85-abinit-gene-0.12-mRNA-1 Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta
> #no results using the "abinit-gene" Name from the .gff file
> 
> versus using the ID from the .tsv file to grep against the maker.non_overlapping_ab_initio.proteins.fasta file:
> 
> $ grep snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1 Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta
> >snap_masked-LG12_ordered_scaffold_85-processed-gene-0.12-mRNA-1 protein AED:1.00 eAED:1.00 QI:0|0|0|0|1|1|2|0|356
> 
> I think the issue is that the Rwill7.all.maker.non_overlapping_ab_initio.proteins.fasta is calling these "abinit-genes" in the .gff file as "processed-genes" in the .fasta file, which is then propagated into the .tsv file. Is my interpretation correct?
> 
> If so, all I would have to do is grep the .gff file replacing "processed" with "abinit", correct?
> 
> Thanks for your help.
> 
> -Valerie
> 

Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/