From carsonhh at gmail.com  Tue Sep  4 17:51:07 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 4 Sep 2018 16:51:07 -0600
Subject: [maker-devel] Re-annotation of a previous annotation with
 "est2genome=1"
In-Reply-To: <CADVjsYLphckgPoDnFKd_OMh3+sNOdAAJGpu1Za9RMmXLPqsQQw@mail.gmail.com>
References: <CADVjsYLphckgPoDnFKd_OMh3+sNOdAAJGpu1Za9RMmXLPqsQQw@mail.gmail.com>
Message-ID: <C223A2A8-A837-405B-ADF2-BB55C66B2315@gmail.com>


> Is it possible to correct this mistake without starting from the begin? I start a new run with "est2genome=0" and using the previous gff output in several options, but it seems like it will take forever to finish.

If you run in the same directory as a previous run, it will reuse archived raw reports from blast, etc.


> Also, would it be necessary some filtering/edition in the "all.gff file" when put it in the options like "est_gff" and "rm_gff"?

You can try that, but you do lose some extra info that is in the raw alignment report and not in the GFF3. So it's usually better to let MAKER do the alignment from fasta and only use GFF3 passthrough for datasets that you no longer have access to.
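
As a rough sketch of the setup described above, the relevant lines of maker_opts.ctl for the rerun might look like the fragment below. The file name transcripts.fasta is a placeholder (an assumption, not taken from the original run); the option names are the ones found in the standard control file. The idea is to keep the FASTA evidence in place, switch est2genome off, and rerun in the same directory so the archived reports can be reused.

```
est=transcripts.fasta  # placeholder name: the same EST/transcript FASTA used in the first run
est_gff=               # left empty: GFF3 passthrough only for data you no longer have as FASTA
rm_gff=                # left empty for the same reason
est2genome=0           # do not infer gene models directly from EST alignments
```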

-Carson
 

From carson.holt at genetics.utah.edu  Tue Sep 11 11:18:00 2018
From: carson.holt at genetics.utah.edu (Carson Hinton Holt)
Date: Tue, 11 Sep 2018 16:18:00 +0000
Subject: [maker-devel] Plant and Animal Genome Conference 2019
Message-ID: <BE2CC0AF-5D29-4094-AF0F-12F96F3BEBA1@genetics..utah.edu>

Hello MAKER e-mail list,

I just wanted to let you know I am organizing the "Next Generation Genome Annotation and Analysis" workshop at PAG in San Diego (Jan 12-16). If you are interested in presenting an annotation-related tool or annotation project at this workshop, contact me directly with your presentation proposal. Projects do not need to be MAKER related; rather, we like presenters to share their experience with genome annotation. This provides practical examples of annotation that can help other researchers who may be preparing for their own annotation projects and are looking for advice as well as tools.

Thanks,
Carson Holt


From anthony.bretaudeau at inria.fr  Tue Sep 25 07:10:13 2018
From: anthony.bretaudeau at inria.fr (Anthony Bretaudeau)
Date: Tue, 25 Sep 2018 14:10:13 +0200
Subject: [maker-devel] Segfault with OpenMPI
Message-ID: <a6972262-9f1e-c13a-1e15-a23be898ded6@inria.fr>

An HTML attachment was scrubbed...
URL: <http://box290.bluehost.com/pipermail/maker-devel_yandell-lab.org/attachments/20180925/8e03bcb5/attachment.html>

From dandence at gmail.com  Fri Sep 28 12:21:35 2018
From: dandence at gmail.com (Daniel Ence)
Date: Fri, 28 Sep 2018 13:21:35 -0400
Subject: [maker-devel] NCBI now accepts GFF
Message-ID: <BD07F734-1645-4EF6-8E0B-4A9B7C34D4BB@gmail.com>

Hi all, NCBI now accepts genome annotations in GFF format: https://www.ncbi.nlm.nih.gov/genbank/genomes_gff/  No more converting to NCBI table format!

~Daniel Ence

From liorglic at mail.tau.ac.il  Sun Sep 30 13:27:20 2018
From: liorglic at mail.tau.ac.il (Lior Glick)
Date: Sun, 30 Sep 2018 21:27:20 +0300
Subject: [maker-devel] Help debugging a MAKER result
Message-ID: <CAOzMDPyAXBYxnT_h7Hki2OZwSR=zOZJ2gc73EQPn26QNKh_Ceg@mail.gmail.com>

Hi MAKER users,
I am new to Maker and have just finished running my first annotations.
Although the results make sense in general, I have reasons to suspect some
gene models are wrong and would like your help in understanding and
optimizing the results.
My research project involves the annotation of multiple tomato varieties
(individuals) which are a bit different from the published reference
genome. To this end, I created de-novo assemblies of these genomes and also
generated an evidence set to be used as input for Maker. The evidence consists
of a large set of transcripts from various tomato varieties and conditions,
as well as full protein sets from 6 plant species, including the proteins
derived from the annotation of the reference - called ITAG.
For an initial QA, I tried annotating the reference genome using my
evidence data and Augustus as gene predictor. This should allow me to
compare my result to the ITAG annotation, which I assume to be the
"correct" answer, and see how well I'm doing. I should mention that the ITAG
annotation was also created using Maker, followed by manual curation.
I started by comparing the protein sets from my result and the ITAG set.
Specifically, I ran an all-vs-all blast and took the top hits. I discovered
that only about 70% of the ITAG proteins are covered by a protein from my
result with a high quality alignment (evalue < 10e-5, coverage > 90%). I
further investigated by running BUSCO on both protein sets and looking at
BUSCOs found in ITAG but missing in my result. Attached is a screenshot
from a genome browser where you can see such a case. The top track is the ITAG
gene model, and below it is my result. The third track is the protein evidence
alignments (i.e. blastx and protein2genome features), and the bottom track
shows masked repeats.
As you can see, there seem to be two issues with my result:
1. The two genes in ITAG were fused into one. I guess this is a difficult
case as the genes are really close together.
2. The last (3') CDS of the ITAG gene was predicted to be the 3' UTR in my
result. This is in fact the reason I ended up with a truncated protein and
a missing BUSCO.
This is a bit surprising to me, since there seems to be quite a lot of
protein evidence supporting this region as a CDS. Can you help me figure
out why this is the result? Could it be due to the small repeats detected in
this region?
Any ideas on how my result can be improved without manual curation?
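
For reference, the protein-set comparison described above can be sketched roughly as follows. This is only an illustration, not the exact script used: it assumes BLAST was run with a custom tabular format (-outfmt "6 qseqid sseqid qlen qstart qend evalue bitscore"), measures coverage from a single HSP rather than merging HSPs, and the function name best_hits is made up for the example.

```python
import csv
import io


def best_hits(tsv_text, max_evalue=10e-5, min_coverage=0.9):
    """Return {query_id: (subject_id, evalue, coverage, bitscore)}.

    Assumed input columns (tab-separated):
    qseqid, sseqid, qlen, qstart, qend, evalue, bitscore.
    Only hits with evalue < max_evalue and query coverage > min_coverage
    are kept; for each query the kept hit with the highest bitscore wins.
    """
    best = {}
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        qseqid, sseqid = row[0], row[1]
        qlen, qstart, qend = int(row[2]), int(row[3]), int(row[4])
        evalue, bitscore = float(row[5]), float(row[6])
        # Fraction of the query covered by this single HSP.
        coverage = (qend - qstart + 1) / qlen
        if evalue >= max_evalue or coverage <= min_coverage:
            continue
        if qseqid not in best or bitscore > best[qseqid][3]:
            best[qseqid] = (sseqid, evalue, coverage, bitscore)
    return best
```

Dividing len(best_hits(text)) by the total number of query proteins gives the fraction of the reference set that is covered by a high quality alignment.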

Many thanks!
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker.png
Type: image/png
Size: 30422 bytes
Desc: not available
URL: <http://box290.bluehost.com/pipermail/maker-devel_yandell-lab.org/attachments/20180930/cc5639ab/attachment.png>
