[maker-devel] About split genes in MAKER annotation

Prashant Narendra SHINGATE prashantns at imcb.a-star.edu.sg
Tue Dec 18 00:23:39 MST 2018


Hi Carson,

Thank you for the reply.

Please find attached the GFF file of Scaffold61 along with the related ctl files. The coordinates of the gene I am referring to are 698,453 to 840,581 bp. Several proteins, both short and long, are aligned to this locus, including a 1200 aa protein and a 2,069 aa protein (XP_022239675.1). The latter is aligned completely to Scaffold61 from 698,453 to 840,581 bp with >90% identity. Please see the line for this alignment from the GFF file.

SCAffold61      blastx  protein_match   698453  840581  730     -       .       ID=SCAffold61:hit:1861:3.10.0.0;Name=XP_022239675.1

However, in the evidence-based run, this gene is split into two fragments. Please see the related lines from the GFF file as follows:

SCAffold61      maker   gene    708052  717805  .       -       .       ID=maker-SCAffold61-exonerate_est2genome-gene-0.95;Name=maker-SCAffold61-exonerate_est2genome-gene-0.95
SCAffold61      maker   gene    748651  770415  .       -       .       ID=maker-SCAffold61-exonerate_est2genome-gene-0.96;Name=maker-SCAffold61-exonerate_est2genome-gene-0.96
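As a quick sanity check on the three GFF lines quoted above, the following is a minimal Python sketch (not part of MAKER) that parses them and compares the spans. The helper names `parse_gff_line`, `span_length`, and `contains` are hypothetical, and the parser assumes whitespace-separated columns as they appear in this email rather than strict tab-delimited GFF3.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    seqid: str
    source: str
    type: str
    start: int
    end: int
    strand: str
    attrs: dict


def parse_gff_line(line):
    """Parse one GFF feature line (hypothetical helper, whitespace-delimited)."""
    cols = line.split(None, 8)
    attrs = dict(kv.split("=", 1) for kv in cols[8].strip().split(";") if kv)
    return Feature(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]), cols[6], attrs)


def span_length(f):
    """Length of a feature in bp (GFF coordinates are 1-based, inclusive)."""
    return f.end - f.start + 1


def contains(outer, inner):
    """True if inner lies entirely within outer on the same sequence and strand."""
    return (outer.seqid == inner.seqid and outer.strand == inner.strand
            and outer.start <= inner.start and inner.end <= outer.end)


# The lines quoted in this message.
protein = parse_gff_line(
    "SCAffold61 blastx protein_match 698453 840581 730 - . "
    "ID=SCAffold61:hit:1861:3.10.0.0;Name=XP_022239675.1")
genes = [
    parse_gff_line(
        "SCAffold61 maker gene 708052 717805 . - . "
        "ID=maker-SCAffold61-exonerate_est2genome-gene-0.95;"
        "Name=maker-SCAffold61-exonerate_est2genome-gene-0.95"),
    parse_gff_line(
        "SCAffold61 maker gene 748651 770415 . - . "
        "ID=maker-SCAffold61-exonerate_est2genome-gene-0.96;"
        "Name=maker-SCAffold61-exonerate_est2genome-gene-0.96"),
]

# Gap between the two gene models, in bp.
gap = genes[1].start - genes[0].end - 1

print(protein.attrs["Name"], span_length(protein))
for g in genes:
    print(g.attrs["ID"], span_length(g), contains(protein, g))
print("gap between models:", gap)
```

Both gene models lie inside the blastx hit on the same strand, and together they cover only a small part of it.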

It looks like the Exonerate prediction is based only on ESTs, which are fragmented, and the full-length protein aligned to this locus is completely ignored. We have seen this kind of priority for ESTs at other loci as well, resulting in split gene predictions (sometimes 3 to 4 fragments) in spite of longer full-length proteins aligning to the assembly. Our ESTs (Trinity-assembled RNA-seq transcripts) were generated from the same individual whose genome was sequenced (and hence the identity is close to 100%). If we align only proteins, Exonerate still splits the gene based on shorter proteins aligned to the locus. I would really appreciate it if you could help us solve this splitting of genes despite the alignment of full-length proteins to the assembly.

I will be glad to send all reference protein and transcript sequences used for annotation, if required.

Thanks for your time and help.

Best regards,

Prashant Shingate, PhD<mailto:prashantns at imcb.a-star.edu.sg> :: Research Fellow :: Comparative and Medical Genomics Lab :: Institute of Molecular and Cell Biology (IMCB) :: Agency for Science, Technology and Research (A*STAR)
61 Biopolis Drive :: #05-04 Proteos :: Singapore 138673 :: DID (+65) 6586 9570 :: Fax (+65) 6779 1117:: http://www.imcb.a-star.edu.sg/
We advance science and develop innovative technology to further economic growth and improve lives.


From: Carson Holt [mailto:carsonhh at gmail.com]
Sent: Tuesday, 18 December, 2018 1:38 AM
To: Prashant Narendra SHINGATE
Cc: Byrappa VENKATESH; maker-devel at yandell-lab.org
Subject: Re: About split genes in MAKER annotation

It’s best to look at these in a browser like Apollo where you can also manipulate the intron/exon structure. What you will often find is that there is something that breaks the ORF or breaks splicing, so the predictors can’t build an end-to-end model even with the hints given. If you have a GFF3 just for the contig, I can also look at it in a browser to help point out the logic that led to the model.
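For readers who need to produce a GFF3 for a single contig from a larger file, here is a minimal Python sketch under stated assumptions: the input is tab-delimited GFF3, and the function name `filter_gff3_by_seqid` is hypothetical. It is a generic filter, not a MAKER utility, and it drops any embedded `##FASTA` section.

```python
def filter_gff3_by_seqid(lines, seqid):
    """Yield the version pragma plus every feature line whose first column is seqid.

    Assumes tab-delimited GFF3. Other comment/pragma lines are skipped, and
    reading stops at an embedded ##FASTA section.
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # sequence data follows; no more features
        if line.startswith("#"):
            if line.startswith("##gff-version"):
                yield line
            continue
        if not line.strip():
            continue
        if line.split("\t", 1)[0] == seqid:
            yield line


# Small in-memory example; the feature lines here are illustrative only.
example = [
    "##gff-version 3\n",
    "SCAffold61\tmaker\tgene\t708052\t717805\t.\t-\t.\tID=a\n",
    "SCAffold7\tmaker\tgene\t100\t900\t.\t+\t.\tID=b\n",
    "##FASTA\n",
    ">SCAffold61\n",
]
subset = list(filter_gff3_by_seqid(example, "SCAffold61"))
print("\n".join(subset))
```

With a real file, the same function can be fed an open file handle and the output written to a new file for loading into a browser.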

—Carson

On Dec 12, 2018, at 3:29 AM, Prashant Narendra SHINGATE <prashantns at imcb.a-star.edu.sg> wrote:

Hi Carson,

I am Prashant, a bioinformatics postdoctoral fellow from Prof B Venkatesh’s lab, IMCB, A*STAR. I am using MAKER to annotate an invertebrate genome (~2 Gb). During the annotation process, we found several instances of split genes even though we have full-length reference protein sequences from very closely related species. Hence we decided to look at one of the loci to understand the reason behind it and to optimize the parameters.

We looked at a gene that is ~110 kb long and codes for a ~1200 amino acid protein. We have a highly identical reference protein (>90% identity and 100% coverage) from another species. In addition, we also have a high-coverage Trinity transcript assembly from our species. Still, this gene is split into 4 fragments during the evidence-based MAKER run. On a closer look, we found that the above-mentioned closely related protein is not aligned by exonerate (protein2genome) even though it is the closest protein to this gene in our dataset. It looks like the program is giving more weight to transcripts, which are typically fragments of the gene. So we are at a loss as to how to predict this gene in full.

For your reference, I am enclosing the maker_opts.ctl and maker_bopts.ctl files. I will be glad to share the scaffold sequence and other input files if required.

Can you please help me understand why MAKER is not able to use the full-length reference protein for gene prediction, and how we can overcome this problem?

Thanks for your time and help.

Best regards,

Prashant Shingate, PhD<mailto:prashantns at imcb.a-star.edu.sg> :: Research Fellow :: Comparative and Medical Genomics Lab :: Institute of Molecular and Cell Biology (IMCB) :: Agency for Science, Technology and Research (A*STAR)
61 Biopolis Drive :: #05-04 Proteos :: Singapore 138673 :: DID (+65) 6586 9570 :: Fax (+65) 6779 1117 :: http://www.imcb.a-star.edu.sg/
We advance science and develop innovative technology to further economic growth and improve lives.




Note: This message may contain confidential information. If this Email/Fax has been sent to you by mistake, please notify the sender and delete it immediately. Thank you.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_bopts.ctl
Type: application/octet-stream
Size: 1416 bytes
Desc: maker_bopts.ctl
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20181218/04488fba/attachment-0004.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.ctl
Type: application/octet-stream
Size: 5150 bytes
Desc: maker_opts.ctl
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20181218/04488fba/attachment-0005.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: SCAffold61.zip
Type: application/x-zip-compressed
Size: 439697 bytes
Desc: SCAffold61.zip
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20181218/04488fba/attachment-0002.bin>

