[maker-devel] Maker protein match & tandem similar genes

Tue Aug 22 10:38:47 MDT 2017

Thanks Carson, I appreciate your insights.  It has been interesting to learn about the whole genome annotation process.  It makes me realize that it is really not a solved area, but I’m glad that Maker exists and is easy enough to use for someone who isn’t an expert in it.  Is there somewhere I could put your response on the Maker documentation wiki?

As I was mentioning earlier in the thread, the ab-initio predictor (Augustus) was making subtle errors (the splice donor site being ~12 nt downstream of the supported position), despite being trained (I trained through BUSCO, for ease) and having an aligned transcript “hint” with the correct structure.  I believe the Maker configuration was correct.  Compared with troubleshooting the Augustus training, which seems a bit complicated, or doing manual curation / fixing of the gene models (which seems to be a band-aid over my potentially misconfigured Augustus training), going with a purely est2genome=1 approach seems to be a nice way to do it.  Better, in my opinion, to have known unknowns (obvious errors, fragmented genes that are supported by transcript evidence) than unknown unknowns (subtle errors in exon-exon junctions from Augustus).
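
One way to make this kind of subtle junction error visible is to compare the intron boundaries of a predicted model against those of the aligned transcript. The following is a minimal sketch only, assuming plus-strand models given as 1-based inclusive (start, end) exon tuples; the function names and the example coordinates are hypothetical and are not taken from Maker or Augustus output.

```python
def introns(exons):
    """Return (donor, acceptor) intron boundaries between consecutive exons.

    Exons are 1-based inclusive (start, end) tuples on the plus strand,
    so the donor is the first intron base and the acceptor is the last.
    """
    exons = sorted(exons)
    return [(a_end + 1, b_start - 1)
            for (_, a_end), (b_start, _) in zip(exons, exons[1:])]


def donor_offsets(predicted_exons, evidence_exons):
    """Pair introns that share an acceptor and report the donor shift.

    Returns a list of (acceptor, predicted_donor - evidence_donor).
    A positive shift means the predicted donor lies downstream.
    """
    evidence = {acc: don for don, acc in introns(evidence_exons)}
    return [(acc, don - evidence[acc])
            for don, acc in introns(predicted_exons)
            if acc in evidence]


# Hypothetical coordinates: the first predicted exon runs 12 nt too long.
transcript = [(100, 200), (300, 400), (500, 600)]
prediction = [(100, 212), (300, 400), (500, 600)]

for acceptor, shift in donor_offsets(prediction, transcript):
    in_frame = "in frame" if shift % 3 == 0 else "frameshift"
    print(acceptor, shift, in_frame)
```

A shift that is a multiple of three, such as 12 nt, leaves the reading frame intact, so the model still looks canonical even though it adds residues the transcript does not support.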

A quick question: Could you confirm / deny that Maker doesn’t annotate non-coding RNA genes?  E.g. I’ve picked up some rRNAs and ncRNAs in my de novo transcriptome, but my understanding is that the est2genome and ab-initio approaches require that an ORF be present, hence no non-coding RNA genes (beyond the tRNAs and whatnot that can be specifically included).

All the best,
-Tim 

> On Aug 22, 2017, at 12:15 PM, Carson Holt <carsonhh at gmail.com> wrote:
> 
> The est2genome option takes an alignment and then just identifies the longest ORF in that alignment and turns it into a model (these models are good enough to train with). There are several reasons est2genome is not recommended for the annotation step.
> 
> 1. They are likely to be partial in many cases or include a number of merged assemblies depending on the organism.
> 2. They will likely only represent a fraction of the genes so relying only on est2genome models will result in low sensitivity since not everything is expected to be expressed or assembled.
> 3. Ab initio predictors that receive the intron/exon hints from a transcript evidence alignment should still replicate the structure of correctly assembled transcripts anyways, but they can still call genes in other regions without transcript evidence to improve sensitivity or improve structure for models with only partial transcript evidence alignment (i.e. they can complete the incomplete models).
> 4. If there are assembly errors (you can expect a lot of these in a draft assembly), ab initio predictors can work around the errors to create an intron/exon structure with a reasonably similar ORF that will be close to the true ORF if not exactly correct, whereas alignment-based methods cannot and will just produce a truncated ORF.
> 5. est2genome models will always have great AED scores (falsely good scores) since they are their own evidence and match themselves exactly, so spurious alignments and partial alignments always score very high even when they are bad models. By turning est2genome off, you allow the HMM scoring mechanism to act as an additional filter on those models.
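
Point 5 can be illustrated with the usual nucleotide-level definition of annotation edit distance, AED = 1 - (SN + SP) / 2, where SN is the fraction of evidence bases covered by the model and SP is the fraction of model bases covered by evidence. This is a sketch under that assumption only; Maker's own AED calculation pools evidence and may differ in detail, and the coordinates below are hypothetical.

```python
def covered(intervals):
    """Set of bases covered by 1-based inclusive (start, end) intervals."""
    bases = set()
    for start, end in intervals:
        bases.update(range(start, end + 1))
    return bases


def aed(model_exons, evidence_exons):
    """Nucleotide-level AED: 0.0 is perfect agreement, 1.0 is no overlap."""
    model, evidence = covered(model_exons), covered(evidence_exons)
    if not model or not evidence:
        return 1.0
    overlap = len(model & evidence)
    sensitivity = overlap / len(evidence)   # evidence bases explained
    specificity = overlap / len(model)      # model bases supported
    return 1.0 - (sensitivity + specificity) / 2.0


# A partial alignment turned directly into a model (hypothetical numbers).
partial_model = [(100, 200), (300, 350)]
# What independent, full-length evidence for the same gene might look like.
full_evidence = [(100, 200), (300, 400), (500, 600)]

print(aed(partial_model, partial_model))  # scored against itself
print(aed(partial_model, full_evidence))  # scored against fuller evidence
```

Scored against itself the partial model gets an AED of 0.0, which is the falsely good score described above; scored against fuller independent evidence the same model is penalized.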
> 
> If you have really really good transcript evidence, it is possible for est2genome to work very well. But the likelihood of having evidence that perfect is low. So for those reasons we recommend using it only as an intermediate step.
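
The longest-ORF step described at the top of this reply can be sketched as follows. This is an illustration only: it scans the three forward frames of an already spliced transcript for the longest ATG-to-stop ORF. Whether Maker also considers partial ORFs, the reverse strand, or other tie-breaking rules is not stated in this thread, so none of that is modeled here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return (start, end) of the longest ATG..stop ORF, or None.

    Coordinates are 0-based, half-open, on the forward strand, and the
    stop codon is included in the returned span.
    """
    seq = seq.upper()
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                end = i + 3                    # close it at the stop
                if best is None or end - start > best[1] - best[0]:
                    best = (start, end)
                start = None
    return best


# Hypothetical transcript with a short ORF (9 nt) and a longer one (18 nt).
transcript = "CCATGAAATAGGGATGCCCGGGAAATTTTGA"
print(longest_orf(transcript))
```

Under this reading, a chimeric transcript carrying several complete ORFs would yield a model for only the longest one; the shorter ORFs would not become separate genes.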
> 
> —Carson
> 
> 
> 
>> On Aug 19, 2017, at 10:38 AM, Tim Fallon <tfallon at mit.edu> wrote:
>> 
>> Hi Carson,
>> 
>> Just a follow up to this, for posterity.  I was able to do what I wanted by using just the est2genome=1, and turning off protein2genome.  The input to the est2genome is a Trinity de novo transcriptome assembly with strand specific libraries + assembly and jaccard clip.  The results seem quite reliable, and I’m not getting the problem where tandem similar genes were getting fused anymore (the original problem with this inquiry). I expect this is due to there being enough nucleotide differences in the est2genome alignment of two similar and tandem transcripts to effectively distinguish them.
>> 
>> In any event, it wasn’t clear to me that est2genome=1 alone would produce ORF/CDS predictions (for the genes), and I’ve done a lot of reading around the Maker documentation and papers.  Might be worth considering making the documentation clearer in this respect in the future.  I know that est2genome & protein2genome were originally intended more as an intermediate step for ab-initio gene predictor training, but in my opinion, with the quality and cost-effectiveness of transcript discovery by RNA-Seq, it seems reasonable to ditch the ab-initio gene prediction and go entirely with an “est2genome=1”-like approach.  It might be worthwhile to document what your thought process would be for reliable ab-initio-free gene annotation w/ Maker.  I’ll mention I haven’t looked into the PASA pipeline for this, which is the only other major publicly available gene structure annotation pipeline known to me, as the parallelization in Maker has been working quite well for me.
>> 
>> Are the heuristics for this ORF prediction in est2genome=1 documented anywhere?  E.g., does it only pick the longest ORF per transcript?  Or if there are multiple “good” ORFs (>200 amino acids) per transcript, will it try and split those into different genes?  I ask as my current task is trying to merge the previously mentioned de novo transcriptome derived gene models from est2genome with est2genome gene models of a reference guided transcriptome assembly.  Although the reference guided transcript assembly captures more genes than the Trinity assembly (by tblastn), the transcripts are notably artifactually chimeric, sometimes containing 4-5 CDSs, so the heuristics for the Maker est2genome could be pretty influential.
>> 
>> All the best,
>> -Tim
>> 
>>> On Jul 13, 2017, at 1:05 PM, Carson Holt <carsonhh at gmail.com> wrote:
>>> 
>>> est2genome and protein2genome take BLAST hits, polish them with exonerate around splice sites and then turn the alignment directly into a gene model. So if the alignment is partial because the EST or mRNA-seq do not cross the entire transcript or the protein homology does not cross the entire CDS, then the resulting model will be partial. It can be end to end, but partial tends to be more common than not unless you are using a protein evidence library with limited divergence.
>>> 
>>> —Carson
>>> 
>>> 
>>> 
>>>> On Jul 10, 2017, at 2:00 PM, Tim Fallon <tfallon at mit.edu> wrote:
>>>> 
>>>> Hi Carson,
>>>> 
>>>> So far what I've noticed with just est2genome, and protein2genome, using only de novo assembled transcripts with transdecoder predicted peptides  (both mapped in maker with blast evalue limit = 1e-50), the gene models (for the genes where I have enough information about the "correct" gene structure), have been full length.  Is this unexpected?
>>>> 
>>>> Will try Apollo. Though I'd like to avoid manual curation.  Perhaps it is worth talking to the Augustus developers to see why Augustus was making the exon error in my key gene that led me to ditching it altogether.
>>>> 
>>>> Agree there are varying qualities of draft assemblies.  In our case, we did 100X Illumina hybrid assembly w/ 50X PacBio.  The local structure so far seems to be pretty good.
>>>> 
>>>> Good to know that the human and mouse assemblies even have gene errors, makes me feel better about how much time I've put in trying to get my genome annotation perfect :) 
>>>> 
>>>> All the best,
>>>> -Tim
>>>> 
>>>> On Jul 10, 2017, at 3:20 PM, Carson Holt <carsonhh at gmail.com> wrote:
>>>> 
>>>>> est2genome and protein2genome will almost always be partial. Also the error rate on draft assemblies is much higher than most people realize. Beyond issues already mentioned in the previous e-mail, there is also the issue that organisms are diploid, but the assembly is haploid, so variation gets squashed which also breaks ORFs (there are several examples of this in both the mature human and mouse genome assemblies). For many draft assemblies, you can expect ORF affecting errors in as much as 10-15% of your annotations.
>>>>> 
>>>>> Try opening the cases with issues and manually editing them in Apollo. Possible sources of sequence guiding the annotation may become more apparent (look at mismatches in the mRNA-seq alignments relative to the assembly for example). And if not, and the region is just too complex for the predictor, then you can force the model with Apollo. 
>>>>> 
>>>>> —Carson
>>>>> 
>>>>> 
>>>>>> On Jul 6, 2017, at 6:45 AM, Tim Fallon <tfallon at mit.edu> wrote:
>>>>>> 
>>>>>> Hi Carson,
>>>>>> 
>>>>>> This region is definitely entirely correct at the genomic nucleotide level, no misassemblies.  Would you have any strong reservations about ditching the ab-initio prediction and sticking entirely with the est2genome predictions and protein2genome predictions?  Right now this is what I’m thinking, as troubleshooting the ab-initio training seems like it could be a long road.
>>>>>> 
>>>>>> All the best,
>>>>>> -Tim
>>>>>> 
>>>>>>> On Jun 26, 2017, at 6:00 PM, Carson Holt <carsonhh at gmail.com> wrote:
>>>>>>> 
>>>>>>> Augustus uses an HMM with scoring bonuses for evidence match. If a difference in the assembly breaks the ORF anywhere in the transcript relative to the evidence or removes high scoring transcript start/stop sequences, then Augustus will add/skip exons or trim/extend transcripts to capture what scoring bonuses it can as best it can. So wherever you see Augustus behaving weirdly, you likely have something off in the assembly (small stretch of NN’s or single basepair duplications/deletions that affect the ORF and scoring model). So what Augustus produces is the best fit gene model to hop around assembly anomalies while still producing a canonical model. 
>>>>>>> 
>>>>>>> In areas like the ones I describe above, EVM refuses to produce any model. So you can experiment with the EVM options in MAKER3, but what you may find is that problem regions tend to get no models with EVM. I believe using the pred_gff trick I mentioned previously may be the easiest work around. Also make sure to prefilter mRNA-seq evidence to avoid transcript joining (trinity has a jaccard_clip option which can help). Because if you are getting transcript joining in proteins, you are almost certain to get it in transcript evidence as well.
>>>>>>> 
>>>>>>> —Carson
>>>>>>> 
>>>>>>>> On Jun 22, 2017, at 10:59 PM, Tim Fallon <tfallon at mit.edu> wrote:
>>>>>>>> 
>>>>>>>> Hi Carson,
>>>>>>>> 
>>>>>>>> Thanks for the response! After sending my initial email, I did notice this particular issue was warned about in the Campbell et al. 2014 Maker protocols paper.  Perhaps future versions of the pipeline might have a workaround or warning for this presumably common issue. At least in my case, the genome I’m annotating has large introns, and also tandem gene clusters of homologous genes, so I’ve been unable to solve this issue entirely by changing existing parameters (e.g. split_hit), though perhaps exonerate / protein2genome direct gene annotation does handle it correctly.
>>>>>>>> 
>>>>>>>> Regarding protein2genome only being an intermediate stage: as I’ve been working towards a final annotation, I’ve actually been mostly relying on the protein2genome direct gene annotation, because although I have a trained Augustus that is presumably getting the hints from the evidence, my main target genes have been producing subtly wrong gene models (Augustus produced a splice site off by a handful of nucleotides, leading to unintended & unsupported amino acids in the protein).  I also trained SNAP, but those predictions were worse than the Augustus predictions.
>>>>>>>> 
>>>>>>>> Do you have any tips for using the Evidence Modeler integration of the Maker 3.0.0 beta?  That seems to be the best way to have the final gene models rely more on extrinsic evidence over my mildly incorrect ab-initio predictions.  Or perhaps PASA is more appropriate for gene-models that would strictly adhere to extrinsic de novo assembled transcript / predicted ORF evidence?
>>>>>>>> 
>>>>>>>> All the best,
>>>>>>>> -Tim
>>>>>>>> 
>>>>>>>>> On Jun 23, 2017, at 12:27 AM, Carson Holt <carsonhh at gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>> The protein_match features are the direct BLASTX results. Because of how BLAST works, if you have neighboring paralogs, it can place HSPs in both. So the final hit ends up being too large. The protein2genome feature is then the result of exonerate polishing these BLAST alignments (this will usually remove false merging and bad exon order). The protein2genome=1 option, on the other hand, just tells Maker that you want to try and convert the exonerate hits directly into gene models (only do this for training and not final annotation). One way to drop the BLASTX results may be to filter the GFF3 results to keep only protein2genome features, pass those into protein_gff, and then turn off protein= for the next run. This forces the BLASTX results to be dropped. You may want to set the blast_depth parameters to something like 10 in maker_bopts.ctl before doing this to trim per-locus evidence depth to 10 if you are using too much input data.
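
The GFF3 filtering step suggested here could look roughly like the sketch below. It assumes that the source column (column 2) of the Maker GFF3 output is "protein2genome" for the exonerate-polished alignments and that any embedded sequence follows a ##FASTA line; the function name and both file names are hypothetical.

```python
import os
import tempfile


def keep_source(lines, source="protein2genome"):
    """Yield header lines plus feature lines whose source column matches.

    Stops at a ##FASTA directive so embedded sequence is not copied.
    """
    for line in lines:
        if line.startswith("##FASTA"):
            break
        if line.startswith("#"):
            yield line                      # keep GFF3 pragmas and comments
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) == 9 and columns[1] == source:
            yield line


def filter_gff(in_path, out_path, source="protein2genome"):
    """Write the filtered features from in_path to out_path."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(keep_source(src, source))


# Tiny demonstration on a made-up file in a temporary directory.
demo = [
    "##gff-version 3\n",
    "chr1\tblastx\tprotein_match\t1\t100\t.\t+\t.\tID=a\n",
    "chr1\tprotein2genome\tprotein_match\t1\t90\t.\t+\t.\tID=b\n",
    "##FASTA\n",
    ">chr1\n",
    "ACGT\n",
]
workdir = tempfile.mkdtemp()
all_path = os.path.join(workdir, "genome.all.gff")          # hypothetical
out_path = os.path.join(workdir, "protein2genome_only.gff")  # hypothetical
with open(all_path, "w") as handle:
    handle.writelines(demo)
filter_gff(all_path, out_path)
with open(out_path) as handle:
    kept = handle.readlines()
print("".join(kept), end="")
```

The filtered file would then be the one given to protein_gff, with protein= left empty for the next run, as described in the message above.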
>>>>>>>>> 
>>>>>>>>> —Carson
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On Jun 13, 2017, at 11:35 AM, Tim Fallon <tfallon at mit.edu> wrote:
>>>>>>>>>> 
>>>>>>>>>> Hi there,
>>>>>>>>>> 
>>>>>>>>>> I am aligning reference proteins to an insect genome through Maker, in preparation for using the gene models from the protein alignments as evidence to train SNAP (alongside de-novo assembled RNA-Seq).  I also plan on passing the protein alignments to a future Maker run as hints for SNAP / Augustus.
>>>>>>>>>> 
>>>>>>>>>> I’ve noticed that the Maker blastx “protein_match” feature, which I presume is a result of Maker trying to make the blastx HSPs contiguous to format as a reference for exonerate (this Maker run did have protein2genome turned on), tends to fuse tandem genes from the same gene family.  See attached image.
>>>>>>>>>> 
>>>>>>>>>> The red regions highlight two de novo assembled transcripts which I aligned manually, from two genes that are homologous.  The top track is the blastx “match_part” features, the bottom track is the blastx “protein_match” features.  You can see that the protein_match fuses the two genes, spanning ~1000 bp of intervening sequence that has no blastx HSP support in the “match_part” track.  The cause seems to be that a single reference protein has blastx matches on both the left and right gene.
>>>>>>>>>> 
>>>>>>>>>> Clearly this isn’t a good gene model to train SNAP with, but would this misannotation screw up the hints passed to pretrained SNAP / Augustus?
>>>>>>>>>> 
>>>>>>>>>> Is there any way to prevent this protein_match fusing of adjacent similar genes from happening?  For species that are closer, I’ve set the “eval_blastx” to be a lot more stringent (1e-50), and in that case the genes don’t get fused (but, with that level of stringent search, it is more like an orthology search, rather than just annotating general protein similarity).  I do have (rare) introns ~1000 bp, so I wouldn’t want to change the Maker “split_hit” parameter to be too low.
>>>>>>>>>> 
>>>>>>>>>> All the best,
>>>>>>>>>> -Tim
>>>>>>>>>> 
>>>>>>>>>> Timothy R. Fallon
>>>>>>>>>> PhD candidate
>>>>>>>>>> Laboratory of Jing-Ke Weng
>>>>>>>>>> Department of Biology
>>>>>>>>>> MIT
>>>>>>>>>> 
>>>>>>>>>> tfallon at mit.edu
>>>>>>>>>> 
>>>>>>>>>> <protein_match_example.png>

Timothy R. Fallon
PhD candidate
Laboratory of Jing-Ke Weng
Department of Biology
MIT

tfallon at mit.edu
