[maker-devel] Fwd: exon/intron boundaries

Mon Oct 21 11:40:06 MDT 2013

Hi,

thanks for your help- I have used Trinity to assemble the ESTs but am
having a problem with the 2.29-beta release. Instead of starting and
finishing a set of scaffolds, it appears to re-start over and over again. I
am running it in parallel across 32 cores and I'm wondering if this is a
problem with the parallel implementation. I've gone through it several
times and not been able to find any errors or output that would suggest
what the problem is. Thanks -Janna

On Mon, Oct 7, 2013 at 6:20 AM, Carson Holt <carsonhh at gmail.com> wrote:

> Hi Janna,
>
> There are a couple of things to do.  Download the maker-2.29p-beta from
> the lab website.  This includes some changes made to improve the
> performance of correct_est_fusion.  Using the correct_est_fusion option
> also reduces the influence of ESTs on annotation.  So normally
> you would need to increase your protein dataset, but given that you have
> supplied 3 nematode species and all of UniProt already you should probably
> be fine. But there is one last thing you can do.  Instead of using
> cufflinks, try using trinity to assemble the ESTs.  There is a Jaccard clip
> option that reduces merging caused by overlapping UTRs.  Between the
> trinity and correct_est_fusion you should be able to really reduce the
> effect of those ESTs.
>
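
A minimal sketch of the Trinity suggestion above, added for illustration
and not taken from the thread. It only builds the command line for a
paired-end run with the Jaccard clip option enabled; nothing is executed.
The read file names, CPU count and memory value are hypothetical, and
flag names other than --jaccard_clip differ between Trinity releases, so
check the help output of the installed version before using it.

```python
import shlex


def trinity_command(left, right, cpu=32, max_memory="50G", jaccard_clip=True):
    """Build an argv list for a paired-end Trinity run (nothing is executed)."""
    cmd = [
        "Trinity",
        "--seqType", "fq",          # assumes FASTQ input
        "--left", left,             # hypothetical read file names
        "--right", right,
        "--CPU", str(cpu),
        "--max_memory", max_memory, # illustrative value; older releases used a different flag
    ]
    if jaccard_clip:
        # Option mentioned in the thread for reducing fused transcripts
        cmd.append("--jaccard_clip")
    return cmd


if __name__ == "__main__":
    # Print the command so it can be inspected before running it by hand
    print(shlex.join(trinity_command("reads_1.fq", "reads_2.fq")))
```

The resulting assembly would then be passed to MAKER as EST evidence in
place of the cufflinks output, with correct_est_fusion=1 still set.
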
> If those changes don't work there is one last option.  If you take the
> MAKER results and filter them for SNAP and Augustus ab initio results
> (match/match_part in the GFF3), then you can pass those in to the pred_gff
> options.  Then turn snaphmm and augustus_species off in the control files.
>  Basically what this will do is turn MAKER's hint based prediction off and
> force it to filter the ab initio results and select models directly from
> there.  Since the merging is being caused by bad hints (merged transcripts
> from mRNAseq) this would reduce that effect.  You will still need
> correct_est_fusion=1 though to trim UTR coming from the merged transcripts,
> because even though MAKER can't process hints to rerun SNAP and Augustus
> this way, it will try and add UTR using the EST evidence.
>
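
The filtering step described above could be sketched roughly as follows.
This is an illustration under assumptions, not a tested recipe: it keeps
match/match_part features whose source column (column 2) starts with
"snap" or "augustus", which assumes the ab initio results are labelled
with sources such as snap_masked or augustus_masked. Check column 2 of
your own GFF3 and adjust the prefixes if they differ. The function names
are hypothetical.

```python
def filter_ab_initio(lines, prefixes=("snap", "augustus")):
    """Yield GFF3 match/match_part lines whose source starts with one of prefixes."""
    prefixes = tuple(p.lower() for p in prefixes)
    for line in lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence section; no features after this point
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        source, feature_type = cols[1], cols[2]
        if feature_type in ("match", "match_part") and source.lower().startswith(prefixes):
            yield line if line.endswith("\n") else line + "\n"


def write_pred_gff(in_path, out_path):
    """Write the filtered features to out_path with a GFF3 version header."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        fout.write("##gff-version 3\n")
        fout.writelines(filter_ab_initio(fin))
```

The output file would then be given to the pred_gff option, with snaphmm
and augustus_species left empty and correct_est_fusion=1 kept, as
described in the message.
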
> --Carson
>
>
> From: Janna Fierst <jfierst at uoregon.edu>
> Date: Friday, October 4, 2013 1:06 PM
> To: <maker-devel at yandell-lab.org>
> Subject: [maker-devel] Fwd: exon/intron boundaries
>
> Hi,
>
> thanks for your reply- I have been going through our annotations in detail
> and trying different parameter sets, and I think I have identified what is
> going on but I'm not sure how to set the MAKER2 parameters for our
> situation. We are working with a species of Caenorhabditid worm and there
> are long gene-dense blocks that are being incorrectly annotated as large
> single genes instead of several smaller closely spaced genes. The protein
> alignments (tblastx and protein2genome) show very clearly where the
> exon/intron boundaries are and in most cases agree with the augustus
> predictions. The assembled cufflinks output (through blastn, est2genome and
> est_gff:cufflinks) does not agree in some locations; I think this may be
> because in some cases the UTR nearly overlaps adjacent genes.
>
> I have included a screenshot of an annotated region viewed in apollo to
> try to show this. The large gene in the middle is actually 7 different
> genes that are extremely close together and MAKER2 is collapsing them into
> a single gene. I tried running without any RNASeq/cufflinks data and MAKER2
> annotates the region as two genes instead of one, but I can't get it to
> recognize the 7 as different genes. I have retrained SNAP but we have not
> been able to successfully train Augustus; we are currently using the
> default caenorhabditis species model. I also included a species specific
> repeat library. I tried setting correct_est_fusion=1 and reducing
> pred_flank but these changes appear to really alter the annotations and we
> end up annotating almost nothing. I also tried setting est2genome=0 to
> decrease the influence of the cufflinks assembly but it didn't appear to
> help. There are some very large introns in these genes so I haven't tried
> yet decreasing the maximum intron size because I'm concerned this may
> generate too many split genes instead of our current merged gene problem.
> Thanks for your help, any advice is greatly appreciated! -Janna Fierst
>
>
> On Mon, Aug 26, 2013 at 12:21 PM, Carson Holt <carsonhh at gmail.com> wrote:
>
>> Are you getting gene fusions or just more exons?  Gene fusions can be
>> reduced by setting correct_est_fusion=1, or reducing pred_flank, although
>> reducing pred_flank can cause other issues (but those generally only appear
>> if setting the value below 150).  Also if you have the maximum intron
>> size set too high (split_hit option), you may also be generating bridging
>> alignments that make evidence align across distant paralogous genes as well
>> (this can result in gene merging).
>>
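
For trying several parameter sets, a small helper that rewrites options
in the text of a control file may be convenient. The sketch below is an
illustration only. It assumes each option sits on its own line as
key=value, optionally followed by a #comment, and the values in the
usage note are examples rather than recommendations. The function name
is hypothetical.

```python
import re


def set_ctl_options(ctl_text, updates):
    """Return ctl_text with the key=value lines named in updates rewritten.

    Trailing #comments are preserved. Raises KeyError if an option is absent.
    """
    remaining = dict(updates)
    out = []
    for line in ctl_text.splitlines():
        # key, then value up to an optional trailing comment
        m = re.match(r"^(\w+)=([^#]*?)(\s*#.*)?$", line)
        if m and m.group(1) in remaining:
            key = m.group(1)
            comment = m.group(3) or ""
            line = f"{key}={remaining.pop(key)}{comment}"
        out.append(line)
    if remaining:
        raise KeyError(f"options not found: {sorted(remaining)}")
    return "\n".join(out) + "\n"
```

For example, set_ctl_options(text, {"correct_est_fusion": 1,
"pred_flank": 150}) changes only those two lines, so each run can be
compared against an otherwise identical control file.
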
>> You should also look at your results manually in a viewer like Apollo.
>>  Then see if the extra exons are supported by something such as protein
>> alignments from another species.  If this is the case, you may have a
>> poorly annotated protein set that is being used as evidence that is
>> carrying over its erroneous exons into the species you are annotating.  If
>> the extra exons are supported by EST evidence, then perhaps you should try
>> and rebuild the EST assembly (for example trinity has an option to use a
>> Jaccard similarity coefficient to avoid fusing transcripts).
>>
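
The manual check described above can be complemented by a quick script.
The sketch below is an illustration and not part of MAKER: given exon
coordinates from a gene model and alignment coordinates from one
evidence type (for example protein match_part features), it reports
exons with little or no overlapping evidence. Coordinates are assumed to
be 1-based and inclusive as in GFF3, on the same sequence and strand,
and the 0.5 threshold is arbitrary.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent (start, end) intervals, 1-based inclusive."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def covered_fraction(exon, merged):
    """Fraction of the exon's bases that fall inside the merged intervals."""
    start, end = exon
    covered = 0
    for ev_start, ev_end in merged:
        lo, hi = max(start, ev_start), min(end, ev_end)
        if lo <= hi:
            covered += hi - lo + 1
    return covered / (end - start + 1)


def unsupported_exons(exons, evidence, min_fraction=0.5):
    """Return exons with less than min_fraction of their bases covered by evidence."""
    merged = merge_intervals(evidence)
    return [e for e in exons if covered_fraction(e, merged) < min_fraction]
```

Running it once with protein alignments and once with EST alignments
shows which evidence type, if any, supports each questionable exon.
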
>> Another option is to retrain SNAP or Augustus.  MAKER does not actually
>> produce any of the models itself (it is a pipeline, not a predictor).  The
>> models are all generated using these other algorithms; MAKER just feeds
>> them hints based on protein and transcript alignments, so making sure
>> training is sufficient is important for those programs to produce their
>> best models.
>>
>> Finally, make sure your repeat database is sufficient; you may need to
>> generate a species specific repeat library using something like
>> RepeatModeler.  Repeats can end up being included as extra exons in gene
>> models because they may contain reading frames that do code for proteins
>> (e.g. reverse transcriptases).
>>
>> If you have any questions on any of the above, just let us know.
>>
>> Thanks,
>> Carson
>>
>>
>> From: Janna Fierst <jfierst at uoregon.edu>
>> Date: Monday, August 26, 2013 2:54 PM
>> To: <maker-devel at yandell-lab.org>
>> Subject: [maker-devel] exon/intron boundaries
>>
>> Hi,
>>
>> I am using MAKER 2.28 to annotate a Caenorhabditid worm genome, and the
>> initial results appear fairly good but we seem to be annotating too many
>> exons for multiple genes. I was wondering which parameters should be tuned
>> to change the threshold for exon/intron boundaries? Thanks for your help
>> -Janna Fierst
>> _______________________________________________ maker-devel mailing list
>> maker-devel at box290.bluehost.com
>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>
>
>
>