[maker-devel] Filter transcripts to improve annotation quality ?

Wed Oct 26 12:04:20 MDT 2016

Your AED curve looks fine. The first run (using protein2genome or est2genome, I assume) will always have a very low overall AED because those models are exact copies of the protein/transcript alignments, so AED is meaningless there: it will always look artificially good. The protein2genome and est2genome models also pass a hard end-to-end coverage filter with a cutoff of 0.5 when they are generated (this is apparent in the curve; the value is set in maker_bopts.ctl). The next runs with SNAP show >80% of models with AED under 0.5, so that looks good. You can evaluate the models further by adding protein domains with InterProScan; you would expect 70-80% of models to contain a recognizable InterPro domain (false and bad models result in very low overall domain content).
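If you want to recompute the "fraction of models with AED under 0.5" number yourself, something along these lines should work. This is only a minimal sketch: it assumes each mRNA feature in the MAKER GFF3 carries an `_AED=` attribute in column 9, and the function names are made up for illustration. It counts models at or below each cutoff.

```python
import re

# Assumption: mRNA features carry an "_AED=<float>" attribute in column 9.
# The leading (?:^|;) keeps "_eAED=" from being matched by mistake.
AED_PATTERN = re.compile(r"(?:^|;)_AED=([0-9.]+)")


def read_aed_values(gff_lines):
    """Collect the _AED value of every mRNA feature in an iterable of GFF3 lines."""
    values = []
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence section: no more features after this
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        match = AED_PATTERN.search(cols[8])
        if match:
            values.append(float(match.group(1)))
    return values


def fraction_at_or_below(values, cutoff):
    """Fraction of AED values that are <= cutoff (0.0 for an empty list)."""
    if not values:
        return 0.0
    return sum(1 for v in values if v <= cutoff) / len(values)


def cumulative_aed(values, n_bins=10):
    """Return (cutoff, fraction) pairs for cutoffs 0, 1/n_bins, ..., 1."""
    return [(i / n_bins, fraction_at_or_below(values, i / n_bins))
            for i in range(n_bins + 1)]
```

To use it, open the merged GFF3 from a run, pass the file handle to read_aed_values(), and look at fraction_at_or_below(values, 0.5); cumulative_aed() gives the points of the curve for comparing rounds.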

Your overall gene counts are a little high for an arthropod, though (14,000-19,000 genes would be expected, as gene loss rather than gene gain is the primary evolutionary force in the Ecdysozoa). However, your gene counts could be explained by insufficient repeat masking (adding a RepeatModeler-generated library to your existing settings can help with this), by a poor mRNA-seq assembly or a lot of noise in the RNA-seq (stricter assembly parameters, including the jaccard-clip option in Trinity, can help), or simply by assembly fragmentation (if you have a lot of contigs or runs of NNNN in the assembly, many genes will be split, which inflates the gene count).
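To get a quick feel for how fragmented the assembly is, you can count sequences and runs of N directly from the FASTA. The sketch below is illustrative only: the function names are hypothetical, and the minimum run length of 10 Ns used to call a gap is an arbitrary assumption that you should adjust to how your assembler pads scaffolds.

```python
import re


def parse_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def fragmentation_summary(records, min_gap=10):
    """Summarize sequence count, total length, N50 and number of N runs.

    min_gap is an assumed threshold: runs of at least this many Ns count as a gap.
    """
    gap = re.compile(r"[Nn]{%d,}" % min_gap)
    lengths, gap_runs = [], 0
    for _, seq in records:
        lengths.append(len(seq))
        gap_runs += len(gap.findall(seq))
    total = sum(lengths)
    # N50: length of the sequence at which the running sum, taken from the
    # longest sequence down, first reaches half of the total assembly length.
    n50, running = 0, 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            n50 = length
            break
    return {"sequences": len(lengths), "total_bp": total,
            "n50": n50, "gap_runs": gap_runs}
```

Open the assembly FASTA and call fragmentation_summary(parse_fasta(handle)). A very large number of sequences, a small N50 relative to typical gene length, or many gap runs would all point toward split gene models.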

Finally, manually look at the most gene-dense contigs in a browser like Apollo or IGV (gene_density = gene_count / contig_length). If the most gene-dense contigs are overwhelmingly single exon, then you may need to filter out some prokaryotic assembly contamination (not uncommon). Contamination assembles as independent contigs, so it can be identified visually (always gene dense and single exon) and is easily blacklisted.
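A rough way to pick which contigs to look at first is to rank them by gene density and by the share of single-exon transcripts. The following is a sketch under a few assumptions: the GFF3 uses `gene`, `mRNA` and `exon` feature types with standard `ID`/`Parent` attributes, evidence tracks do not use the `gene` type, and contig lengths are supplied separately as a dict (for example from the FASTA). All names are illustrative, and density is scaled to genes per kb for readability.

```python
from collections import defaultdict


def parse_attributes(field):
    """Turn a GFF3 column-9 string into a dict of key -> value."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key] = value
    return attrs


def contig_gene_summary(gff_lines, contig_lengths):
    """Rank contigs by gene density (genes per kb), densest first."""
    gene_counts = defaultdict(int)   # contig -> number of gene features
    mrna_contig = {}                 # mRNA ID -> contig
    exon_counts = defaultdict(int)   # mRNA ID -> number of exons
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        seqid, ftype, attrs = cols[0], cols[2], parse_attributes(cols[8])
        if ftype == "gene":
            gene_counts[seqid] += 1
        elif ftype == "mRNA" and "ID" in attrs:
            mrna_contig[attrs["ID"]] = seqid
        elif ftype == "exon":
            # An exon may list several parents, separated by commas.
            for parent in attrs.get("Parent", "").split(","):
                if parent:
                    exon_counts[parent] += 1
    mrna_total, mrna_single = defaultdict(int), defaultdict(int)
    for mrna_id, seqid in mrna_contig.items():
        mrna_total[seqid] += 1
        if exon_counts[mrna_id] == 1:
            mrna_single[seqid] += 1
    rows = []
    for seqid, n_genes in gene_counts.items():
        length = contig_lengths.get(seqid)
        if not length:
            continue  # skip contigs with unknown length
        total = mrna_total[seqid]
        rows.append({
            "contig": seqid,
            "genes": n_genes,
            "genes_per_kb": 1000.0 * n_genes / length,
            "single_exon_fraction": mrna_single[seqid] / total if total else 0.0,
        })
    rows.sort(key=lambda row: row["genes_per_kb"], reverse=True)
    return rows
```

Contigs at the top of the list with a single_exon_fraction near 1.0 are the ones worth opening in the browser first. The ranking only narrows down where to look; it does not by itself show that a contig is contamination.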

Thanks,
Carson

> On Oct 26, 2016, at 7:09 AM, Mohamed Amine Chebbi <mohamed.amine.chebbi at univ-poitiers.fr> wrote:
> 
> Hi !
> I have run three rounds of annotation in MAKER on a non-model arthropod genome (1.7 Gb), which is a hybrid assembly of PacBio and Illumina reads.
> As suggested in the tutorial, in the first round I ran MAKER with repeat masking to generate gene models from transcript (Trinity assembly) and protein (SwissProt) evidence. The MAKER models were then used twice in a bootstrap fashion to retrain SNAP.
> The number of genes drops from 29207 in round 1 to 22547 in round 2, then increases slightly to 22931 in round 3.
> 
> However, the AED profile (attached) doesn't seem to be satisfactory.
> So I wonder if you could suggest a good strategy to improve the annotation quality. Do you think that filtering for good transcripts could improve the results? If yes, which criteria should be taken into account?
> Thank you.
> 
> Best;
> Amine
> <AED-Graph.pdf>
