[maker-devel] Filter transcripts to improve annotation quality ?

Thu Oct 27 03:54:31 MDT 2016

Sorry, the X and Y axes were switched in the plot by mistake. 
Please find the corrected AED graph attached.

Round 3 (red curve) shows a slightly higher overall AED than round 2 
(green curve) and more genes (22931 compared to 22547 in round 2). 
Do you think I should stop at the second round?
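
To compare the two rounds by numbers rather than by eye, the cumulative AED fraction can be computed directly from each round's GFF3. This is only a minimal sketch: it assumes the Maker GFF3 stores the score as an `_AED=` attribute on mRNA lines, and the function names and sample lines are hypothetical.

```python
import re

# Match "_AED=" only at the start of the attribute column or after ";",
# so that "_eAED=" is not picked up by accident.
AED_RE = re.compile(r"(?:^|;)_AED=([0-9.]+)")

def read_aed(gff_lines):
    """Collect AED values from mRNA features in an iterable of GFF3 lines."""
    values = []
    for line in gff_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        match = AED_RE.search(cols[8])
        if match:
            values.append(float(match.group(1)))
    return values

def fraction_at_or_below(values, cutoff):
    """Fraction of transcripts whose AED is <= cutoff (one point of the curve)."""
    if not values:
        return 0.0
    return sum(v <= cutoff for v in values) / len(values)

# Hypothetical sample lines standing in for a real Maker GFF3 file.
sample = [
    "##gff-version 3",
    "ctg1\tmaker\tgene\t1\t900\t.\t+\t.\tID=g1",
    "ctg1\tmaker\tmRNA\t1\t900\t.\t+\t.\tID=g1-RA;Parent=g1;_AED=0.10;_eAED=0.10",
    "ctg1\tmaker\tmRNA\t1200\t2000\t.\t-\t.\tID=g2-RA;Parent=g2;_AED=0.60;_eAED=0.62",
]
aed = read_aed(sample)
print(fraction_at_or_below(aed, 0.5))
```

Running the same function on the round 2 and round 3 files at a few cutoffs (for example 0.25, 0.5, 0.75) would give a small table to set next to the plot.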

I did not specify in my previous email that the repeat masking was done 
in Maker using Repbase plus only those RepeatModeler models that have an 
identity. I left the unknown RepeatModeler library unmasked. In fact, we 
expect a high rate of segmental and gene duplication in this genome, 
which could explain the high overall gene count found by Maker.

On the other hand, the high number of genes may also be explained by the 
fact that I activated the alt_splice=1 option to find alternative 
splicing. Do you think that was a good idea?
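
One way to check this would be to count gene and mRNA features separately: if the extra isoforms are reported as additional mRNA records under an existing gene, the mRNA count grows while the gene count does not. A minimal sketch, assuming a standard nine-column GFF3 with `maker` in the source column; the function name and sample lines are hypothetical.

```python
from collections import Counter

def count_features(gff_lines, source="maker"):
    """Count feature types (column 3) for records from one source (column 2)."""
    counts = Counter()
    for line in gff_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9 and cols[1] == source:
            counts[cols[2]] += 1
    return counts

# Hypothetical sample: gene g1 has two isoforms, gene g2 has one.
sample = [
    "##gff-version 3",
    "ctg1\tmaker\tgene\t1\t900\t.\t+\t.\tID=g1",
    "ctg1\tmaker\tmRNA\t1\t900\t.\t+\t.\tID=g1-RA;Parent=g1",
    "ctg1\tmaker\tmRNA\t1\t700\t.\t+\t.\tID=g1-RB;Parent=g1",
    "ctg1\tmaker\tgene\t1200\t2000\t.\t-\t.\tID=g2",
    "ctg1\tmaker\tmRNA\t1200\t2000\t.\t-\t.\tID=g2-RA;Parent=g2",
    "ctg1\trepeatmasker\tmatch\t3000\t3100\t.\t+\t.\tID=r1",
]
counts = count_features(sample)
print(counts["gene"], counts["mRNA"], counts["mRNA"] / counts["gene"])
```

A ratio of mRNA to gene close to 1 would suggest that alternative splicing is not what drives the totals.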

Thank you very much for your time.

Best,

Amine

On 26/10/2016 at 20:06, Carson Holt wrote:
> Sorry. I also assumed X and Y were flipped when I looked at it. Now 
> that I have read the labels, your AED curve would be weird unless the 
> X and Y are flipped in your figure.
>
> —Carson
>
>
>> On Oct 26, 2016, at 12:04 PM, Carson Holt <carsonhh at gmail.com> wrote:
>>
>> Your AED curve looks fine. The first run (using protein2genome or 
>> est2genome I assume) will always have really low overall AED because 
>> they are exact copies of the protein/transcript alignments (so AED is 
>> meaningless there because it will always artificially look good). The 
>> protein2genome or est2genome models also have a hard end-to-end 
>> coverage filtering cutoff of 0.5 when generated (apparent in the 
>> curve - value in maker_bopts.ctl). The next runs with SNAP show >80% 
>> of models with AED under 0.5, so it looks good. You can further look 
>> at models by adding protein domains using InterProScan in which you 
>> would expect 70-80% of models to contain a recognizable InterPro 
>> domain (false and bad models will result in very low overall domain 
>> content).
>>
>> Your overall gene counts are a little high though for an arthropod 
>> (14,000-19,000 genes would be expected as gene loss rather than gene 
>> gain is the primary evolutionary force in the Ecdysozoa). However 
>> your gene counts can be explained by either insufficient repeat 
>> masking (you can add a RepeatModeler generated library to the 
>> existing settings to help with this), poor mRNA-seq assembly or a lot 
>> of noise in the RNA-seq (this can be helped with more strict assembly 
>> parameters including the --jaccard_clip option in Trinity), or it is 
>> just the result of assembly fragmentation (if you have a lot of 
>> contigs or runs of NNNN in the assembly, then many genes will be 
>> split which results in inflated gene counts).
>>
>> Finally, manually look at the most gene dense contigs in a browser 
>> like Apollo or IGV (gene_density = gene_count / contig_length). If 
>> the most gene dense contigs are overwhelmingly single exon, then you 
>> may need to filter out some prokaryotic assembly contamination (not 
>> uncommon). If you have contamination, it will assemble as independent 
>> contigs, so is easily blacklisted and can be identified visually 
>> (always gene dense and single exon).
>>
>> Thanks,
>> Carson
>>
>>
>>
>>
>>> On Oct 26, 2016, at 7:09 AM, Mohamed Amine Chebbi 
>>> <mohamed.amine.chebbi at univ-poitiers.fr> wrote:
>>>
>>> Hi !
>>> I have run three rounds of annotation in Maker on a non-model 
>>> arthropod genome (1.7 Gb), which is a hybrid assembly of PacBio and 
>>> Illumina reads.
>>> As suggested in the tutorial, in the first round I ran Maker with 
>>> repeat masking to generate gene models from transcript (Trinity 
>>> assembly) and protein (Swiss-Prot) evidence. The Maker models were 
>>> then used twice in a bootstrap fashion to retrain SNAP.
>>> The number of genes drops from 29207 in round 1 to 22547 in 
>>> round 2, then increases slightly to 22931 in round 3.
>>>
>>> However, the AED profile (attached) does not seem satisfactory.
>>> So I wonder if you could suggest a good strategy to improve the 
>>> annotation quality. Do you think that filtering for good transcripts 
>>> could improve the results? If so, which criteria should be taken into 
>>> account?
>>> Thank you.
>>>
>>> Best,
>>> Amine
>>> <AED-Graph.pdf>
>>
>
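
Carson's suggestion quoted above, ranking contigs by gene_density = gene_count / contig_length and checking whether the densest ones are mostly single exon, can be sketched as follows. This is a rough sketch under assumptions: it expects GFF3 `gene` and `exon` features with a `Parent=` attribute on exons, and the function names, contig names, and lengths are hypothetical.

```python
from collections import defaultdict

def parse_attr(attrs, key):
    """Return the value of one key from a GFF3 attribute column, or None."""
    for field in attrs.split(";"):
        if field.startswith(key + "="):
            return field[len(key) + 1:]
    return None

def contig_summary(gff_lines, contig_lengths):
    """Return (contig, gene_density, single_exon_fraction), densest first."""
    genes = defaultdict(int)
    exons = defaultdict(lambda: defaultdict(int))  # contig -> transcript -> exons
    for line in gff_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        contig, ftype, attrs = cols[0], cols[2], cols[8]
        if ftype == "gene":
            genes[contig] += 1
        elif ftype == "exon":
            # An exon may list several parent transcripts separated by commas.
            for parent in (parse_attr(attrs, "Parent") or "").split(","):
                if parent:
                    exons[contig][parent] += 1
    rows = []
    for contig, n_genes in genes.items():
        per_tx = list(exons[contig].values())
        single = sum(1 for c in per_tx if c == 1)
        frac = single / len(per_tx) if per_tx else 0.0
        rows.append((contig, n_genes / contig_lengths[contig], frac))
    rows.sort(key=lambda r: r[1], reverse=True)
    return rows

# Hypothetical data: ctgA is short, gene dense, and single exon.
lengths = {"ctgA": 1000, "ctgB": 10000}
sample = [
    "ctgA\tmaker\tgene\t1\t300\t.\t+\t.\tID=a1",
    "ctgA\tmaker\texon\t1\t300\t.\t+\t.\tID=a1e1;Parent=a1-RA",
    "ctgA\tmaker\tgene\t400\t800\t.\t+\t.\tID=a2",
    "ctgA\tmaker\texon\t400\t800\t.\t+\t.\tID=a2e1;Parent=a2-RA",
    "ctgB\tmaker\tgene\t1\t5000\t.\t+\t.\tID=b1",
    "ctgB\tmaker\texon\t1\t500\t.\t+\t.\tID=b1e1;Parent=b1-RA",
    "ctgB\tmaker\texon\t2000\t2500\t.\t+\t.\tID=b1e2;Parent=b1-RA",
    "ctgB\tmaker\texon\t4500\t5000\t.\t+\t.\tID=b1e3;Parent=b1-RA",
]
rows = contig_summary(sample, lengths)
for contig, density, single_frac in rows:
    print(contig, density, single_frac)
```

Contigs at the top of this list with a single-exon fraction near 1 would be the first candidates to inspect in Apollo or IGV.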

-- 
Mohamed Amine CHEBBI, PhD Student
Université de Poitiers
Laboratoire Ecologie et Biologie des Interactions - UMR CNRS 7267
Equipe Ecologie Evolution Symbiose
Bât. B8-B35 - 5 Rue Albert Turpin
TSA 51106
F-86022 Poitiers Cedex 9
FRANCE
Lab website: http://ecoevol.labo.univ-poitiers.fr/

-------------- next part --------------
A non-text attachment was scrubbed...
Name: AED-Graph.pdf
Type: application/pdf
Size: 5302 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20161027/2afa34c1/attachment-0002.pdf>