[maker-devel] Filter transcripts to improve annotation quality ?

Thu Oct 27 09:08:15 MDT 2016

I do believe that you are getting a number of false-positive genes because of under-masking, so taking a more careful strategy (i.e. using the suggestions given by Michael) should mitigate that. You will have to decide how aggressive to be with the repeat masking (i.e. the sensitivity/specificity balance). I would, however, turn off alt_splice. It has a very high threshold for how clean and complete the mRNA alignments and repeat masking have to be in order to function correctly (which is why it is off by default). Given the filtering being done to pull back on repeat masking, your data likely does not meet that threshold. alt_splice won't really produce more genes, but you will get many spurious alternate transcripts.

Also, for the gene count, make sure not to count from the FASTA file; that gives the transcript count. You have to count the "gene" feature lines in the GFF3 to get the gene count, e.g.:

    grep -P -c "\tgene\t" models.gff
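
For anyone who prefers to do this count in a script, here is a minimal Python sketch of the same idea: tally column 3 of the GFF3 so that the gene count and the transcript (mRNA) count can be compared side by side. The function name and the example records are made up for illustration; the only assumptions about the file are standard 9-column GFF3 lines and an optional `##FASTA` section at the end.

```python
from collections import Counter


def count_feature_types(gff_lines):
    """Count GFF3 feature types (column 3), skipping comments and embedded FASTA."""
    counts = Counter()
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # sequence section follows; no more feature lines
        if not line.strip() or line.startswith("#"):
            continue  # blank line, directive, or comment
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 9:
            counts[fields[2]] += 1
    return counts


# Made-up example: one gene carrying two transcripts.
example = [
    "##gff-version 3\n",
    "contig1\tmaker\tgene\t100\t900\t.\t+\t.\tID=g1\n",
    "contig1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=g1-RA;Parent=g1\n",
    "contig1\tmaker\tmRNA\t100\t800\t.\t+\t.\tID=g1-RB;Parent=g1\n",
]
counts = count_feature_types(example)
print("genes:", counts["gene"], "transcripts:", counts["mRNA"])
```

On a real file, pass the open file handle instead of the list, e.g. `with open("models.gff") as fh: counts = count_feature_types(fh)`. With alt_splice on, the transcript count can be well above the gene count, which is why counting FASTA records overstates the number of genes.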

—Carson

> On Oct 27, 2016, at 8:34 AM, Mohamed Amine CHEBBI <mohamed.amine.chebbi at univ-poitiers.fr> wrote:
> 
> 
> 
> Thank you, Michael, for your response.
> 
> As you suggested, I will use Augustus and SNAP, both trained on the assembled transcripts in a bootstrap fashion.
> 
> For the masking, I intend to adapt Carson's strategy:
> 
> ·         Collect the RepeatModeler repeats.lib.
> ·         Search the sequences in Modelerunknown.lib against a transposase database (derived from the RepeatMasker <http://www.repeatmasker.org/> package and Kennedy et al. (2011) <http://www.ncbi.nlm.nih.gov/pubmed/21535899>), and treat sequences matching transposases as transposons.
> ·         Exclude gene fragments from both the known and unknown repeats.
> ·         As I am concerned about gene duplications, remove the remaining sequences in the unknown library that are present fewer than 10 times.
> 
> Thank you again for your time; I remain open to any suggestions.
>  
> Best,
> Amine
> 
> 
> On 27/10/2016 at 15:21, Michael Campbell wrote:
>> I think that if you train any further you will run the risk of overtraining. Setting alt_splice to 1 will add transcripts but not genes, so the gene count is going to be related to the training of the gene finder. I would recommend looking at a few of your large scaffolds in a genome browser. I would also recommend adding a second gene predictor such as Augustus. When multiple predictors are used and the models they predict converge, you can have more confidence in the gene prediction.
>> 
>> For the masking, you can make a species-specific repeat library, like Carson suggested, to see if the gene count comes down a little. If you are concerned about masking duplicated genes, you can do a couple of things. You can filter the repeat library based on known proteins. You can also set a copy-number minimum for the masking and only include repeats that are present more than 10 times in the genome. Here are a couple of URLs for making species-specific repeat libraries:
>> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction-Advanced
>> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction-Basic
>> 
>> Take care,
>> Mike
>>  
>>> On Oct 27, 2016, at 5:54 AM, Mohamed Amine CHEBBI <mohamed.amine.chebbi at univ-poitiers.fr> wrote:
>>> 
>>> Sorry, the X and Y axes were switched in the plot by mistake. Please find the corrected AED graph attached.
>>> 
>>> Round 3 (red curve) shows a slightly higher overall AED than round 2 (green curve) and more genes (22931 compared to 22547 in round 2). Do you think that I should stop at the second round?
>>> 
>>> I did not specify in my previous email that the repeat masking was done in MAKER using Repbase and only those models found by RepeatModeler that have an assigned identity. I left the unknown library from RepeatModeler unmasked. We expect a high rate of segmental and gene duplication in this genome, which could explain the high overall gene count found by MAKER.
>>> 
>>> On the other hand, the high number of genes may also be explained by the fact that I activated the alt_splice=1 option to find alternative splicing. Do you think that was a good idea?
>>> 
>>>  
>>> 
>>> Thank you very much for your time. 
>>> 
>>> 
>>> 
>>> Best,
>>> 
>>> Amine
>>> 
>>> 
>>> 
>>> On 26/10/2016 at 20:06, Carson Holt wrote:
>>>> Sorry. I also assumed X and Y were flipped when I looked at it. Now that I have read the labels, your AED curve would be weird unless the X and Y are flipped in your figure.
>>>> 
>>>> —Carson
>>>> 
>>>> 
>>>>> On Oct 26, 2016, at 12:04 PM, Carson Holt <carsonhh at gmail.com> wrote:
>>>>> 
>>>>> Your AED curve looks fine. The first run (using protein2genome or est2genome, I assume) will always have a really low overall AED because the models are exact copies of the protein/transcript alignments (so AED is meaningless there; it will always look artificially good). The protein2genome and est2genome models also have a hard end-to-end coverage filtering cutoff of 0.5 when generated (apparent in the curve; the value is set in maker_bopts.ctl). The next runs with SNAP show >80% of models with AED under 0.5, so it looks good. You can examine the models further by adding protein domains with InterProScan: you would expect 70-80% of models to contain a recognizable InterPro domain (false and bad models will result in very low overall domain content).
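
The "fraction of models with AED at or below a cutoff" check described above can be sketched in a few lines of Python. This is only an illustration, not part of MAKER: it assumes the transcripts are `mRNA` features whose attribute column carries an `_AED=` tag, as in MAKER's GFF3 output, and the function names and example records are made up.

```python
import re

# Matches the _AED tag only (not _eAED) in the GFF3 attribute column.
AED_RE = re.compile(r"(?:^|;)_AED=([0-9.]+)")


def aed_values(gff_lines):
    """Collect the AED of every mRNA feature that carries an _AED attribute."""
    values = []
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence section; no more features
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "mRNA":
            continue
        match = AED_RE.search(fields[8])
        if match:
            values.append(float(match.group(1)))
    return values


def fraction_at_or_below(values, cutoff):
    """One point of the cumulative AED curve: share of models with AED <= cutoff."""
    if not values:
        return 0.0
    return sum(v <= cutoff for v in values) / len(values)


# Made-up example: four transcripts with AED 0.10, 0.30, 0.45 and 0.80.
example = [
    "c1\tmaker\tmRNA\t1\t500\t.\t+\t.\tID=t1;Parent=g1;_AED=0.10;_eAED=0.10\n",
    "c1\tmaker\tmRNA\t600\t900\t.\t+\t.\tID=t2;Parent=g2;_AED=0.30;_eAED=0.35\n",
    "c2\tmaker\tmRNA\t1\t400\t.\t-\t.\tID=t3;Parent=g3;_AED=0.45;_eAED=0.45\n",
    "c2\tmaker\tmRNA\t500\t700\t.\t-\t.\tID=t4;Parent=g4;_AED=0.80;_eAED=0.90\n",
]
aeds = aed_values(example)
print(fraction_at_or_below(aeds, 0.5))
```

Evaluating `fraction_at_or_below` over a range of cutoffs from 0 to 1 gives the points of a cumulative AED curve, so different rounds can be compared at the same cutoffs.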
>>>>> 
>>>>> Your overall gene counts are a little high for an arthropod, though (14,000-19,000 genes would be expected, as gene loss rather than gene gain is the primary evolutionary force in the Ecdysozoa). Your gene counts could be explained by insufficient repeat masking (you can add a RepeatModeler-generated library to the existing settings to help with this); by poor mRNA-seq assembly or a lot of noise in the RNA-seq (stricter assembly parameters, including the jaccard-clip option in Trinity, can help); or simply by assembly fragmentation (if you have a lot of contigs or runs of NNNN in the assembly, many genes will be split, which inflates gene counts).
>>>>> 
>>>>> Finally, manually look at the most gene-dense contigs in a browser like Apollo or IGV (gene_density = gene_count / contig_length). If the most gene-dense contigs are overwhelmingly single-exon, then you may need to filter out some prokaryotic assembly contamination (not uncommon). Contamination assembles as independent contigs, so it is easily blacklisted and can be identified visually (always gene-dense and single-exon).
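
To pick which contigs to open in the browser first, the gene_density formula above can be combined with a single-exon fraction per contig. The Python sketch below is illustrative only: the function names are hypothetical, contig lengths are assumed to be supplied separately (for example from a FASTA index), and a transcript is called single-exon when exactly one `exon` feature lists it as `Parent`.

```python
from collections import Counter


def parse_attrs(col9):
    """Turn a GFF3 attribute column ('ID=x;Parent=y') into a dict."""
    return dict(kv.split("=", 1) for kv in col9.split(";") if "=" in kv)


def gene_density_report(gff_lines, contig_lengths):
    """Return (contig, genes per bp, single-exon transcript fraction), densest first."""
    genes = Counter()   # contig -> number of gene features
    mrna_contig = {}    # mRNA ID -> contig it lies on
    exons = Counter()   # mRNA ID -> number of exon features
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        attrs = parse_attrs(fields[8])
        if fields[2] == "gene":
            genes[fields[0]] += 1
        elif fields[2] == "mRNA" and "ID" in attrs:
            mrna_contig[attrs["ID"]] = fields[0]
        elif fields[2] == "exon":
            # An exon shared by several transcripts lists them comma-separated.
            for parent in attrs.get("Parent", "").split(","):
                exons[parent] += 1
    total, single = Counter(), Counter()
    for mrna_id, contig in mrna_contig.items():
        total[contig] += 1
        if exons[mrna_id] == 1:
            single[contig] += 1
    report = []
    for contig, n_genes in genes.items():
        length = contig_lengths.get(contig)
        if not length:
            continue  # no known length for this contig; skip it
        frac = single[contig] / total[contig] if total[contig] else 0.0
        report.append((contig, n_genes / length, frac))
    report.sort(key=lambda row: row[1], reverse=True)
    return report


# Made-up example: contigA is short, gene-dense and entirely single-exon.
example = [
    "contigA\tmaker\tgene\t1\t400\t.\t+\t.\tID=gA1\n",
    "contigA\tmaker\tmRNA\t1\t400\t.\t+\t.\tID=tA1;Parent=gA1\n",
    "contigA\tmaker\texon\t1\t400\t.\t+\t.\tID=eA1;Parent=tA1\n",
    "contigA\tmaker\tgene\t500\t900\t.\t+\t.\tID=gA2\n",
    "contigA\tmaker\tmRNA\t500\t900\t.\t+\t.\tID=tA2;Parent=gA2\n",
    "contigA\tmaker\texon\t500\t900\t.\t+\t.\tID=eA2;Parent=tA2\n",
    "contigB\tmaker\tgene\t1\t5000\t.\t+\t.\tID=gB1\n",
    "contigB\tmaker\tmRNA\t1\t5000\t.\t+\t.\tID=tB1;Parent=gB1\n",
    "contigB\tmaker\texon\t1\t300\t.\t+\t.\tID=eB1;Parent=tB1\n",
    "contigB\tmaker\texon\t4000\t5000\t.\t+\t.\tID=eB2;Parent=tB1\n",
]
report = gene_density_report(example, {"contigA": 1000, "contigB": 10000})
for contig, density, single_exon_fraction in report:
    print(contig, density, single_exon_fraction)
```

Contigs at the top of the report with a single-exon fraction near 1 are the ones worth checking for prokaryotic contamination. The sketch only ranks candidates; the visual check in Apollo or IGV is still needed.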
>>>>> 
>>>>> Thanks,
>>>>> Carson
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>>> On Oct 26, 2016, at 7:09 AM, Mohamed Amine Chebbi <mohamed.amine.chebbi at univ-poitiers.fr> wrote:
>>>>>> 
>>>>>> Hi !
>>>>>> I have run three rounds of annotation in MAKER on a non-model arthropod genome (1.7 Gb), which is a hybrid assembly of PacBio and Illumina reads.
>>>>>> As suggested in the tutorial, in the first round I ran MAKER with repeat masking to generate gene models from transcript (Trinity assembly) and protein (Swiss-Prot) evidence. The MAKER models were then used twice in a bootstrap fashion to retrain SNAP.
>>>>>> The number of genes drops from 29207 in round 1 to 22547 in round 2, then increases slightly to 22931 in round 3.
>>>>>> 
>>>>>> However, the AED profile (attached) does not seem satisfactory.
>>>>>> Could you suggest a good strategy to improve the annotation quality? Do you think that filtering for good transcripts could improve the results? If so, which criteria should be taken into account?
>>>>>> Thank you.
>>>>>> 
>>>>>> Best,
>>>>>> Amine
>>>>>> <AED-Graph.pdf>
>>>>>> _______________________________________________
>>>>>> maker-devel mailing list
>>>>>> maker-devel at box290.bluehost.com
>>>>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>>>> 
>>>> 
>>> 
>>> -- 
>>> Mohamed Amine CHEBBI, PhD Student
>>> Université de Poitiers
>>> Laboratoire Ecologie et Biologie des Interactions - UMR CNRS 7267
>>> Equipe Ecologie Evolution Symbiose
>>> Bât. B8-B35 - 5 Rue Albert Turpin
>>> TSA 51106
>>> F-86022 Poitiers Cedex 9
>>> FRANCE
>>> Lab website: http://ecoevol.labo.univ-poitiers.fr/
>>> <AED-Graph.pdf>
>> 
> 
