[maker-devel] Filter transcripts to improve annotation quality ?

Thu Oct 27 08:34:02 MDT 2016

Thank you Michael for your response.

As you suggested, I will use Augustus and SNAP, both trained on the 
assembled transcripts in a bootstrap fashion.

For the masking, I intend to adapt Carson's strategy:

· Collecting the RepeatModeler repeats.lib.

· Searching the sequences in Modelerunknown.lib against a transposase 
database (derived from the RepeatMasker 
<http://www.repeatmasker.org/> package and Kennedy et al. (2011) 
<http://www.ncbi.nlm.nih.gov/pubmed/21535899>) and treating sequences 
matching transposases as transposons.

· Excluding gene fragments from both the known and unknown repeats.

· As I'm concerned about gene duplications, removing the remaining 
sequences in the unknown lib that are present fewer than 10 times.
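
The copy-number step in the list above could be sketched roughly as follows. This is only an illustration in Python, not part of Maker or RepeatModeler. It assumes the unknown library has already been searched against the genome and that the hits are available as BLAST tabular (outfmt 6) lines with the library sequence ID in the first column. The function names and the min_copies parameter are hypothetical, and counting hit lines is only a crude proxy for copy number, since one genomic copy can produce several hits.

```python
from collections import Counter


def count_hits(blast_tab_lines):
    """Count hit lines per query ID (column 1) in BLAST tabular output."""
    counts = Counter()
    for line in blast_tab_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        counts[line.split("\t")[0]] += 1
    return counts


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif header is not None:
            seq.append(line.strip())
    if header is not None:
        yield header, "".join(seq)


def filter_by_copy_number(fasta_lines, counts, min_copies=10):
    """Keep library sequences with at least min_copies hits (assumed cutoff)."""
    kept = []
    for header, seq in read_fasta(fasta_lines):
        # The sequence ID is taken as the first whitespace-delimited token.
        seq_id = header.split()[0] if header.strip() else ""
        if counts.get(seq_id, 0) >= min_copies:
            kept.append((header, seq))
    return kept
```

Sequences with fewer than min_copies hits are dropped, so they stay unmasked. Whether the cutoff should be "at least 10" or "more than 10" is a judgment call, which is why it is left as a parameter here.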

Thank you again for your time. I remain open to any suggestions.

Best,

Amine

On 27/10/2016 at 15:21, Michael Campbell wrote:
> I think that if you train any further you will run the risk of 
> overtraining. Setting alt_splice to 1 will add transcripts but not 
> genes, so the gene count is going to be related to the training of the 
> gene finder. I would recommend looking at a few of your large 
> scaffolds in a genome browser. I would also recommend adding a second 
> gene predictor such as Augustus. When multiple predictors are used and 
> the models they predict converge, you can have more confidence in the 
> gene prediction.
>
> For the masking you can make a species-specific repeat library like 
> Carson suggested to see if the gene count comes down a little. If you 
> are concerned about masking duplicated genes you can do a couple of 
> things. You can filter the repeat library based on known proteins. You 
> can also set a copy number minimum for the masking and only include 
> repeats that are present more than 10 times in the genome. Here are a 
> couple of URLs for making species-specific repeat libraries:
> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction-Advanced
> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction-Basic
>
> Take care,
> Mike
>
>> On Oct 27, 2016, at 5:54 AM, Mohamed Amine CHEBBI 
>> <mohamed.amine.chebbi at univ-poitiers.fr 
>> <mailto:mohamed.amine.chebbi at univ-poitiers.fr>> wrote:
>>
>>
>>
>>
>> Sorry, the X and Y axes were switched in the plot by mistake. 
>> Please find attached the corrected AED graph.
>>
>> Round 3 (red curve) shows a slightly higher overall AED than 
>> round 2 (green curve) and more genes (22931 compared to 22547 
>> in round 2). Do you think that I should stop at the second round?
>>
>> I did not specify in my previous email that the repeat masking was 
>> done in Maker using Repbase and only the models found by 
>> RepeatModeler that have identities. I left the unknown lib of 
>> RepeatModeler unmasked. In fact, we expect a high rate of segmental 
>> and gene duplication in the genome, which could explain the high 
>> overall count of genes found by Maker.
>>
>> On the other hand, the high number of genes may also be explained by 
>> the fact that I activated the alt_splice=1 option to find alternative 
>> splicing. Do you think that was a good idea?
>>
>> Thank you very much for your time.
>>
>>
>>
>> Best,
>>
>> Amine
>>
>>
>>
>> On 26/10/2016 at 20:06, Carson Holt wrote:
>>> Sorry. I also assumed X and Y were flipped when I looked at it. Now 
>>> that I read the labels, your AED curve would be weird unless the X 
>>> and Y are flipped in your figure.
>>>
>>> —Carson
>>>
>>>
>>>> On Oct 26, 2016, at 12:04 PM, Carson Holt <carsonhh at gmail.com 
>>>> <mailto:carsonhh at gmail.com>> wrote:
>>>>
>>>> Your AED curve looks fine. The first run (using protein2genome or 
>>>> est2genome I assume) will always have really low overall AED 
>>>> because they are exact copies of the protein/transcript alignments 
>>>> (so AED is meaningless there because it will always artificially 
>>>> look good). The protein2genome or est2genome models also have a 
>>>> hard end-to-end coverage filtering cutoff of 0.5 when generated 
>>>> (apparent in the curve - value in maker_bopts.ctl). The next runs 
>>>> with SNAP show >80% of models with AED under 0.5, so it looks good. 
>>>> You can further look at models by adding protein domains using 
>>>> InterProScan in which you would expect 70-80% of models to contain 
>>>> a recognizable InterPro domain (false and bad models will result in 
>>>> very low overall domain content).
>>>>
>>>> Your overall gene counts are a little high though for an arthropod 
>>>> (14,000-19,000 genes would be expected as gene loss rather than 
>>>> gene gain is the primary evolutionary force in the Ecdysozoa). 
>>>> However your gene counts can be explained by either insufficient 
>>>> repeat masking (you can add a RepeatModeler generated library to 
>>>> the existing settings to help with this), poor mRNA-seq assembly or 
>>>> a lot of noise in the RNA-seq (this can be helped with more strict 
>>>> assembly parameters including the jaccard-clip option in Trinity), 
>>>> or it is just the result of assembly fragmentation (if you have a 
>>>> lot of contigs or runs of NNNN in the assembly, then many genes 
>>>> will be split which results in inflated gene counts).
>>>>
>>>> Finally, manually look at the most gene dense contigs in a browser 
>>>> like Apollo or IGV (gene_density = gene_count / contig_length). If 
>>>> the most gene dense contigs are overwhelmingly single exon, then 
>>>> you may need to filter out some prokaryotic assembly contamination 
>>>> (not uncommon). If you have contamination, it will assemble as 
>>>> independent contigs, so is easily blacklisted and can be identified 
>>>> visually (always gene dense and single exon).
>>>>
>>>> Thanks,
>>>> Carson
>>>>
>>>>
>>>>
>>>>
>>>>> On Oct 26, 2016, at 7:09 AM, Mohamed Amine Chebbi 
>>>>> <mohamed.amine.chebbi at univ-poitiers.fr> wrote:
>>>>>
>>>>> Hi !
>>>>> I have tried three rounds of annotation in Maker on a non-model 
>>>>> arthropod genome (1.7 Gb), which is a hybrid assembly of PacBio and 
>>>>> Illumina reads.
>>>>> As suggested in the tutorial, in the first round I ran Maker with 
>>>>> repeat masking to generate gene models using transcript (Trinity 
>>>>> assembly) and protein (SwissProt) evidence. The Maker models were 
>>>>> then used twice in a bootstrap fashion to retrain SNAP.
>>>>> The number of genes dropped from 29207 in round 1 to 22547 in 
>>>>> round 2, then increased slightly to 22931 in round 3.
>>>>>
>>>>> However, the AED profile (attached) doesn't seem to be satisfactory, 
>>>>> so I wonder if you could suggest a good strategy to improve the 
>>>>> annotation quality. Do you think that filtering for good transcripts 
>>>>> could improve the results? If so, which criteria should be taken 
>>>>> into account?
>>>>> Thank you.
>>>>>
>>>>> Best,
>>>>> Amine
>>>>> <AED-Graph.pdf>
>>>>> _______________________________________________
>>>>> maker-devel mailing list
>>>>> maker-devel at box290.bluehost.com 
>>>>> <mailto:maker-devel at box290.bluehost.com>
>>>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>>>
>>>
>>
>> -- 
>> Mohamed Amine CHEBBI, PhD Student
>> Université de Poitiers
>> Laboratoire Ecologie et Biologie des Interactions - UMR CNRS 7267
>> Equipe Ecologie Evolution Symbiose
>> Bât. B8-B35 - 5 Rue Albert Turpin
>> TSA 51106
>> F-86022 Poitiers Cedex 9
>> FRANCE
>> Lab website: http://ecoevol.labo.univ-poitiers.fr/
>> <AED-Graph.pdf>
>

-- 
Mohamed Amine CHEBBI, PhD Student
Université de Poitiers
Laboratoire Ecologie et Biologie des Interactions - UMR CNRS 7267
Equipe Ecologie Evolution Symbiose
Bât. B8-B35 - 5 Rue Albert Turpin
TSA 51106
F-86022 Poitiers Cedex 9
FRANCE
Lab website: http://ecoevol.labo.univ-poitiers.fr/
