[maker-devel] Too many genes?

Mon Oct 17 18:09:52 MDT 2016

It sounds like your repeat masking is probably sufficient. Perhaps just the change of removing SNAP this time will give you what you want.

—Carson

> On Oct 17, 2016, at 5:13 PM, Annabel Beichman <annabel.beichman at gmail.com> wrote:
> 
> Thank you so much for all these suggestions, Carson! I will give them a try, particularly dropping SNAP as it definitely doesn’t show great concordance compared to Augustus. 
> 
> Do you have any additional recommendations for improving my repeat masking? I have already made a custom repeat library in repeatmodeler following this tutorial: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic and have model_org=all and repeat_protein=/home/opt/maker/data/te_proteins.fasta
> 
> My interproscan results have ~73% of my total genes (including genes with high AED scores) with Pfam domains, so it at least seems like I’m on the right track.
> 
> Thanks so much again,
> 
> ~ Annabel
> 
> 
>> On Oct 17, 2016, at 1:25 PM, Carson Holt <carsonhh at gmail.com> wrote:
>> 
>> Better training and repeat masking will result in fewer false positive gene calls. Depending on how many contigs there are in the genome, you may also get gene fragmentation (genes split across contigs or genes split due to short runs of NNNNN within a contig). Fragmented genes tend to lack start or stop codons. Finally pick a few of the contigs with the highest gene density and look at them in a browser. If one of the gene predictors you are using (SNAP or Augustus) does not have good concordance with the models, you may want to drop the predictor (sometimes a predictor does not work well on a particular genome for one reason or another - SNAP tends to have issues with mammalian genomes for example). Also when looking at the contig, if you see contig consisting of only single exon genes then you may have some prokaryotic contamination (they assemble as independent gene dense contigs - so a good thing to look at if gene counts are high). Finally high gene counts can mean that repeats are still under masked (repeats encode real proteins like transposases).
>> 
>> You can also scan all resulting models with InterProScan to see what fraction contain identifiable protein domains (a well annotated genome will have ~75-85% of genes with an InterPro domain).
>> 
>> —Carson
>> 
>> 
>> 
>>> On Oct 17, 2016, at 1:20 PM, Annabel Beichman <annabel.beichman at gmail.com> wrote:
>>> 
>>> Hi Carson et al., 
>>> 
>>> Thanks so much for such a great pipeline, tutorials and advice pages.
>>> 
>>> I have just finished four rounds of annotation in Maker on the sea otter genome which we assembled using Meraculous shotgun assembly + Dovetail Genomics HiRise scaffolding. 
>>> 
>>> Rounds I & II: In the first two rounds, I trained Augustus and Snap on 400 scaffolds > 500kb using mRNA-seq data assembled in Trinity, and protein data from Ensembl for ferret, dog and cat. 
>>> 
>>> Round III: Then, using the trained gene predictors (Augustus showed spec/sens > 90%), I annotated all scaffolds >50kb.
>>> 
>>> Round IV: Based on reading emails in this group, I then decided to make a custom repeat library, and re-run maker one last time using my trained gene predictors, custom repeat library, and 1200 scaffolds >15kb. 
>>> 
>>> I found my number of genes dropping each round, as you suggest they should (47465 after Round I, 27289 after round II, 25847 after round III, and 25031 after round IV). 
>>> 
>>> However, this final gene count (25,031) still seems to high too me, and I was wondering if you had some advice for filtering? Using BUSCO, our assembly is 78% complete, and the final annotation is 72% complete. However, I am getting 25,000+ annotated genes; 22,000+ of which are below an AED and eAED cutoff of 0.5. This seems like far too many genes for a mammal genome that is only ~75% complete. I would have expected to get something more like 15-20,000 genes. 
>>> 
>>> 22870 of the Maker-annotated proteins have BLAST hits to SwissProt/UniProt (e value 1e-03), but only 13,000 annotated proteins have orthologs in the ferret, the otter’s closest relative (e value 1e-05 using ProteinOrtho). 900 genes do not have any BLAST hits in SwissProt/UniProt, but have AED/eAED scores of 0.00 – when I visualize them in Jbrowse they have a Trinity read as evidence, but nothing else. Could these be Trinity artefacts? I also notice that my SNAP tracts are very long (some almost as long as the whole scaffold). 
>>> 
>>> I am designing an exome-capture array based on this annotation, and so am trying to filter the gene models to have a set of genes that we can be fairly confident in, but also trying not to miss real gene models. Could you please advise me on how to filter down the gene models, or what might be happening to cause the excess of genes? The most conservative gene list would be the 13,000 genes that are ferret orthologs. But I would like to salvage more genes if possible, if you can suggest a way to parse out real genes from among the ones that do not have ferret orthologs, but do have Blast hits to SwissProt? Would you recommend any additional filters on gene length, etc.?
>>> 
>>> 
>>> Not sure if this is significant, but one thing I’ve noticed is that many of the genes with Blast hits in SwissProt but no ferret orthologs often have several similar genes in a row along the same scaffold:
>>> ScbS9RH_101185 30796 38760 + ELUT_00004195-RA ELUT_00004195 Name=ELUT_00004195-RA 0.08 0.17 Similar to ANO3: Anoctamin-3 (Homo sapiens)
>>> ScbS9RH_101185 42617 51087 + ELUT_00004196-RA ELUT_00004196 Name=ELUT_00004196-RA 0.25 0.26 Similar to ANO3: Anoctamin-3 (Homo sapiens)
>>> ScbS9RH_101185 87006 87827 + ELUT_00004198-RA ELUT_00004198 Name=ELUT_00004198-RA 0.18 0.18 Similar to Ano3: Anoctamin-3 (Mus musculus)
>>> ScbS9RH_101185 110043 122523 + ELUT_00004199-RA ELUT_00004199 Name=ELUT_00004199-RA 0.09 0.09 Similar to ANO3: Anoctamin-3 (Homo sapiens)
>>> 
>>> Thank you all so much for your help and advice!
>>> 
>>> [I also want to report an odd behavior, that may be specific to our server – when the number of scaffolds being annotated using maker drops below the number of cores (e.g. usning openmpi with 45 cores available, but there are only 44 scaffolds left), maker crashes. I then have to restart it with fewer cores, and it will crash again once the number of remaining scaffolds drops below the new lower number of cores. This makes finishing a run of Maker a bit like Zeno’s paradox, where it gets very slow for the last two days of the run due to the stopping and restarting.]
>>> 
>>> Best wishes,
>>> Annabel Beichman
>>> Wayne Lab/Lohmueller Lab
>>> Ecology & Evolutionary Biology
>>> UCLA
>>> Annabelbeichman.com
>>> _______________________________________________
>>> maker-devel mailing list
>>> maker-devel at box290.bluehost.com
>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>> 
>