Software:
WHAM
WHole-genome Alignment Metrics (WHAM) is a structural variant (SV) caller that integrates several sources of mapping information to identify SVs. WHAM classifies SVs using a flexible and extendable machine-learning algorithm (random forest). WHAM is not only accurate at identifying SVs, but its association test can identify shared SVs enriched in a cohort of diseased individuals compared to a background of healthy individuals.
pVAAST
High-throughput sequencing of related individuals has become an important tool for studying human disease. However, owing to technical complexity and lack of available tools, most pedigree-based sequencing studies rely on an ad hoc combination of suboptimal analyses. Pedigree-VAAST (pVAAST) is a disease-gene identification tool designed for high-throughput sequence data in pedigrees.
PHEVOR
Phevor integrates phenotype, gene function, and disease information with personal genomic data for improved power to identify disease-causing alleles. Phevor works by combining knowledge resident in multiple biomedical ontologies with the outputs of variant prioritization tools. It does so using an algorithm that propagates information across and between ontologies. This process enables Phevor to accurately reprioritize potentially damaging alleles identified by variant prioritization tools in light of gene function, disease, and phenotype knowledge.
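The propagation idea described above can be illustrated with a toy example. This is a minimal sketch and not Phevor's actual algorithm: the halving of a score per edge crossed, the tiny ontology, the gene names, and the functions `propagate` and `rerank` are all assumptions made for illustration.

```python
from collections import deque

def propagate(edges, seeds, decay=0.5):
    """Spread a score of 1.0 from each seed term across the ontology graph.

    Assumption: the score is multiplied by `decay` for every edge crossed,
    and contributions from different seeds are summed.
    """
    neighbours = {}
    for parent, child in edges:
        neighbours.setdefault(parent, set()).add(child)
        neighbours.setdefault(child, set()).add(parent)

    scores = {}
    for seed in seeds:
        # Breadth-first search gives the number of edges from the seed to each term.
        distance = {seed: 0}
        queue = deque([seed])
        while queue:
            node = queue.popleft()
            for nxt in neighbours.get(node, ()):
                if nxt not in distance:
                    distance[nxt] = distance[node] + 1
                    queue.append(nxt)
        for node, d in distance.items():
            scores[node] = scores.get(node, 0.0) + decay ** d
    return scores

def rerank(variant_scores, gene_terms, term_scores):
    """Order genes by variant score weighted by their best ontology term score."""
    combined = {}
    for gene, v_score in variant_scores.items():
        best = max((term_scores.get(t, 0.0) for t in gene_terms.get(gene, ())),
                   default=0.0)
        combined[gene] = v_score * best
    return sorted(combined, key=combined.get, reverse=True)

# Hypothetical ontology: root -> A -> A1, root -> B
edges = [("root", "A"), ("root", "B"), ("A", "A1")]
term_scores = propagate(edges, seeds=["A1"])

# Hypothetical output of a variant prioritization tool, plus gene annotations.
variant_scores = {"G1": 0.9, "G2": 0.8}
gene_terms = {"G1": ["B"], "G2": ["A1"]}
ranking = rerank(variant_scores, gene_terms, term_scores)
```

In this toy case G2 moves ahead of G1 even though its variant score is lower, because it is annotated to the seed term itself while G1 is annotated to a term three edges away.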
GPAT
The application of population genomics to non-model organisms is greatly facilitated by the low cost of next-generation sequencing (NGS). Barriers, however, exist for using NGS data for population-level analyses. Traditional population genetic metrics, such as Fst, are not robust to the genotyping errors inherent in noisy NGS data. Additionally, many older software tools were never designed to handle the volume of data produced by NGS pipelines. To overcome these limitations we have developed a flexible software library designed specifically for large and noisy NGS datasets. The Genotype Phenotype Association Toolkit (GPAT) implements both traditional and novel population genetic methods in a single user-friendly framework. GPAT consists of a suite of compiled tools and a Perl API that programmers can use to develop new applications. To date GPAT has been used successfully to identify genotype-phenotype associations in several real-world datasets, including domestic pigeons, pox virus, and pine rust fungus. GPAT is open source and freely available for academic use.
GPA++ is a C++ extension of the Genotype Phenotype Association Toolkit. The Perl implementation of GPAT has more features than GPA++ but is slower.
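For reference, the traditional Fst metric mentioned above can be sketched for a single biallelic site. This is a minimal illustration and not GPAT's implementation or API: it assumes hard diploid genotype calls coded as 0, 1, or 2 copies of the alternate allele and two populations weighted equally, and the function names are hypothetical. Because it trusts the called genotypes completely, it also shows why such a metric is sensitive to genotyping errors.

```python
def allele_freq(genotypes):
    """Alternate-allele frequency from diploid genotypes coded 0, 1, or 2."""
    return sum(genotypes) / (2 * len(genotypes))

def fst(pop_a, pop_b):
    """Fst = (Ht - Hs) / Ht for one biallelic site and two populations.

    Assumption: both populations are weighted equally, regardless of sample size.
    """
    p_a, p_b = allele_freq(pop_a), allele_freq(pop_b)
    p_bar = (p_a + p_b) / 2
    h_t = 2 * p_bar * (1 - p_bar)          # expected heterozygosity, pooled
    if h_t == 0:
        return 0.0                          # site is monomorphic overall
    h_s = (2 * p_a * (1 - p_a) + 2 * p_b * (1 - p_b)) / 2  # mean within-population
    return (h_t - h_s) / h_t

fixed_difference = fst([0, 0, 0, 0], [2, 2, 2, 2])
no_difference = fst([0, 1, 2, 1], [0, 1, 2, 1])
```

A single miscalled genotype changes `allele_freq` directly, so with small samples the estimate can shift noticeably.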
ImagePlane
ImagePlane is Python-based image analysis software designed for the automated analysis of images of the animal S. mediterranea. This software allows the animal's neoblasts to be quantified and tested for asymmetries along its vertical and horizontal axes. ImagePlane also allows simple morphology categorizations to be made based on the overall shape of the animal.
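One simple way to test for an asymmetry of this kind is to count cells on each side of an axis and apply an exact binomial test. The sketch below is an assumption for illustration only and does not reflect ImagePlane's actual method or interface: the coordinates, the midline position, and the function names are hypothetical.

```python
from math import comb

def binomial_two_sided(k, n):
    """Exact two-sided binomial p-value for k successes in n trials with p = 0.5."""
    tail = min(k, n - k)
    p_value = 2 * sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    return min(1.0, p_value)

def asymmetry(points, midline, axis=0):
    """Count points on each side of `midline` along `axis` (0 = x, 1 = y).

    Points lying exactly on the midline are ignored.
    Returns (count_below, count_above, p_value).
    """
    below = sum(1 for p in points if p[axis] < midline)
    above = sum(1 for p in points if p[axis] > midline)
    return below, above, binomial_two_sided(below, below + above)

# Hypothetical neoblast centroids (x, y) in pixels.
cells = [(1, 0), (2, 0), (3, 0), (7, 0)]
left, right, p_value = asymmetry(cells, midline=5, axis=0)
```

Passing `axis=1` with a horizontal midline applies the same test to the other axis.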
VAAST 2
VAAST 2 (the Variant Annotation, Analysis & Search Tool) is a probabilistic search tool for identifying damaged genes and their disease-causing variants in personal genome sequences. VAAST 2 builds upon existing phylogenetic conservation, amino acid substitution (AAS), and aggregative approaches to variant prioritization, combining elements of all into a single unified likelihood framework that allows users to identify damaged genes and deleterious variants with greater accuracy, and in an easy-to-use fashion. VAAST 2 can score both coding (SNV, indel, and splice site) and non-coding (SNV) variants, evaluating the cumulative impact of both types of variants simultaneously. VAAST 2 can identify rare variants causing rare genetic diseases, and it can also use both rare and common variants to identify genes responsible for common diseases. VAAST 2 thus has a much greater scope of use than any existing methodology.
MAKER
MAKER is a portable and easily configurable genome annotation pipeline. Its purpose is to allow smaller eukaryotic and prokaryotic genome projects to independently annotate their genomes and to create genome databases. MAKER identifies repeats, aligns ESTs and proteins to a genome, produces ab initio gene predictions, and automatically synthesizes these data into gene annotations having evidence-based quality values. MAKER is also easily trainable: outputs of preliminary runs can be used to automatically retrain its gene prediction algorithm, producing higher quality gene models on subsequent runs. MAKER's inputs are minimal and its outputs can be directly loaded into a GMOD database. They can also be viewed in the Apollo genome browser; this feature of MAKER provides an easy means to annotate, view, and edit individual contigs and BACs without the overhead of a database. MAKER should prove especially useful for emerging model organism projects with minimal bioinformatics expertise and computer resources.
MAKER-P
Sequencing diverse plant species of evolutionary, agricultural, and medicinal interest is becoming routine for even small groups; genome annotation and analysis are much less so. The MAKER-P pipeline is designed to make the annotation of novel plant genomes tractable for small groups with limited bioinformatics experience and resources, and faster and more transparent for large groups with more experience and resources. The MAKER-P pipeline generates species-specific repeat libraries, as well as structural annotations of protein-coding genes, non-coding RNAs, and pseudogenes.
RepeatRunner
RepeatRunner is a CGL-based program that integrates RepeatMasker with BLASTX to provide a comprehensive means of identifying repetitive elements. Because RepeatMasker identifies repeats by means of similarity to a nucleotide library of known repeats, it often fails to identify highly divergent repeats and divergent portions of repeats, especially near repeat edges. To remedy this problem, RepeatRunner uses BLASTX to search a database of repeat-encoded proteins (reverse transcriptases, gag, env, etc.). Because protein homologies can be detected across larger phylogenetic distances than nucleotide similarities, this BLASTX search allows RepeatRunner to identify divergent protein-coding portions of retro-elements and retro-viruses not detected by RepeatMasker. RepeatRunner merges its BLASTX and RepeatMasker results to produce a single, comprehensive XML-based output. It also masks the input sequence appropriately. In practice RepeatRunner has been shown to greatly improve the efficacy of repeat identification. RepeatRunner can also be used in conjunction with PILER-DF, a program designed to identify novel repeats, and RepeatMasker to produce a comprehensive system for repeat identification, characterization, and masking in newly sequenced genomes.
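The merge-and-mask step described above can be sketched as an interval union followed by masking. This is a minimal illustration under stated assumptions, not RepeatRunner's implementation: hits are taken to be 0-based, half-open (start, end) coordinates on the input sequence, the XML output is omitted, and the function names and example hits are hypothetical.

```python
def merge_intervals(*hit_lists):
    """Union of (start, end) hits from several sources, merging any that overlap or touch."""
    all_hits = sorted(hit for hit_list in hit_lists for hit in hit_list)
    merged = []
    for start, end in all_hits:
        if merged and start <= merged[-1][1]:
            # Overlaps or abuts the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]

def mask(sequence, intervals, mask_char="N"):
    """Replace every base inside the given half-open intervals with `mask_char`."""
    bases = list(sequence)
    for start, end in intervals:
        for i in range(start, min(end, len(bases))):
            bases[i] = mask_char
    return "".join(bases)

# Hypothetical hits from the two searches on a 16-base sequence.
repeatmasker_hits = [(2, 6)]
blastx_hits = [(5, 9), (12, 14)]

repeats = merge_intervals(repeatmasker_hits, blastx_hits)
masked = mask("ACGTACGTACGTACGT", repeats)
```

Here the BLASTX hit at (5, 9) extends the RepeatMasker hit at (2, 6), which mirrors the case in which a protein-level match recovers a divergent repeat edge that the nucleotide search missed.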
CGL
CGL is a software library designed to facilitate the use of genome annotations as substrates for computation and experimentation; we call it "CGL", an acronym for Comparative Genomics Library, and pronounce it "Seagull". The purpose of CGL is to provide an informatics infrastructure for a laboratory, department, or research institute engaged in the large-scale analysis of genomes and their annotations.