Supplementary Data for:
Genome-wide Analysis of Human Disease Alleles Reveals That Their Locations Are Correlated in Paralogous Proteins
Mark Yandell, Barry Moore, Fidel Salas, Chris Mungall, Andrew MacBride, Charles White, Martin G. Reese
Download:
Overview:
The instructions located on this page are part of an analysis to identify SNPs in the coding regions of paralogous genes.
Data:
Chromosome reports from dbSNP: an ordered list of RefSNPs in approximate chromosome coordinates. Each report contains a great deal of information about each SNP. Data are subdivided by chromosome assignment. We use these reports as the main source of SNP data.
We used only the files for the 24 standard chromosomes (1-22, X and Y) and excluded chr_MT.txt, chr_Multi.txt, chr_NotOn.txt and chr_Un.txt. The same applies to all of the remaining file types.
The data file snp.txt was generated by concatenating all of the chromosome reports together, minus their header lines.
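The concatenation step might look like the following minimal Python sketch. It is not the script used for the analysis. The file naming pattern chr_<name>.txt is inferred from the excluded files listed above, and the number of header lines per report is left as a parameter because it is not stated here; the function name is hypothetical.

```python
from pathlib import Path

# The 24 standard human chromosomes: 1-22, X and Y.
STANDARD_CHROMOSOMES = [str(n) for n in range(1, 23)] + ["X", "Y"]


def concatenate_reports(report_dir, output_path, header_lines):
    """Append every standard chromosome report to output_path,
    skipping the first `header_lines` lines of each report.

    Assumes reports are named chr_<name>.txt (hypothetical layout).
    Returns the number of data lines written.
    """
    written = 0
    with open(output_path, "w") as out:
        for chrom in STANDARD_CHROMOSOMES:
            report = Path(report_dir) / f"chr_{chrom}.txt"
            with open(report) as handle:
                for line_number, line in enumerate(handle):
                    if line_number < header_lines:
                        continue  # drop the per-file header
                    out.write(line)
                    written += 1
    return written
```

Calling concatenate_reports with the report directory, "snp.txt" and the header length of the downloaded reports would produce a single file of data rows in chromosome order.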
Reference SNP (rs) Fasta: Flanking sequence for BLAST. Data are subdivided by chromosome location for most of the organisms that have data available in this format, while the data for some organisms are divided by general map placement. We used these files to get the allele sequences from the header lines.
Getting the alleles from the header line of the fasta files.
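As an illustration only, the alleles could be pulled out of the fasta header lines with a short Python sketch like the one below. It assumes that each header carries pipe-separated fields including rs=<number> and alleles="A/G"; that layout is an assumption about the dbSNP files and should be checked against the downloaded data. The function names are hypothetical.

```python
import re

# Assumed header fields: rs=<digits> and alleles="X/Y" (quotes may be ' or ").
RS_PATTERN = re.compile(r"\brs=(\d+)")
ALLELES_PATTERN = re.compile(r"alleles=[\"']([^\"']*)[\"']")


def parse_header(header):
    """Return (rs_id, [alleles]) for a fasta header line, or None
    if the line is not a header or lacks the expected fields."""
    if not header.startswith(">"):
        return None
    rs = RS_PATTERN.search(header)
    alleles = ALLELES_PATTERN.search(header)
    if rs is None or alleles is None:
        return None
    return "rs" + rs.group(1), alleles.group(1).split("/")


def alleles_from_fasta(lines):
    """Yield (rs_id, [alleles]) for every parsable header in a fasta file."""
    for line in lines:
        parsed = parse_header(line.rstrip("\n"))
        if parsed is not None:
            yield parsed
```

Sequence lines are skipped, so an open file handle can be passed directly to alleles_from_fasta.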
ASN1_flat Files: RefSNP docsum from ASN.1 binary in human readable flatfile format ordered by chromosome. This format is a compact report that provides a good deal of human readable information about each SNP. We used these files to get the orientation of each SNP on the chromosome.
We used a Perl script to print the data of interest to the flat file asn1_data.txt.
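The Perl script itself is not reproduced here. A rough Python equivalent of the orientation extraction is sketched below under stated assumptions: each record starts with a line whose first pipe-separated field is the rs identifier, records are separated by blank lines, and the orientation appears as orient=+ or orient=- on a line beginning with CTG. The returned columns (rs identifier, chromosome, orientation) are illustrative and may differ from the actual asn1_data.txt layout.

```python
def extract_orientations(lines):
    """Collect (rs_id, chromosome, orientation) tuples from ASN1_flat text.

    Assumed layout: a record header line such as 'rs3 | human | ...',
    followed by 'CTG | ... | chr=13 | ... | orient=-' lines, with blank
    lines between records.
    """
    results = []
    current_rs = None
    for line in lines:
        fields = [field.strip() for field in line.split("|")]
        first = fields[0]
        if first == "":
            current_rs = None  # blank line ends the current record
        elif first.startswith("rs") and first[2:].isdigit():
            current_rs = first
        elif first == "CTG" and current_rs is not None:
            # Turn 'key=value' fields into a dictionary.
            values = dict(f.split("=", 1) for f in fields[1:] if "=" in f)
            if "orient" in values:
                results.append((current_rs, values.get("chr", ""), values["orient"]))
    return results
```

Writing each tuple as a tab-separated line would give a flat file comparable in spirit to asn1_data.txt.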
Output Data
- disease_genes_alleles_locations: Location of SNPs on disease gene protein sequences.
- disease_genes.protein.fastas: Fasta file of protein sequences for disease genes.
- disease.gene.symbols.txt: Gene symbols for disease genes.
- p_snp_finder_aa.bh.disease.out: Paralogous SNPs identified by alignment of protein best hits for disease genes.
- p_snp_finder_aa.bh.out: Paralogous SNPs identified by alignment of protein best hits for all genes.
- p_snp_finder_aa.rbh.disease.out: Paralogous SNPs identified by alignment of protein reciprocal best hits for disease genes.
- p_snp_finder_aa.rbh.out: Paralogous SNPs identified by alignment of protein reciprocal best hits for all genes.
- p_snp_finder.bh.disease.out: Paralogous SNPs identified by alignment of nucleotide best hits for disease genes.
- p_snp_finder.bh.out: Paralogous SNPs identified by alignment of nucleotide best hits for all genes.
- p_snp_finder.rbh.disease.out: Paralogous SNPs identified by alignment of nucleotide reciprocal best hits for disease genes.
- p_snp_finder.rbh.out: Paralogous SNPs identified by alignment of nucleotide reciprocal best hits for all genes.