Supplementary Data for:

Genome-wide Analysis of Human Disease Alleles Reveals That Their Locations Are Correlated in Paralogous Proteins

Mark Yandell, Barry Moore, Fidel Salas, Chris Mungall, Andrew MacBride, Charles White, Martin G. Reese



The instructions located on this page are part of an analysis to identify SNPs in the coding regions of paralogous genes.


Chromosome reports from dbSNP: an ordered list of RefSNPs in approximate chromosome coordinates. The reports contain a great deal of information about each SNP, and the data are subdivided by chromosome assignment. We used these reports as the main source of SNP data.


We used only the 24 standard chromosome files, excluding chr_MT.txt, chr_Multi.txt, chr_NotOn.txt, and chr_Un.txt. The same restriction applies to all of the remaining file types.

The data file snp.txt was generated by concatenating all of the chromosome reports, minus their header lines.
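The concatenation step can be sketched as a small shell function. This is only an illustration, not the command the authors ran: the function name concat_reports, the chr_<N>.txt naming of the 24 standard reports (inferred from names such as chr_MT.txt), and the idea that every report starts with the same fixed number of header lines are all assumptions.

```shell
#!/bin/sh
# Sketch: build one SNP table from the 24 standard chromosome reports.
# ASSUMPTIONS (not stated in the text): reports are named chr_<N>.txt and
# each begins with the same number of header lines.
concat_reports() {
    report_dir=$1     # directory holding the chr_*.txt reports
    header_lines=$2   # header lines to drop from each report (assumed fixed)
    out_file=$3       # output table, e.g. snp.txt

    : > "$out_file"   # start from an empty output file
    for chrom in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y; do
        report="$report_dir/chr_$chrom.txt"
        if [ ! -f "$report" ]; then
            echo "missing report: $report" >&2
            return 1
        fi
        # tail -n +K prints from line K onward, which skips the header.
        tail -n "+$((header_lines + 1))" "$report" >> "$out_file"
    done
}
```

Because the loop names the 24 chromosomes explicitly, chr_MT.txt, chr_Multi.txt, chr_NotOn.txt, and chr_Un.txt are never read. The header length should be checked against the downloaded reports before the function is called.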

Reference SNP (rs) Fasta: Flanking sequence for BLAST. Data are subdivided by chromosome location for most of the organisms that have data available in this format, while the data for some organisms are divided by general map placement. We used these files to get the allele sequences from the header lines.


The alleles were extracted from the header lines of the fasta files with the following command:

grep '>' data/rs_fasta/rs_ch*.fas | perl -ane '($a, $b) = $_ =~ /rs=(\d+).*alleles="(.*)"/;print "$a\t$b\n"' > allele.txt
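To make the output of this extraction concrete, here is a more defensive variant written with awk. It is a sketch rather than the authors' command: the function name extract_alleles is hypothetical, it assumes each header carries an rs=<digits> field and a double-quoted alleles="..." field as the pattern above implies, and unlike the one-liner it skips lines that lack either field and stops the allele string at the first closing quote.

```shell
#!/bin/sh
# Sketch: read FASTA lines on stdin and write "<rs_id><TAB><alleles>" for
# each header line. ASSUMPTION: headers contain rs=<digits> and
# alleles="<string>" fields; lines lacking either field are skipped.
extract_alleles() {
    awk '
        /^>/ {
            if (!match($0, /rs=[0-9]+/)) next
            id = substr($0, RSTART + 3, RLENGTH - 3)        # drop leading rs=
            if (!match($0, /alleles="[^"]*"/)) next
            alleles = substr($0, RSTART + 9, RLENGTH - 10)  # drop alleles=" and closing quote
            print id "\t" alleles
        }
    '
}
```

A call such as cat data/rs_fasta/rs_ch*.fas | extract_alleles > allele.txt would then write one tab-separated line per header, for example an rs number followed by an allele string such as A/G.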

ASN1_flat Files: RefSNP docsum from ASN.1 binary in human-readable flatfile format, ordered by chromosome. This compact report format provides a good deal of human-readable information about each SNP. We used these files to get the orientation of each SNP on the chromosome.

We used a Perl script to print the data of interest to a flat file, asn1_data.txt.
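The Perl script itself is not reproduced here. As a rough illustration of the kind of extraction involved, the sketch below pulls an orientation value for each RefSNP out of flatfile text. It rests on assumptions about the ASN1_flat layout that the text does not spell out: that fields are separated by " | ", that each record opens with a line beginning rs<number>, and that contig placement lines begin with CTG and carry an orient= field. The function name extract_orientation is hypothetical, and the real script may have printed other fields as well.

```shell
#!/bin/sh
# Sketch: read ASN1_flat text on stdin and write "<rs_id><TAB><orient>".
# ASSUMPTIONS: " | " separates fields, records start with an rs<number>
# line, and CTG lines carry an orient= field. One output line is written
# per CTG line, so an rs number may appear more than once.
extract_orientation() {
    awk -F' [|] ' '
        /^rs[0-9]+ / {                 # record header: remember the rs number
            rs = $1
            sub(/^rs/, "", rs)
        }
        /^CTG / {                      # contig placement line
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^orient=/) {
                    orient = $i
                    sub(/^orient=/, "", orient)
                    gsub(/[ \t\r]+$/, "", orient)   # trim trailing whitespace
                    print rs "\t" orient
                }
            }
        }
    '
}
```

Only CTG lines are consulted, so orient= fields that appear on other line types in the same record are ignored.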

Output Data