SKM-SNP: SNP Markers Detection Method

 Yang Liu 1, Mark Li 1, Yiu M.Cheung2, Pak C. Sham 3, Michael K. Ng 1, *.

(1) Department of Mathematics, Hong Kong Baptist University
(2) Department of Computer Science, Hong Kong Baptist University
(3) Department of Psychiatry, The University of Hong Kong

SKM-SNP, SNP markers detection program, is proposed to identify a set of relevant SNPs for the association between a disease and multiple marker genotypes. We employ a subspace categorical clustering algorithm to compute a weight for each SNP in the group of patient samples and the group of normal samples, and use the weights to identify the subsets of relevant SNPs that categorize these two groups. The experiment on a Parkinson disease data set containing genome-wide SNPs is reported to demonstrate the program.

How does SKM-SNP work

Variations (e.g., insertions, deletions, and mutations) in the DNA sequences of humans have a major impact on genetic diseases and phenotypic differences. Single nucleotide polymorphism (SNP) is one of the most common DNA sequence variation occurring when a single nucleotide - A, T, C, or G - in the genome (or other shared sequence) differs between members of a species (or between paired chromosomes in an individual). In SNPs data sets, the association between a disease and a set of relevant SNPs are investigated. Patients and normals are often categorized in groups according to their SNPs. Thousands of SNPs in different regions of chromosomes are used to describe characteristics of patient/normal samples.

One widely used SNP selection approach for candidate gene studies is based on potential impact on protein functions or gene regulations [Reich, D. et al., 2001; Daly, M. et al., 2001; Johnson, G.C. et al., 2001; Zollner, S. et al., 2000]. In this paper, we employ a subspace clustering algorithm to determine a set of relevant SNPs. In the literature, some subspace clustering methods have been proposed and studied, see [Parsons, L. et al., 2004]. The basic idea is to find clusters from subspaces of data instead of the entire data space. In subspace data clustering, each cluster is a set of objects identified by a subset of dimensions and different clusters are represented in different subsets of dimensions. The major challenge of subspace clustering, which makes it distinctive from traditional clustering, is the simultaneous determination of both cluster memberships of objects and the subspace of each cluster.

The main aim of this paper is to develop SKM-SNP, SNP markers detection program, to identify a set of relevant SNPs for the association between a disease and multiple marker genotypes. We consider that different SNPs to be categorical dimensions and they make different contributions to identification of (patient or normal) samples in clusters. The difference of contribution of a SNP (categorical dimension) is represented as a weight that can be treated as the degree of the dimension in contribution to the cluster. In subspace clustering, the decrease of the weight entropy in a cluster implies the increase of certainty of a subset of dimensions with larger weights in determination of the cluster. Therefore, in the clustering process, we simultaneously minimize the within cluster dispersion and the weight entropy to stimulate more dimensions to contribute to the identification of a cluster. A formula for computing a dimension weight is implemented to the clustering process as an additional step in each iteration, so the cluster memberships of samples and the weights of SNPs in each cluster can be obtained simultaneously.


SKM-SNP formulation

The SKM-SNP program is based on the minimization of an objective function (1):

Objective Function

Here n is the number of samples, k is the number of clusters, m is the number of SNPs, ,xj,i is the ith SNP of the jth sample, zl,i is the ith SNP of the lth center mode, δ (zl,i , xj,i) is a distance function which is equal to one if both genotypes zl,i and xj,i are the same, or equal to zero if they are different, λl,i is the weight of the ith SNP of the lth center mode, wl,j is the degree of membership of the jth sample to the lth cluster, and γ is a positive number to control the strength of the incentive for clustering on more SNPs. For detail about subspace clustering for numerical data only, we refer to [Jing,L.P. et al., 2007]. To minimize the objective function in (1), we first initialize the center modes of genotypes of SNPs and the weights of SNPs. Afterward we can start an iterative process of partitioning the samples, updating cluster centers modes of genotypes, and calculating the weights of the SNPs. The iterative loop is repeated until the objective function value does not improve. In each step, we have explicit formulae to handle the computation. The partitioning of the sample is given as follows:

Partition Updating

The center modes zl,i is set to be the genotype ai(r) if

Mode Updating

where t is an index for the number of possible combination of genotypes of the ith SNP. The dimension weights can be calculated as follows:

Dimension Weight Updating 

Here λl,i is inversely proportional to Dl,i. The smaller Dl,i, the larger λl,i, the more important the corresponding SNP. Finally, we note that the above algorithm can converges in a finite number of iterations. The complexity of the algorithm in each step depends on the number of SNPs, the number of possible combination of genotypes, the number of samples and the number of clusters.


Experimental results

The Parkinson Disease genome-wide SNPs data set downloaded from the Coriell Institute for Medical Research is used to test the algorithm. There are 271 normal (control) and 270 patient (case) samples. Table 1 shows the clustering accuracy results (correctly classified samples) for 22 chromosomes by using the SKM-SNP program. We use the most frequent genotypes in case and control groups to be the initial modes for the program. The parameter γ is tuned in each chromosome to obtain the highest accuracy in the test. We find that the average clustering accuracy of the proposed algorithm is higher than that of k-mode algorithm which is non-subspace-type [Huang, Z. et al., 1998] by 3.0%. In the table, we show the computation time of the proposed algorithm, and find that it only takes about a minute to generate the weights of the SNPs and the clustering results.

Accuracy Results

Here we use the chromosome 2 as an example to demonstrate the weights of SNPs for SNP markers detection. We first filter out those SNPs whose initial modes or final modes are the same after we run the SKM-SNP program once. The remaining number of SNPs is 1887. Then we apply SKM-SNP program again for this data set. In Table 2 and Table 3, we show the SNPs whose weights (the magnitude 10^{-4} ) rank the top ten in the case and control groups. Their corresponding percentages of genotype distributions are also shown in the table where A and a represent the major and minor alleles. The column under "M" refers to the missing percentages of genotypes in the samples. We see from the two tables that the SNPs of the top ten weights are different in the two groups. These results indicate their subspace structure of two clusters are different. Based on the weights, we can identify some relevant SNPs associated with case and control groups in a data set. Therefore we can further study these SNPs for disease-related genetic analysis.

Top Ten for Case

Top Ten for Control



Download the program

If you want to directly run this program, please click, which contains the executable program and a small example data set, that is, the 1887 SNPs of Chromosome 2 after filtering.

MAKE SURE that the program and the input files are in the same folder after uncompression. Open the Command Prompt, go to the current directory where the program and input files are located. Follow the instructions to properly run SKM-SNP. There are four input files: cc_chr2_1887.pre,, pd_chr2_1887.pre, You can input the two prefixes of the four file name, that are cc_chr2_1887 pd_chr2_1887 (MAKE SURE that the prefix of the two control files, that are cc_chr2_1887.pre and should be identical. Also do the prefix of the two case files) and a gamma value to run this program. A detailed description about the data and program can be found from the next section.

Program outputs for all the 22 chromosomes and a detailed analysis to Chromosome 2 are available here: SKM-SNP Outputs.


How to use it

1. Original Data

The original SNP data was produced by the Laboratory of Neurogenetics, of the intramural program of the National Institute on Aging, National Institutes of Health. The genotyping was performed using the Illumina Infinium I and Infinium II assays. The Original Data set has two parts: cc (Caucasian Controls) and pd (Parkinson Disease), each of them consists of two sub-parts: map file and pre file. The whole file of cc contains 271 of the actual genotypes for caucasian controls. The whole file of pd contains 270 of the genotypes for Parkinson Disease patients. Each SNP map file contains the chromosome, NCBI Build 35 position, marker name, major allele, minor allele, major allele frequency, minor allele frequency and number of missing genotypes for the marker; where the frequencies and number of missing genotypes are based upon the 271/270 individuals in the dataset. All the pre file (or genotypes file) contains the individual ID (in the format of ND-XXXX), affectation status (1 for unaffected, 2 for unaffected, can also be a class label) followed by unencoded allele calls (0 is a missing genotypes) for each genotype in the order specified within the map file. The alleles are called in forward orientation according to dbSNP.

The naming criterion (* indicates the chromosome name) is as follows: cc_*.pre, cc_*.map, pd_*.pre, pd_*.map. In our program, we use cc_*.pre and pd_*.pre as the input and we can get the marker name of corresponding SNP from .map files.

2. Data Processing

We do the data reprocessing as follows: each chromosome will be a separate file, that is to say, we combine each chromosome of cc_*.pre file and pd_*.pre file to one file. For Individual Id, we just ignore it and for Class Label, we read them out and consider them as the final class label in order to have a reference for clustering accuracy comparisions. All the remaining genotypes, we combine every two of them together to make up a SNP. For each list of SNPs, we compute the allele that appear with the most frequency and label it as the major in this particular list and all the other alleles will be the minor. For SNPs that appear with major/major, we label it as 0. For SNPs that with major/minor or minor/major, we label it as 1. For SNPs that with all the combinations of minors, we label it as 2. Missing values will be represented as 3. A small simulated example can be seen below:

cc_*.pre                 pd_*.pre
T T C C T T A G                 ND-510
T T C C A A G C                 ND-511
T A A C 0 0 T T
A A T T C A G G                                  



Step 1:
Ignore the individual ID (ND-XXXX). MAKE SURE that all the individual ID should follow the format of ND-XXXX, there should be a '-' between ND and XXXX and there is no space bewteen '-' and ND or '-' and XXXX. As we found that some of the files download from the website does not include a '-'. When reading such original files, our program will report an error.

Step 2: Use affectation status to form the class label. For these three controls, which we consider cluster 1 and two cases, which we consider cluster 2, the final class label will be:


Step 3: Combine cc_*.pre and pd_*.pre into one file, every two alleles are constructed together into one SNP, that is:


Step 4: For each list of SNP, decide a major, which appears with the most frequency. For these four SNPs, we can decide the major are T C A G respectively.

Step 5:
Transfer this allele file into a 0-3 file (there is one space after one number, but it is not a strict requirement for the last number of a line), that is:
0 0 2 1
0 0 0 1
2 2 1 0
0 1 1 0
1 1 3 2
We do our SKM-SNP program based on the above format data.


3. SKM-SNP.exe

SKM-SNP.exe is the executable program. MAKE SURE that the input files and the program are in the same folder. You will be asked to input the prefix of the two control files (they should be identical) and the prefix of the two case files (they should be identical). For example, if there are two files for control: cc_chr2_1887.pre and, and also two files for case: pd_chr2_1887.pre and, the input should be: cc_chr2_1887 pd_chr2_1887. Also, a Gamma value should be given. This value should be a >=0 real number. After these parameters been accepted by SKM-SNP, you can get two results file for these particular SNPs. One of them is Accuracy.xls, which forms the format of

Gamma Accuracy RunningTime(seconds)

the other is FinalMode&Weight.xls, which forms the format of

SNPs DimensionWeight FinalMode SNPs DimensionWeight FinalMode

A snapshot of the SKM-SNP interface can be seen below:

Figure 1. SKM-SNP interface.


4. Pay your attention

a. The executable program and the input files should be in the same folder.
b. The class labels should start from 1 and ascending with number order.
c. The prefix of the two input files for control (that are cc_*.pre and cc_*.map) should be identical. The prefix of the two input files for case (that are pd_*.pre and pd_*.map) should be identical.
d. The format of input file .pre should be strictly following: ND-** 1 A G C T...If doesnot, our program will give you a warning message.
e. As the circumscribes of memory, the number of instances should not exceed 3000 and the number of dimensions should not exceed 40000. We did such a limitation in our program and if the data file doesnot meet the requirements, there will be a warning message.



We thank the participants and the submitters for depositing samples at the NINDS Neurogenetics repository. The samples for this study are derived from the NINDS Neurogenetics repository at Coriell Cell Repositories. Access to the samples and to these data are available from the website:

Funding: Research Supported by Research Grant Council 201508 and HKBU FRGs.



Reich, D. et al. (2001) Linkage disequilibrium in the human genome. Nature, 411, 199-204.

Daly, M. et al. (2001) Highresolution haplotype structure in the human genome. Nat. Genet., 29, 229-232.

Johnson, G.C. et al. (2001) Haplotype tagging for the identification of common disease genes. Nat. Genet., 29, 233-237.

Zollner, S. et al. (2000) A coalescent approach to study linkage disequilibrium between single nucleotide polymorphisms. Am. J. Hum. Genet., 66, 615-628.

Parsons, L. et al. (2004) Subspace clustering for high dimensional data: a review. ACM SIGKDD Explorations Newsletter., 6, 90-105.

Jing,L.P. et al. (2007) An Entropy Weighting k-Means Algorithm for Subspace Clustering of High-Dimensional Sparse Data. IEEE Transactions on Knowledge and Data Engineering., 19, 1026-1041.

Huang, Z. et al. (1998) Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining Knowledge Discovery, 2, no. 3, 283-304.