Open Access: Closed

Statistical Significance of Clustering for High-Dimension, Low–Sample Size Data

Yufeng Liu, D. Neil Hayes, Andrew B. Nobel, J. S. Marron

Cluster analysis

Sample size determination

Data mining

2008

Abstract

Clustering methods provide a powerful tool for the exploratory analysis of high-dimension, low–sample size (HDLSS) data sets, such as gene expression microarray data. A fundamental statistical issue in clustering is which clusters are “really there,” as opposed to being artifacts of the natural sampling variation. We propose SigClust as a simple and natural approach to this fundamental statistical problem. In particular, we define a cluster as data coming from a single Gaussian distribution and formulate the problem of assessing statistical significance of clustering as a testing procedure. This Gaussian null assumption allows direct formulation of p values that effectively quantify the significance of a given clustering. HDLSS covariance estimation for SigClust is achieved by a combination of invariance principles, together with a factor analysis model. The properties of SigClust are studied. Simulated examples, as well as an application to a real cancer microarray data set, show that the proposed method works remarkably well for assessing significance of clustering. Some theoretical results also are obtained.

KEY WORDS: Clustering; High-dimension, low–sample size data; k-means; Microarray gene expression data; p value; Statistical significance
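
The testing idea in the abstract can be illustrated with a minimal Monte Carlo sketch. This is a simplified illustration under stated assumptions, not the authors' implementation: it uses the 2-means cluster index (within-cluster sum of squares divided by total sum of squares) as the test statistic, simulates the single-Gaussian null with a diagonal covariance built from the plain sample eigenvalues (relying on the statistic's invariance to shifts and rotations), and omits the factor-analysis step the abstract mentions for HDLSS covariance estimation. The function names, the add-one empirical p value, and all parameters are hypothetical choices made for this example.

```python
import numpy as np


def two_means_cluster_index(x, rng, n_init=5, n_iter=50):
    """Smallest 2-means cluster index found over a few random restarts.

    Cluster index = within-cluster sum of squares / total sum of squares.
    Values near 0 suggest two tight, well-separated groups.
    """
    n = x.shape[0]
    total_ss = np.sum((x - x.mean(axis=0)) ** 2)
    best = 1.0
    for _ in range(n_init):
        # Start from two distinct data points (plain Lloyd iterations).
        centers = x[rng.choice(n, size=2, replace=False)]
        for _ in range(n_iter):
            dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dist.argmin(axis=1)
            if labels.min() == labels.max():
                break  # degenerate split: one cluster is empty
            new_centers = np.array([x[labels == k].mean(axis=0) for k in (0, 1)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        if labels.min() == labels.max():
            continue  # skip restarts that ended with an empty cluster
        within_ss = sum(
            np.sum((x[labels == k] - x[labels == k].mean(axis=0)) ** 2)
            for k in (0, 1)
        )
        best = min(best, within_ss / total_ss)
    return best


def sigclust_pvalue(x, n_sim=200, seed=0):
    """Monte Carlo p value for 'x comes from a single Gaussian' (sketch).

    Assumption: the null covariance uses raw sample eigenvalues, which is
    a simplification of the paper's HDLSS covariance estimation.
    Requires at least 2 features (columns).
    """
    rng = np.random.default_rng(seed)
    n, d = x.shape
    observed = two_means_cluster_index(x, rng)
    eigvals = np.clip(np.linalg.eigvalsh(np.cov(x, rowvar=False)), 0.0, None)
    scale = np.sqrt(eigvals)
    null = np.array([
        two_means_cluster_index(rng.standard_normal((n, d)) * scale, rng)
        for _ in range(n_sim)
    ])
    # Add-one empirical p value: share of null indices at or below the observed one.
    p_value = (1 + np.sum(null <= observed)) / (1 + n_sim)
    return p_value, observed


if __name__ == "__main__":
    demo_rng = np.random.default_rng(1)
    # Two shifted Gaussian groups, 20 samples each, 10 features.
    data = np.vstack([
        demo_rng.standard_normal((20, 10)) + 3.0,
        demo_rng.standard_normal((20, 10)) - 3.0,
    ])
    p, ci = sigclust_pvalue(data, n_sim=100)
    print(f"cluster index = {ci:.3f}, p value = {p:.4f}")
```

A small p value here indicates that the observed 2-means split is tighter than splits typically found in data drawn from one Gaussian with matching spread. With raw sample eigenvalues the null can be poorly calibrated when the dimension far exceeds the sample size, which is the situation the covariance estimation described in the abstract is meant to address.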

“Statistical Significance of Clustering for High-Dimension, Low–Sample Size Data” is a paper by Yufeng Liu, D. Neil Hayes, Andrew B. Nobel, and J. S. Marron, published in 2008. It has an Open Access status of “closed”.