A semi-supervised clustering by λ_Cut for imputation of missing data in Type II diabetes databases

Ilango, Paramasivam; Thiagarajan, Hemalatha and Savarimuthu, Nickolas (2009) A semi-supervised clustering by λ_Cut for imputation of missing data in Type II diabetes databases. Indian Journal of Medical Informatics, 4 (1). pp. 1-1. ISSN 0973-9254

Data mining is used extensively in healthcare to build predictive models from patient data that are sound, make reliable predictions, and help physicians improve their prognosis, diagnosis, or treatment planning. Pathological data in medicine often contain missing values for various reasons. Accurate and robust methods for estimating missing data are needed, since the performance of data mining algorithms depends heavily on the quality of the dataset. This paper proposes an imputation method, Semi-Supervised Clustering by λ-Cut (λ-CUT_CLUST). In the proposed method, similar records are clustered using a weight as the similarity measure. A high degree of intra-cluster similarity is achieved by selecting only those records whose weights fall in the threshold range 0.6 ≤ λ ≤ 1. Missing values are imputed with the mean of the respective attribute within the cluster. The method is evaluated on the Pima Indian Type II Diabetes dataset, and its performance is compared with that of other imputation methods. The comparative analysis shows that the method imputes missing data with lower imputation error and produces stable results across different percentages of missing data.
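
The clustering-and-mean-imputation idea described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' λ-CUT_CLUST implementation: the abstract does not define how the similarity weight is computed, so the sketch assumes a weight of 1 minus the mean absolute difference over min-max scaled attributes that both records have observed. The function name `lambda_cut_impute`, the fallback to the column mean when no similar record is found, and the toy data are all hypothetical, and the semi-supervised aspect of the method is not modelled.

```python
import numpy as np

def lambda_cut_impute(X, lam=0.6):
    """Impute NaNs with the attribute mean of records whose weight >= lam.

    ASSUMPTION: the similarity weight is 1 - mean absolute difference on
    min-max scaled, mutually observed attributes. The paper's actual
    weight definition is not given in the abstract.
    """
    X = np.asarray(X, dtype=float)

    # Scale each attribute to [0, 1], ignoring missing entries.
    col_min = np.nanmin(X, axis=0)
    col_range = np.nanmax(X, axis=0) - col_min
    col_range[col_range == 0] = 1.0          # avoid division by zero
    scaled = (X - col_min) / col_range

    col_means = np.nanmean(X, axis=0)        # hypothetical fallback value
    imputed = X.copy()

    for i in np.where(np.isnan(X).any(axis=1))[0]:
        # Absolute differences; NaN wherever either record lacks the attribute.
        diff = np.abs(scaled - scaled[i])
        n_shared = (~np.isnan(diff)).sum(axis=1)

        weights = np.zeros(len(X))
        ok = n_shared > 0
        weights[ok] = 1.0 - np.nansum(diff[ok], axis=1) / n_shared[ok]

        # lambda-cut: keep records with lam <= weight <= 1.
        cluster = weights >= lam

        for j in np.where(np.isnan(X[i]))[0]:
            donors = X[cluster, j]
            donors = donors[~np.isnan(donors)]
            imputed[i, j] = donors.mean() if donors.size else col_means[j]

    return imputed


# Toy data (not from the Pima dataset): the third record is close to the
# first two and far from the fourth, so only the first two act as donors.
toy = np.array([
    [1.0, 10.0, 100.0],
    [1.1, 11.0, 110.0],
    [0.9,  9.0, np.nan],
    [5.0, 50.0, 500.0],
])
result = lambda_cut_impute(toy, lam=0.6)
print(result[2, 2])  # 105.0
```

Raising `lam` toward 1 admits only very similar records into the cluster, which tightens intra-cluster similarity at the cost of fewer donor records for each missing value.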

EPrint Type: Article
Uncontrolled Keywords: Missing data, λ-Cut, Cluster, Imputation Error, Type II Diabetes Dataset.
Subjects: Information Science > Data Collection
Journal Repositories > Indian Journal of Medical Informatics
Investigative Techniques > Epidemiologic Methods > Data Collection
Endocrine Diseases > Diabetes Mellitus > Diabetes Mellitus, Type II
Information Science > Medical Informatics
Health Care Quality, Access, and Evaluation > Quality of Health Care > Health Care Evaluation Mechanisms > Data Collection
Environment and Public Health > Public Health > Epidemiologic Methods > Data Collection
Natural Sciences > Science > Models, Theoretical
ID Code: 3286
Deposited By: Dr. S N Sarbadhikari
Deposited On: 08 July 2009
