Clustering Performance on Heart Disease Data: Effects of Distance Metrics and Scaling
DOI: https://doi.org/10.47134/jtsi.v3i1.5336
Keywords: Cardiovascular Diseases, Clinical Data Analysis, Unsupervised Machine Learning
Abstract
Cardiovascular diseases (CVD) are among the leading causes of morbidity and mortality worldwide, requiring advanced analytical approaches to identify early-stage risk groups and classify patient profiles in greater detail. The aim of this study is to reveal latent patient subgroups associated with CVD by applying unsupervised machine learning methods to clinical data. To this end, a dataset consisting of 11 clinical variables from 303 patients who visited the VA Medical Center in Long Beach, California, was analyzed. During preprocessing, missing observations were removed, only numerical variables were retained, and both z-score standardization and min–max normalization were applied to the data. Subsequently, hierarchical clustering analyses were performed using single, complete, and average linkage based on Euclidean and cosine distance measures. The number of clusters for each distance–scaling combination was evaluated using the Elbow and Silhouette measures. The results showed that the 4-cluster solution, particularly under complete and average linkage, represented the data structure in the most clinically explanatory manner. The similarity between the k-means clustering results obtained with Euclidean distance on standardized data and with cosine distance on normalized data, measured by the Rand Index (RI), was 0.8179; this value indicates that the cluster structure was largely preserved despite the different distance metrics and scaling strategies. The findings demonstrate that unsupervised clustering approaches provide a useful tool for defining meaningful risk classes within heterogeneous patient populations based on clinical datasets and for conducting comparative clinical evaluations between these classes.
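The building blocks named in the abstract (the two scaling schemes, the two distance measures, and the Rand Index used to compare partitions) can be illustrated with a minimal sketch. This is not the authors' code: the function names and the toy values are hypothetical, the z-score here uses the population standard deviation as an assumption, and only the Python standard library is used.

```python
import math
from itertools import combinations


def zscore(column):
    """Z-score standardization: zero mean, unit (population) standard deviation."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / len(column))
    return [(x - mean) / std for x in column]


def minmax(column):
    """Min-max normalization: rescale values linearly into [0, 1]."""
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) for x in column]


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """Cosine distance: 1 minus the cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rand_index(labels_a, labels_b):
    """Fraction of sample pairs on which two partitions agree.

    A pair agrees when both partitions place the two samples in the same
    cluster, or both place them in different clusters.
    """
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b
        total += 1
    return agree / total


# Hypothetical toy values, not taken from the study's dataset.
ages = [40, 50, 60, 70]
print(minmax(ages))   # values rescaled into [0, 1]
print(zscore(ages))   # values centered on 0

# Two hypothetical cluster labelings of the same five patients.
labels_euclidean = [0, 0, 1, 1, 2]
labels_cosine = [1, 1, 0, 0, 0]
print(rand_index(labels_euclidean, labels_cosine))  # 8 of 10 pairs agree -> 0.8
```

Because the Rand Index only counts pairwise agreement, it does not depend on how the clusters are numbered, which is what makes it suitable for comparing partitions produced under different distance and scaling choices. Cosine distance ignores vector magnitude, so it can rank pairs of patients differently from Euclidean distance on the same scaled data.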
Copyright (c) 2026 Ibrahim Akbas, Yavuz Selim Taspinar, Murat Koklu

This work is licensed under a Creative Commons Attribution 4.0 International License.



