Clustering Performance on Heart Disease Data: Effects of Distance Metrics and Scaling

Authors

I. Akbas, Y. Taspinar, M. Koklu

DOI:

https://doi.org/10.47134/jtsi.v3i1.5336

Keywords:

Cardiovascular Diseases, Clinical Data Analysis, Unsupervised Machine Learning

Abstract

Cardiovascular diseases (CVD) are among the leading causes of morbidity and mortality worldwide, and they call for advanced analytical approaches that identify early-stage risk groups and classify patient profiles in greater detail. This study aims to reveal latent patient subgroups associated with CVD by applying unsupervised machine learning methods to clinical data. A dataset of 11 clinical variables from 303 patients who visited the VA Medical Center in Long Beach, California, was analyzed. During preprocessing, missing observations were removed, only numerical variables were retained, and both z-score standardization and min–max normalization were applied to the data. Hierarchical clustering was then performed with single, complete, and average linkage, based on Euclidean and cosine distance measures. The number of clusters for each distance–scaling combination was evaluated using the Elbow and Silhouette measures. The results showed that the 4-cluster solution, particularly under complete and average linkage, represented the data structure in the most clinically explanatory manner. The similarity between the k-means clustering results obtained with Euclidean distance on standardized data and with cosine distance on normalized data was Rand Index (RI) = 0.8179. This value indicates that the cluster structure was largely preserved despite the different distance metrics and scaling strategies. The findings demonstrate that unsupervised clustering approaches provide a useful tool for defining meaningful risk classes within heterogeneous patient populations based on clinical datasets and for conducting comparative clinical evaluations between these classes.
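
The quantities named in the abstract (z-score standardization, min–max normalization, Euclidean and cosine distance, and the Rand Index) can be illustrated with a minimal standard-library sketch. It is not the authors' implementation: the function names, the use of the population standard deviation, the convention for zero vectors, and all numeric values are assumptions made for illustration only.

```python
import math
from itertools import combinations
from statistics import mean, pstdev


def zscore(values):
    """Standardize one variable to mean 0 and standard deviation 1.

    Assumption: population standard deviation; the study does not say which form it used.
    """
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def minmax(values):
    """Rescale one variable linearly onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def euclidean(a, b):
    """Straight-line distance between two observations."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity; a zero vector yields 1.0 (a convention of this sketch)."""
    norm_a, norm_b = math.hypot(*a), math.hypot(*b)
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def rand_index(labels_a, labels_b):
    """Share of observation pairs on which two partitions agree."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Hypothetical values, not taken from the study's dataset.
age = [40, 50, 60, 70]

# Two hypothetical partitions of six patients, standing in for the cluster
# labels produced under two different distance/scaling settings.
partition_a = [0, 0, 1, 1, 2, 2]
partition_b = [0, 0, 1, 2, 2, 2]

print(minmax(age))
print(zscore(age))
print(rand_index(partition_a, partition_b))
```

In this toy case the two partitions agree on 12 of the 15 patient pairs, so the Rand Index is 0.8. The reported RI = 0.8179 has the same interpretation: roughly 82% of patient pairs were treated consistently (together in both solutions or apart in both) by the two k-means configurations.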

Published

2026-01-07

How to Cite

Akbas, I., Taspinar, Y., & Koklu, M. (2026). Clustering Performance on Heart Disease Data: Effects of Distance Metrics and Scaling. Journal of Technology and System Information, 3(1), 38. https://doi.org/10.47134/jtsi.v3i1.5336

Issue

Section

Articles
