Credit Card Fraud Detection: A Comparative Study of Distance Metrics in Machine Learning

Vaishnav Menon; R Jai Akash; Niveditta Batra

Authors

Vaishnav Menon, Department of Data Sciences and Analytics, Ramaiah University of Applied Sciences, Bengaluru, 560054, Karnataka, India. https://orcid.org/0009-0000-5747-9583
R Jai Akash, Department of Data Sciences and Analytics, Ramaiah University of Applied Sciences, Bengaluru, 560054, Karnataka, India. https://orcid.org/0009-0000-5747-9583
Niveditta Batra, Department of CSE-IT, Jaypee Institute of Information Technology, Noida, 201304, Uttar Pradesh, India. https://orcid.org/0009-0000-5747-9583

Keywords:

Fraud detection, unsupervised learning, k-means clustering, distance metrics, class imbalance, credit card fraud detection

Abstract

Background: Credit card fraud detection is a critical problem due to the increasing volume of online transactions and the high costs associated with fraudulent activities. Previous studies in this field have investigated various machine-learning techniques to identify fraudulent transactions, with notable progress made through supervised learning methods. However, these models often face challenges due to the significant class imbalance in fraud detection datasets, where instances of fraud are much less frequent than legitimate transactions.

Purpose: As a result, there is growing interest in unsupervised techniques, such as clustering algorithms, which do not depend on labeled data and may offer improved generalization to new and unseen fraud patterns. These unsupervised approaches can autonomously identify anomalies by grouping transactions based on shared characteristics, making them a valuable alternative for detecting evolving fraudulent activities.

Methods: This work explores different distance metrics in clustering algorithms such as K-Means to identify fraudulent activity in a credit card dataset. The European credit card transactions dataset used is highly imbalanced, with fraudulent transactions making up only 0.17% of the records. Multiple sampling techniques are used to address this class imbalance.
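
To make the role of the distance metric concrete, the following is a minimal sketch of K-Means in which the assignment step can use Euclidean, Manhattan, or Minkowski distance. It is an illustration under stated assumptions, not the study's implementation: the function names `pairwise_distance` and `kmeans`, the choice of these three metrics, the Minkowski order `p=3`, and the random initialization are all hypothetical. Note also that the mean-based centroid update is only the exact minimizer for Euclidean distance; it is kept for the other metrics here purely for simplicity.

```python
import numpy as np

def pairwise_distance(X, C, metric="euclidean", p=3):
    """Distance from every row of X (n, d) to every row of C (k, d)."""
    diff = np.abs(X[:, None, :] - C[None, :, :])  # shape (n, k, d)
    if metric == "euclidean":
        return np.sqrt((diff ** 2).sum(axis=2))
    if metric == "manhattan":
        return diff.sum(axis=2)
    if metric == "minkowski":
        # p is an assumed order; p=1 is Manhattan, p=2 is Euclidean
        return (diff ** p).sum(axis=2) ** (1.0 / p)
    raise ValueError(f"unknown metric: {metric}")

def kmeans(X, k, metric="euclidean", n_iter=100, seed=0):
    """Lloyd-style K-Means with a selectable assignment metric."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Assumed initialization: k distinct data points chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = pairwise_distance(X, centroids, metric).argmin(axis=1)
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # leave an empty cluster's centroid as is
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment against the final centroids
    labels = pairwise_distance(X, centroids, metric).argmin(axis=1)
    return labels, centroids
```

Calling `kmeans(X, 2, metric="manhattan")` on a feature matrix `X` returns one cluster label per transaction plus the centroids, so the same data can be clustered under each metric and the resulting groupings compared.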

Results: Of the distance metrics compared, Euclidean distance produced the best results when applied to the K-Means algorithm. The study underscores the importance of addressing class imbalance and of using unsupervised methods for fraud detection in practical settings.
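
Class imbalance of this kind is commonly handled by resampling. The abstract does not say which sampling techniques the study used, so the sketch below shows only one generic option, random undersampling of the majority class, as an assumption. The function name `random_undersample`, the `ratio` parameter, and the convention that label 1 marks fraud are all hypothetical.

```python
import numpy as np

def random_undersample(X, y, ratio=1.0, seed=0):
    """Keep every minority row and a random subset of majority rows.

    ratio is the desired majority-to-minority count after sampling
    (assumed convention: y == 1 is fraud, y == 0 is legitimate).
    """
    X = np.asarray(X)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == 1)
    majority_idx = np.flatnonzero(y == 0)
    # Never ask for more majority rows than exist
    n_keep = min(len(majority_idx), int(round(ratio * len(minority_idx))))
    kept_majority = rng.choice(majority_idx, size=n_keep, replace=False)
    idx = np.concatenate([minority_idx, kept_majority])
    rng.shuffle(idx)  # avoid leaving all fraud rows at the front
    return X[idx], y[idx]

# Synthetic illustration only: 17 fraud rows in 10,000 gives the 0.17% rate
y_demo = np.zeros(10_000, dtype=int)
y_demo[:17] = 1
X_demo = np.arange(10_000).reshape(-1, 1)
X_bal, y_bal = random_undersample(X_demo, y_demo, ratio=1.0)
```

With `ratio=1.0` the balanced sample contains as many legitimate rows as fraud rows. Larger ratios keep more of the majority class at the cost of a less balanced sample.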

Conclusions: Future research could improve fraud detection systems, particularly by developing enhanced algorithms and expanding data availability.

Academic society: Graduate Journal of Interdisciplinary Research, Reports & Reviews
Publisher: Vyom Hans Journals
