Using Machine Learning to Detect Unauthorized Access in Database's Log Files

Israa Jihad Abed

doi:10.37899/journallamultiapp.v5i6.1538

University of Al-Anbar, Iraq

Keywords: anomaly detection, log files, machine learning, unauthorized access

Abstract

The paper investigates the use of machine learning techniques to detect unauthorized access in database log files. Most supervised machine learning algorithms performed well at identifying normal cases but struggled to detect anomalies; the exceptions were Naïve Bayes and Random Forest, which gave mediocre results by identifying one out of twenty anomalies. Among the semi-supervised methods, Local Outlier Factor achieved an accuracy of 0.98 on normal cases and 0.7 on anomalies. One-Class Support Vector Machine achieved 0.89 on normal cases and 0.05 on anomalies, while Isolation Forest achieved 0.98 on normal cases and 0.0 on anomalies. These findings suggest that semi-supervised techniques, Local Outlier Factor in particular, may be more effective at detecting unauthorized access in database log files.
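The abstract reports separate figures for normal cases and for anomalies. A minimal sketch of why that split matters follows. It assumes that each per-class figure is the fraction of records of that class that were labelled correctly; the abstract does not state how the figures were computed. The helper `per_class_rates`, the label strings, and the count of 980 normal records are hypothetical and chosen only for illustration; the twenty anomalies and the single detection come from the abstract.

```python
def per_class_rates(y_true, y_pred):
    """Return, for each true class, the fraction of its records labelled correctly."""
    totals, hits = {}, {}
    for truth, pred in zip(y_true, y_pred):
        totals[truth] = totals.get(truth, 0) + 1
        hits[truth] = hits.get(truth, 0) + (truth == pred)
    return {label: hits[label] / totals[label] for label in totals}


# Assumed test set: 980 normal records (hypothetical count) and 20 anomalies.
y_true = ["normal"] * 980 + ["anomaly"] * 20

# A detector that flags exactly one of the twenty anomalies
# and raises no false alarms on normal records.
y_pred = ["normal"] * 980 + ["anomaly"] + ["normal"] * 19

rates = per_class_rates(y_true, y_pred)
overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(rates)    # per-class rates: normal 1.0, anomaly 0.05
print(overall)  # overall accuracy: 0.981
```

Under these assumptions the overall accuracy is 0.981 even though only one anomaly in twenty is found, which is why the per-class figures are the more informative ones for this task.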
