Analysis of the Corpus with Naïve Bayes in Determining Sentiment Labeling

M. Arif Aulia; Muhammad Siddik Hasibuan

doi:10.37899/journallamultiapp.v5i4.1465

M. Arif Aulia, Computer Science Study Program, State Islamic University of North Sumatra, Indonesia
Muhammad Siddik Hasibuan, Computer Science Study Program, State Islamic University of North Sumatra, Indonesia

Keywords: Corpus, Naïve Bayes, Pre-Processing, NLP

Abstract

Raw, unstructured text is difficult to mine for useful insight, which motivates the use of natural language processing (NLP) techniques for text mining. This paper presents a sentiment analysis of user comments about cars posted on the microblogging platform X (formerly Twitter), using the Naïve Bayes algorithm for text classification. The steps involved are building the corpus, labeling sentiment with the InsetLexicon dictionary using weighted keywords, and pre-processing the text through cleaning, normalization, and tokenization. The Naïve Bayes algorithm then estimates the probability that a text belongs to the positive or the negative sentiment class. The results show that the “Comfortable” aspect of car reviews obtained the highest scores, with recall, precision, and F1-score of 0.83, 0.85, and 0.563, respectively; the second set consists of 87 instances, and the overall accuracy on the data set is 71%. These results validate the use of lexicon-based sentiment analysis in a specific domain while exposing the weaknesses of Naïve Bayes, particularly in handling complex word dependencies. Future studies should incorporate more advanced models and better-suited dictionaries to support sentiment analysis in rapidly changing online media.
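
The pipeline summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the tiny English lexicon, its weights, the normalization table, and the sample comments are all invented for illustration, and the choice to tokenize before looking words up in the lexicon is an assumption. The classifier is a plain multinomial Naïve Bayes with Laplace smoothing, written from scratch so that the example has no external dependencies.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical stand-in for a weighted sentiment lexicon; the words and
# weights below are invented for illustration and are not from the paper.
LEXICON = {"comfortable": 4, "smooth": 3, "spacious": 2,
           "noisy": -3, "expensive": -2, "broken": -4}

# Hypothetical normalization table mapping informal spellings to standard ones.
NORMALIZE = {"comfy": "comfortable", "pricey": "expensive"}


def preprocess(text):
    """Clean, normalize, and tokenize one comment."""
    text = text.lower()
    text = re.sub(r"https?://\S+|[@#]\w+", " ", text)  # drop URLs, mentions, hashtags
    text = re.sub(r"[^a-z\s]", " ", text)              # drop digits and punctuation
    return [NORMALIZE.get(tok, tok) for tok in text.split()]


def lexicon_label(tokens):
    """Label by the sign of the summed lexicon weights (ties count as negative)."""
    score = sum(LEXICON.get(tok, 0) for tok in tokens)
    return "positive" if score > 0 else "negative"


def train_nb(docs, labels):
    """Collect class counts, per-class word counts, and the vocabulary."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {tok for counts in word_counts.values() for tok in counts}
    return class_counts, word_counts, vocab


def predict_nb(model, tokens):
    """Return the class with the highest log posterior (Laplace smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for tok in tokens:
            if tok in vocab:  # ignore words never seen in training
                likelihood = (word_counts[label][tok] + 1) / (total_words + len(vocab))
                score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented sample comments; the lexicon supplies the training labels.
comments = [
    "The seats are so comfy and the ride is smooth! https://example.com",
    "Spacious cabin, very comfortable @dealer",
    "Engine is noisy and the parts are pricey",
    "Broken AC after 2 months, expensive repair #fail",
]
docs = [preprocess(c) for c in comments]
labels = [lexicon_label(d) for d in docs]
model = train_nb(docs, labels)
prediction = predict_nb(model, preprocess("smooth and comfortable ride"))
```

Because each word contributes an independent factor to the class score, a phrase such as "not comfortable" is scored as two unrelated words, which is one way the word-dependency weakness noted in the abstract can arise.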
