Social Identity Concept Adjustment in Hate Speech Corpus: A Computational Linguistics Approach
Keywords:
hate speech, computational linguistics, semantic domain, natural language processing
Abstract
Identifying hate speech must go hand in hand with identifying social identity concepts. This study aims to provide an alternative corpus, annotated with text metadata and social identity categories derived from relevant laws, that is designed for use in machine learning. Two questions are addressed: which social identity semantic domains are realized in the corpus, and what accuracy results does the corpus yield? The study adopts a mixed-methods approach, qualitative for the first question and quantitative for the second, and is situated within computational linguistics, drawing on semantic domain theory and natural language processing. The qualitative analysis shows that the corpus realizes only five of the nine formulated domains, dominated by negative (uncategorized), religion, and ethnicity. The quantitative analysis indicates a suboptimal annotation distribution in the corpus, despite an average accuracy of over 80%. This limits the model’s ability to generalize beyond the information within the corpus, especially for social identity categories that are not fully represented. The study differs from previous work by categorizing data according to more up-to-date legal sources. Future research could extend this work by incorporating new language use concepts aligned with the corpus's original goal of detecting hate speech.
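To illustrate why a high average accuracy can coexist with a suboptimal annotation distribution, the following minimal Python sketch compares overall accuracy with per-category recall on a toy label set. Everything in it is an assumption made for illustration: the label counts, the placeholder category names `domain_d` and `domain_e`, the predictions, and the helper function names are hypothetical and are not taken from the study's corpus, model, or results.

```python
from collections import Counter


def label_distribution(labels):
    """Return the share of each label in the annotation set."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def accuracy(gold, predicted):
    """Fraction of items whose predicted label matches the gold label."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def per_class_recall(gold, predicted):
    """Fraction of each gold category that was predicted correctly."""
    totals = Counter(gold)
    hits = Counter(g for g, p in zip(gold, predicted) if g == p)
    return {label: hits[label] / totals[label] for label in totals}


# Hypothetical toy annotations: three frequent categories and two rare
# placeholder categories (domain_d, domain_e). Counts are invented.
gold = (
    ["negative"] * 60
    + ["religion"] * 25
    + ["ethnicity"] * 10
    + ["domain_d"] * 3
    + ["domain_e"] * 2
)

# Hypothetical classifier output: frequent categories are predicted
# correctly, while every rare item is assigned to "negative".
predicted = (
    ["negative"] * 60
    + ["religion"] * 25
    + ["ethnicity"] * 10
    + ["negative"] * 5
)

print(label_distribution(gold))
print(accuracy(gold, predicted))
print(per_class_recall(gold, predicted))
```

In this toy setting the overall accuracy is high even though the two rare categories are never recognized, which is the kind of gap that an average accuracy figure alone does not reveal.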
Copyright (c) 2025 Fauzan Novaldy Pratama, Andika Dutha Bachari, Zainul Muttaqin, Heri Heryono, Dinda Noor Azizah

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.