Social Identity Concept Adjustment in Hate Speech Corpus: A Computational Linguistics Approach

https://doi.org/10.5614/sostek.itbj.2025.24.2.10

Authors

  • Fauzan Novaldy Pratama, Indonesian Language and Literature Education, Universitas Pendidikan Indonesia, Bandung, Indonesia
  • Andika Dutha Bachari, Linguistics, Universitas Pendidikan Indonesia, Bandung, Indonesia
  • Zainul Muttaqin, Linguistics, Universitas Pendidikan Indonesia, Bandung, Indonesia
  • Heri Heryono, English Language, Universitas Widyatama, Bandung, Indonesia
  • Dinda Noor Azizah, English Language, Universitas Widyatama, Bandung, Indonesia

Keywords:

hate speech, computational linguistics, semantic domain, natural language processing

Abstract

Identifying hate speech requires identifying the social identity concepts it targets. This study provides an alternative corpus, annotated with text metadata and social identity categories derived from relevant laws, that is designed for use in machine learning. It addresses two questions: which social identity semantic domains are realized in the corpus, and what accuracy results are obtained from the corpus? The study adopts a mixed-methods approach, qualitative for the first question and quantitative for the second, and is situated within computational linguistics, drawing on semantic domain theory and natural language processing. The qualitative analysis shows that the corpus covers only five of the nine formulated domains and is dominated by negative (uncategorized), religion, and ethnicity. The quantitative analysis indicates that the annotation distribution of the corpus is suboptimal, despite an average accuracy above 80%. This condition limits the model's ability to generalize beyond the information in the corpus, especially for social identity categories that are not fully represented. The study differs from previous work in categorizing data according to more up-to-date legal sources. Future research could extend this work by incorporating new concepts of language use aligned with the corpus's original goal of detecting hate speech.
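
The tension the abstract describes, high average accuracy alongside a skewed annotation distribution, can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration: the label counts and predictions are invented toy values, not data or results from the study, and `rare_domain` is a hypothetical placeholder for an underrepresented social identity category.

```python
from collections import Counter


def label_distribution(labels):
    """Return the share of each label in the annotation set."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}


def accuracy(gold, pred):
    """Fraction of items whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def per_class_recall(gold, pred):
    """Recall per gold label: correct predictions divided by label support."""
    support = Counter(gold)
    correct = Counter(g for g, p in zip(gold, pred) if g == p)
    return {label: correct[label] / support[label] for label in support}


# Toy, invented annotation counts (NOT from the study): a dominant class
# and a few sparsely annotated social identity domains.
gold = (
    ["uncategorized"] * 80
    + ["religion"] * 12
    + ["ethnicity"] * 6
    + ["rare_domain"] * 2  # hypothetical underrepresented category
)

# Hypothetical classifier output: correct on the two largest classes,
# but it falls back to the majority label for the sparse ones.
pred = ["uncategorized"] * 80 + ["religion"] * 12 + ["uncategorized"] * 8

distribution = label_distribution(gold)
overall_accuracy = accuracy(gold, pred)
recall = per_class_recall(gold, pred)
macro_recall = sum(recall.values()) / len(recall)

print("label distribution:", distribution)
print("overall accuracy:", overall_accuracy)
print("per-class recall:", recall)
print("macro-averaged recall:", macro_recall)
```

In this toy setting the overall accuracy is well above 80% while recall on the sparse categories is zero, which is one way a skewed annotation distribution can hide weak generalization. Reporting per-class or macro-averaged scores next to accuracy makes that gap visible.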

Published

2025-07-21

Section

Articles