Word Embedding for Rhetorical Sentence Categorization on Scientific Articles

Ghoziyah Haitan Rachman; Masayu Leylia Khodra; Dwi Hendratmo Widyantoro

doi:10.5614/itbj.ict.res.appl.2018.12.2.5

Authors

Ghoziyah Haitan Rachman, School of Electrical Engineering and Informatics, Bandung Institute of Technology, Jalan Ganesa No. 10, Bandung 40132
Masayu Leylia Khodra, School of Electrical Engineering and Informatics, Bandung Institute of Technology, Jalan Ganesa No. 10, Bandung 40132
Dwi Hendratmo Widyantoro, School of Electrical Engineering and Informatics, Bandung Institute of Technology, Jalan Ganesa No. 10, Bandung 40132

Keywords:

GloVe, rhetorical sentence categorization, scientific article, word embedding, Word2Vec.

Abstract

Summarization of scientific articles commonly relies on the rhetorical structure of their sentences. Determining the rhetorical category of a sentence is itself a text categorization task, and several text categorization studies have used word embedding to improve performance. This paper presents rhetorical sentence categorization of scientific articles using word embedding to capture semantically similar words, and compares Word2Vec with GloVe. First, two experiments were evaluated using five classifiers: Naïve Bayes, Linear SVM, IBk, J48, and Maximum Entropy. The best classifier from these two experiments was then employed. The results showed that Word2Vec CBOW performed better than Skip-Gram and GloVe. The best experimental result was obtained with Word2Vec CBOW using 20,155 resource papers from ACL-ARC, combined with Teufel's features and the previous-label feature. In this experiment, Linear SVM achieved the highest F-measure, 43.44%.
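As a rough illustration of how word embeddings and a previous-label feature might be combined into one sentence representation, consider the minimal Python sketch below. The abstract does not say how word vectors are aggregated per sentence, so averaging is an assumption here. The tiny embedding table, the label set, and all function names are hypothetical, standing in for trained Word2Vec or GloVe vectors and for the actual rhetorical categories.

```python
from typing import Dict, List, Optional, Sequence

# Toy 3-dimensional vectors standing in for trained Word2Vec/GloVe embeddings
# (hypothetical values chosen only for illustration).
EMBEDDINGS: Dict[str, List[float]] = {
    "we": [0.5, 0.25, 0.0],
    "propose": [1.0, 0.25, 0.5],
    "method": [0.0, 0.25, 1.0],
}
DIM = 3

# Illustrative label set; not necessarily the categories used in the paper.
LABELS: List[str] = ["AIM", "OWN", "BKG"]


def sentence_vector(tokens: Sequence[str],
                    embeddings: Dict[str, List[float]],
                    dim: int) -> List[float]:
    """Average the vectors of in-vocabulary tokens (assumed aggregation).

    Returns a zero vector when no token is in the vocabulary.
    """
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def previous_label_feature(prev_label: Optional[str],
                           labels: Sequence[str]) -> List[float]:
    """One-hot encode the previous sentence's label.

    The result is all zeros when there is no previous sentence.
    """
    return [1.0 if prev_label == label else 0.0 for label in labels]


def build_features(tokens: Sequence[str],
                   prev_label: Optional[str]) -> List[float]:
    """Concatenate the averaged embedding with the previous-label one-hot."""
    return (sentence_vector(tokens, EMBEDDINGS, DIM)
            + previous_label_feature(prev_label, LABELS))


features = build_features(["we", "propose", "a", "method"], "BKG")
print(features)
```

A feature vector built this way could then be passed to any of the classifiers named in the abstract, such as a linear SVM. The out-of-vocabulary token "a" is simply skipped in this sketch.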
