A Multiclass-based Classification Strategy for Rhetorical Sentence Categorization from Scientific Papers

Dwi H. Widyantoro, Masayu L. Khodra, Bambang Riyanto Trilaksono, E. Aminudin Aziz

Abstract


Rapid identification of content structures in a scientific paper is of great importance, particularly for those who actively engage in frontier research. This paper presents a multi-classifier approach that identifies such structures by classifying the rhetorical sentences in scientific papers. The approach is based on the observation that no single classifier performs best on all rhetorical categories of sentences. Therefore, our approach learns which classifiers are good at which categories, assigns each category to the classifier that handles it best, and applies only the assigned classifier when classifying a given category. This paper employs k-fold cross-validation over the training data to obtain the category-classifier mapping and then re-learns the classification model of the corresponding classifier using the full training data for that particular category. The approach has been evaluated on identifying sixteen different rhetorical categories in sentences collected from the ACL-ARC paper collection. The experimental results show that the multi-classifier approach can significantly improve classification performance over multi-label classifiers.
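
The category-classifier mapping described in the abstract can be sketched roughly as follows. This is a minimal illustration in plain Python and not the authors' implementation: the two toy learners (`NearestCentroid`, `OneNearestNeighbour`), the one-vs-rest framing, the F1 selection criterion, the fold assignment by index, and the category names `AIM` and `METHOD` are all assumptions made for the example.

```python
from statistics import mean

def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

class NearestCentroid:
    """Toy learner (assumption): positive if x is closer to the positive centroid."""
    def fit(self, X, y):
        pos = [x for x, t in zip(X, y) if t]
        neg = [x for x, t in zip(X, y) if not t]
        self.pos = [mean(c) for c in zip(*pos)] if pos else None
        self.neg = [mean(c) for c in zip(*neg)] if neg else None
        return self

    def predict(self, X):
        if self.pos is None:
            return [False] * len(X)
        if self.neg is None:
            return [True] * len(X)
        return [sq_dist(x, self.pos) < sq_dist(x, self.neg) for x in X]

class OneNearestNeighbour:
    """Toy learner (assumption): copies the label of the closest training point."""
    def fit(self, X, y):
        self.X, self.y = list(X), list(y)
        return self

    def predict(self, X):
        idx = range(len(self.X))
        return [self.y[min(idx, key=lambda i: sq_dist(x, self.X[i]))] for x in X]

def f1(y_true, y_pred):
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def cv_f1(make_clf, X, y, k):
    """Mean F1 over k folds; example i goes to fold i % k (an assumption)."""
    scores = []
    for fold in range(k):
        train = [i for i in range(len(X)) if i % k != fold]
        test = [i for i in range(len(X)) if i % k == fold]
        clf = make_clf().fit([X[i] for i in train], [y[i] for i in train])
        pred = clf.predict([X[i] for i in test])
        scores.append(f1([y[i] for i in test], pred))
    return mean(scores)

def fit_multi_classifier(X, labels, candidates, k=3):
    """Pick the best candidate per category by CV, then refit it on all data."""
    models = {}
    for cat in sorted(set(labels)):
        y = [lab == cat for lab in labels]  # one-vs-rest target for this category
        best = max(candidates, key=lambda name: cv_f1(candidates[name], X, y, k))
        models[cat] = (best, candidates[best]().fit(X, y))
    return models

def predict_categories(models, x):
    """Apply only the classifier assigned to each category."""
    return sorted(cat for cat, (_, clf) in models.items() if clf.predict([x])[0])

# Hypothetical two-feature sentence vectors and category labels.
X = [[0, 0], [5, 5], [0, 1], [5, 6], [1, 0], [6, 5], [1, 1], [6, 6]]
labels = ["AIM", "METHOD"] * 4
candidates = {"centroid": NearestCentroid, "1nn": OneNearestNeighbour}

models = fit_multi_classifier(X, labels, candidates, k=3)
for cat, (name, _) in models.items():
    print(cat, "->", name)
print(predict_categories(models, [0.2, 0.3]))
```

In this sketch, cross-validation is used only to choose a learner for each category; the model that is finally kept is refit on the full training set, mirroring the re-learning step in the abstract. Real sentence features and learners would replace the toy ones.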

DOI: http://dx.doi.org/10.5614%2Fitbj.ict.res.appl.2013.7.3.5

Contact Information:

ITB Journal Publisher, LPPM – ITB,
Center for Research and Community Services (CRCS) Building, 7th Floor,
Jl. Ganesha No. 10, Bandung 40132, Indonesia
Tel.: +62-22-86010080
Fax: +62-22-86010051
E-mail: jictra@lppm.itb.ac.id