Free Model of Sentence Classifier for Automatic Extraction of Topic Sentences

M.L. Khodra; D.H. Widyantoro; E.A. Aziz; B.R. Trilaksono

doi:10.5614/itbj.ict.2011.5.1.2

Authors

M.L. Khodra, School of Electrical Engineering and Informatics, Bandung Institute of Technology, Indonesia
D.H. Widyantoro, School of Electrical Engineering and Informatics, Bandung Institute of Technology, Indonesia
E.A. Aziz, Faculty of Language and Arts Education, Indonesia University of Education, Indonesia
B.R. Trilaksono, School of Electrical Engineering and Informatics, Bandung Institute of Technology, Indonesia

Abstract

This research employs a free model, one that uses only sentential features without paragraph context, to extract the topic sentences of a paragraph. To find the optimal combination of features, corpus-based classification is used to construct a sentence classifier as the model. The sentence classifier is trained using a Support Vector Machine (SVM). The experiment shows that position and meta-discourse features are more important than syntactic features for extracting topic sentences, and that the best performer (80.68%) is the SVM classifier with all features.
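The idea of classifying each sentence from its own features can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the cue phrases, the three features, the toy paragraphs, and the small linear SVM trained by sub-gradient descent on the hinge loss. It is a stand-in, not the authors' feature set, corpus, or SVM configuration.

```python
# Hypothetical meta-discourse cue phrases; the paper's actual markers are not listed here.
META_DISCOURSE_CUES = ("in this paper", "we propose", "this section", "in summary")


def sentence_features(sentence, index, total):
    """Map one sentence to [is_first, is_last, has_cue] (illustrative features only)."""
    text = sentence.lower()
    return [
        1.0 if index == 0 else 0.0,          # position: first sentence of the paragraph
        1.0 if index == total - 1 else 0.0,  # position: last sentence of the paragraph
        1.0 if any(cue in text for cue in META_DISCOURSE_CUES) else 0.0,
    ]


def train_linear_svm(X, y, lr=0.1, lam=0.01, epochs=200):
    """Sub-gradient descent on the L2-regularised hinge loss; labels are +1 or -1."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            margin = label * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # Shrink the weights (regularisation step), then correct margin violations.
            w = [wi * (1.0 - lr * lam) for wi in w]
            if margin < 1.0:
                w = [wi + lr * label * xi for wi, xi in zip(w, x)]
                b += lr * label
    return w, b


def predict(w, b, x):
    """Return +1 (topic sentence) or -1 (other) for one feature vector."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Toy labelled paragraphs (invented): +1 marks the topic sentence, -1 the rest.
TOY_PARAGRAPHS = [
    [("Topic sentences summarise a paragraph.", 1),
     ("They often come first.", -1),
     ("Readers rely on them.", -1)],
    [("In this paper, we propose a sentence classifier.", 1),
     ("It is trained on labelled sentences.", -1)],
]


def build_dataset(paragraphs):
    """Turn labelled paragraphs into parallel lists of feature vectors and labels."""
    X, y = [], []
    for paragraph in paragraphs:
        for index, (sentence, label) in enumerate(paragraph):
            X.append(sentence_features(sentence, index, len(paragraph)))
            y.append(label)
    return X, y
```

Calling `build_dataset(TOY_PARAGRAPHS)` and passing the result to `train_linear_svm` yields weights that `predict` can apply to any new feature vector. Dropping one feature group at a time and retraining is one simple way to compare the contribution of feature groups, in the spirit of the comparison the abstract reports.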
