Free Model of Sentence Classifier for Automatic Extraction of Topic Sentences

M.L. Khodra, D.H. Widyantoro, E.A. Aziz, B.R. Trilaksono


This research employs a free model, one that uses only sentential features without paragraph context, to extract the topic sentences of a paragraph. To find the optimal combination of features, corpus-based classification is used to construct a sentence classifier as the model. The sentence classifier is trained using a Support Vector Machine (SVM). The experiment shows that position and meta-discourse features are more important than syntactic features for extracting topic sentences, and that the best performer (80.68%) is the SVM classifier with all features.
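As a rough illustration of what "sentential features" might look like in practice, the sketch below computes a few position, meta-discourse, and surface features for a single sentence. It is not the authors' implementation: the marker list, the feature names, and the use of token count as a stand-in for syntactic features are all assumptions made for this example, and the abstract does not specify how position is encoded.

```python
import re

# Hypothetical marker list for illustration only; the lexicon actually
# used in the study is not given in the abstract.
METADISCOURSE_MARKERS = (
    "however",
    "therefore",
    "in this paper",
    "we propose",
    "in conclusion",
)


def sentence_features(sentence, index, paragraph_length):
    """Return a dict of illustrative features for one sentence.

    index is the zero-based position of the sentence in its paragraph;
    paragraph_length is the number of sentences in that paragraph.
    No other sentence's text is inspected.
    """
    text = sentence.lower()
    tokens = re.findall(r"[a-z']+", text)

    # Count whole-phrase occurrences of each assumed marker.
    marker_count = sum(
        len(re.findall(r"\b" + re.escape(marker) + r"\b", text))
        for marker in METADISCOURSE_MARKERS
    )

    return {
        # Position features (assumed encoding).
        "is_first": 1 if index == 0 else 0,
        "is_last": 1 if index == paragraph_length - 1 else 0,
        "relative_position": index / max(paragraph_length - 1, 1),
        # Meta-discourse feature.
        "marker_count": marker_count,
        # Crude surface proxy; real syntactic features would need a parser.
        "token_count": len(tokens),
    }
```

Feature dictionaries of this kind could be converted to numeric vectors and passed to any SVM implementation for training; which features to keep would then be decided by comparing classifier performance across feature subsets.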





Contact Information:

ITB Journal Publisher, LPPM – ITB, 

Center for Research and Community Services (CRCS) Building, 7th Floor,
Jl. Ganesha No. 10 Bandung 40132, Indonesia,

Tel. +62-22-86010080,

Fax: +62-22-86010051