New Grapheme Generation Rules for Two-Stage Model-based Grapheme-to-Phoneme Conversion


  • Seng Kheang, Toyohashi University of Technology
  • Kouichi Katsurada, Toyohashi University of Technology
  • Yurie Iribe, Aichi Prefectural University
  • Tsuneo Nitta, Waseda University



The precise conversion of arbitrary text into its corresponding phoneme sequence (grapheme-to-phoneme, or G2P, conversion) is used in speech synthesis and recognition, pronunciation learning software, spoken term detection, and spoken document retrieval systems. Because the quality of this module strongly affects the performance of such systems, and because many problems with G2P conversion have been reported, we propose a novel two-stage model-based approach, implemented using an existing weighted finite-state transducer-based G2P conversion framework, to improve the performance of the G2P conversion model. The first-stage model automatically converts words to phonemes, while the second-stage model uses the input graphemes together with the output phonemes obtained from the first stage to determine the best final output phoneme sequence. Additionally, we designed new grapheme generation rules that add extra detail to the vowel and consonant graphemes appearing within a word. Compared with previous approaches, the evaluation results indicate that our approach, using rules that focus on the vowel graphemes, slightly improved accuracy on the out-of-vocabulary dataset and consistently increased accuracy on the in-vocabulary dataset.
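The two-stage data flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: the paper builds both stages on a WFST-based G2P framework, whereas here plain lookup tables stand in for the trained models. The tables, the `grapheme:phoneme` pairing format, the three-symbol context window, the phoneme labels, and all function names are assumptions made purely for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for the first-stage model: one preliminary
# phoneme per grapheme, with no context taken into account.
STAGE1_TABLE: Dict[str, str] = {"m": "M", "a": "AE", "t": "T", "e": "EH"}

# Hypothetical stand-in for the second-stage model: it sees a window of
# (current, next, next-next) grapheme:phoneme pairs and may revise the
# current phoneme. An empty string means the grapheme is silent.
STAGE2_TABLE: Dict[Tuple[str, str, str], str] = {
    ("a:AE", "t:T", "e:EH"): "EY",  # 'a' before consonant + final 'e'
    ("e:EH", "#", "#"): "",         # word-final 'e' is silent
}


def stage1(word: str) -> List[str]:
    """First stage: map each grapheme to a preliminary phoneme ('_' if unknown)."""
    return [STAGE1_TABLE.get(g, "_") for g in word]


def stage2(word: str, phonemes: List[str]) -> List[str]:
    """Second stage: combine input graphemes with stage-1 phonemes and refine."""
    # Pair every grapheme with its stage-1 phoneme; pad with '#' at the word end.
    pairs = [f"{g}:{p}" for g, p in zip(word, phonemes)] + ["#", "#"]
    final: List[str] = []
    for i, preliminary in enumerate(phonemes):
        window = (pairs[i], pairs[i + 1], pairs[i + 2])
        # Fall back to the stage-1 phoneme when no refinement is known.
        refined = STAGE2_TABLE.get(window, preliminary)
        if refined:  # skip silent graphemes
            final.append(refined)
    return final


def g2p(word: str) -> List[str]:
    """Run both stages in sequence."""
    return stage2(word, stage1(word))


if __name__ == "__main__":
    print(g2p("mat"))
    print(g2p("mate"))
```

In this toy setup the second stage can correct a first-stage output only because it sees the grapheme and its preliminary phoneme together with right-hand context; a real system would learn such refinements from aligned training data rather than from a hand-written table.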

