Improving the Performance of Low-resourced Speaker Identification with Data Preprocessing

Authors

  • Win Lai Lai Phyu Natural Language and Speech Processing Lab., University of Computer Studies, Yangon. No. (4) Main Road, Yangon, 11411, Myanmar
  • Hay Mar Soe Naing Natural Language and Speech Processing Lab., University of Computer Studies, Yangon. No. (4) Main Road, Yangon, 11411, Myanmar
  • Win Pa Pa Natural Language and Speech Processing Lab., University of Computer Studies, Yangon. No. (4) Main Road, Yangon, 11411, Myanmar

DOI:

https://doi.org/10.5614/itbj.ict.res.appl.2023.17.3.1

Keywords:

Burmese speech dataset, data scrutiny, Mel-frequency cepstral coefficients (MFCCs), multilingual speaker identification, time delay neural network (TDNN)

Abstract

Automatic speaker identification helps to tackle everyday security problems. Speech data collection is an essential but very challenging task for under-resourced languages like Burmese. Speech quality is crucial for accurately recognizing a speaker's identity. This work attempted to find the optimal speech quality for Burmese tone to enhance identification, compared with other better-resourced languages, based on Mel-frequency cepstral coefficients (MFCCs). A Burmese speech dataset was created as part of this work because no appropriate dataset was available. To achieve better performance, we preprocessed the recordings to the quality best suited not only to Burmese tone but also to nine other Asian languages, enabling multilingual speaker identification. The performance of the preprocessed data was evaluated against the original data using a time delay neural network (TDNN) together with a subsampling technique that reduces the time complexity of model training. The experiments were conducted and analyzed on speech datasets of ten Asian languages to reveal the effectiveness of the data preprocessing. The preprocessed dataset outperformed the original dataset in terms of equal error rate (EER).
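Since performance is reported as equal error rate (EER), the metric can be illustrated with a minimal sketch: the EER is the operating point at which the false acceptance rate equals the false rejection rate. The function name and toy scores below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def equal_error_rate(genuine, impostor):
    """Estimate the EER by sweeping a decision threshold over all scores.

    genuine:  similarity scores for same-speaker trials (higher = accept)
    impostor: similarity scores for different-speaker trials
    """
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    # false acceptance rate: impostor trials scoring at or above the threshold
    far = np.array([(impostor >= t).mean() for t in thresholds])
    # false rejection rate: genuine trials scoring below the threshold
    frr = np.array([(genuine < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))  # threshold where the two rates cross
    return (far[idx] + frr[idx]) / 2.0

# toy example: well-separated score distributions yield an EER of 0
genuine = np.array([0.9, 0.8, 0.85, 0.7, 0.95])
impostor = np.array([0.1, 0.2, 0.3, 0.4, 0.15])
eer = equal_error_rate(genuine, impostor)
```

In practice the scores come from the TDNN speaker model's verification trials; a lower EER indicates better separation of genuine and impostor trials.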


References

Phyu, W.L.L. & Pa, W.P., Building Speaker Identification Dataset for Noisy Conditions, The 18th International Conference on Computer Applications, ICCA IEEE 2020, Yangon, Myanmar, pp. 182-188, 27-28 Feb 2020.

Haizhou, L. & Ti, A.A., ASEAN Language Speech Translation through U-STAR, https://www.nict.go.jp/en/asean_ivo/lde9n2000000selb-att/lde9n2000000sesr.pdf, ASEAN IVO Forum 2019, Manila, Philippines, 21 Nov 2019.

Ko, T., Peddinti, V., Povey, D. & Khudanpur, S., Audio Augmentation for Speech Recognition, Proceedings of INTERSPEECH2015, pp. 3586-3589, 2015.

Beigi, H., Speaker Recognition: Advancements and Challenges, in New Trends and Developments in Biometrics, 28 Nov 2012. https://www.intechopen.com/books/3120.

Li, A., Zheng, C. & Li, X., Glance and Gaze: A Collaborative Learning Framework for Single-Channel Speech Enhancement, Applied Acoustics, 187, 108499, 1 Feb 2022.

Lemercier, J.M., Richter, J., Welker, S. & Gerkmann, T., StoRM: A Diffusion-Based Stochastic Regeneration Model for Speech Enhancement and Dereverberation, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31, pp. 2724-2737, 2023.

Naing, H.M.S., Hidayat, R., Hartanto, R. & Miyanaga, Y., A Front-End Technique for Automatic Noisy Speech Recognition, The 23rd International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques, Oriental COCOSDA 2020, Yangon, Myanmar, Nov, 2020.

Imam, S.A., Bansal, P. & Singh, V., Review: Speaker Recognition Using Automated Systems, AGU International Journal of Engineering & Technology, AGUIJET 2017, 5, pp. 31-39, Jul-Dec, 2017.

Mezghani, E., Charfeddine, M., Nicolas, H. & Ben Amar, C., Speaker Gender Identification Based on Majority Vote Classifiers, Proceedings of SPIE: International Conference on Machine Vision (ICMV2016), Nice, France, 17 Mar 2017, pp. 47-51, 2017.

Ali, Y.M., Emilia, N., Nor Fadzilah, M., Siti Zubaidah Md, S., Mohd Hanapiah, A. & Chee Chin, L., Speech-based Gender Recognition Using Linear Prediction and Mel-Frequency Cepstral Coefficients, IJEECS, 28(2), pp. 753-761, 1 Nov 2022.

Snyder, D., Garcia-Romero, D. & Povey, D., Time Delay Deep Neural Network-Based Universal Background Models for Speaker Recognition, IEEE Workshop on Automatic Speech Recognition and Understanding, IEEE ASRU, Scottsdale, AZ, pp. 92-97, 1 Dec 2015.

Waibel, A., Hanazawa, T., Hinton, G., Shikano, K. & Lang, K., Phoneme Recognition Using Time Delay Neural Networks, IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(3), pp. 328-339, Mar, 1989.

Peddinti, V., Povey, D. & Khudanpur, S., A Time Delay Neural Network Architecture for Efficient Modeling of Long Temporal Contexts, Proceedings of Interspeech, pp. 3214-3218, 2015.

Park, H., Lee, D., Lim, M., Kang, Y., Oh, J. & Kim, J.H., A Fast-Converged Acoustic Modeling for Korean Speech Recognition: A Preliminary Study on Time Delay Neural Network, 11 July 2018, Retrieved from https://arxiv.org/abs/1807.05855.

Ge, Z., Sharma, S.R. & Smith, M.J.T., PCA/LDA Approach for Text-Independent Speaker Recognition, Proceedings of Society of Photo-Optical Instrumentation Engineers, SPIE 8401, Independent Component Analyses, Compressive Sampling, Wavelets, Neural Net, Biosystems, and Nanoengineering X, 23-27 April 2012.

Chakroun, R. & Frikha, M., A Deep Learning Approach for Text-Independent Speaker Recognition with Short Utterances, Multimedia Tools and Applications, pp. 1-23, Mar, 2023.

Rajan, P., Afanasyev, A., Hautamaki, V. & Kinnunen, T., From Single to Multiple Enrollment I-Vectors: Practical PLDA Scoring Variants for Speaker Verification, Journal of Digital Signal Processing, 31, pp. 93-101, 1 Aug 2014.

Ahilan, K., Vogt, R., Dean, D. & Sridharan, S., PLDA Based Speaker Recognition on Short Utterances, The Speaker and Language Recognition Workshop, Odyssey 2012, Singapore, pp. 28-33, 25-28 June 2012.

Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., Silovsky, J., Stemmer, G. & Vesely, K., The Kaldi Speech Recognition Toolkit, IEEE 2011 Workshop on Automatic Speech Recognition and Understanding (ASRU2011), 2011.

Cheng, J.M. & Wang, H.C., A Method of Estimating the Equal Error Rate for Automatic Speaker Identification, International Symposium on Chinese Spoken Language Processing (ISCSLP 2004), Hong Kong, pp. 285-288, Symposium conducted at The Chinese University of Hong Kong, 15-18 Dec 2004.

Phyu, W.L.L. & Pa, W.P., Text Independent Speaker Identification for Myanmar Speech, The 11th International Conference on Future Computer and Communications, ICFCC 2019, Yangon, Myanmar, pp. 86-89, 27-28 Feb 2019.


Published

2023-12-31

How to Cite

Phyu, W. L. L., Naing, H. M. S., & Pa, W. P. (2023). Improving the Performance of Low-resourced Speaker Identification with Data Preprocessing. Journal of ICT Research and Applications, 17(3), 275-291. https://doi.org/10.5614/itbj.ict.res.appl.2023.17.3.1

Section

Articles