Speech Activity Detection and its Evaluation in Speaker Diarization System
DOI:
https://doi.org/10.24297/ijct.v16i1.5893
Keywords:
Speaker Diarization System; Artificial Neural Network; Gaussian Mixture Model; ROC; DET
Abstract
In speaker diarization, speech (voice) activity detection (SAD) is performed to separate speech, non-speech, and silent frames. The zero crossing rate and root mean square value of the frames of audio clips have been used to select training data for the silent, speech, and non-speech models. The trained models are used by two classifiers, a Gaussian mixture model (GMM) and an artificial neural network (ANN), to classify the speech and non-speech frames of an audio clip. The results of the ANN and GMM classifiers are compared using the receiver operating characteristic (ROC) curve and the detection error tradeoff (DET) graph. It is concluded that the neural network based SAD performs comparatively better than the Gaussian mixture model based SAD.
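The two frame-level measures named in the abstract, zero crossing rate and root mean square value, can be illustrated with a minimal sketch. The function names, the frame length of 400 samples, and the hop of 160 samples are assumptions chosen for illustration; the abstract does not state the framing parameters or the rule used to pick training frames from these values.

```python
import math


def frame_signal(samples, frame_len, hop):
    """Split samples into fixed-length frames; a trailing partial frame is dropped."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ (frame needs >= 2 samples)."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def rms(frame):
    """Root mean square value of the samples in one frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def frame_features(samples, frame_len=400, hop=160):
    """Return one (zero crossing rate, RMS) pair per frame.

    frame_len and hop are illustrative defaults, not values from the paper.
    """
    return [(zero_crossing_rate(f), rms(f))
            for f in frame_signal(samples, frame_len, hop)]
```

Frames with very low RMS would typically be candidates for the silent model, while the zero crossing rate helps distinguish voiced speech from noise-like content; any thresholds on these values would have to be chosen for the data at hand.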