Automatic Speech Recognition: A Survey

Rajan Mehla, Mamta, R.K. Aggarwal

Abstract


This paper explains the concept of Automatic Speech Recognition (ASR) from the viewpoint of pattern recognition. An ASR system can be broadly divided into two parts, the front end and the back end, which are responsible for feature extraction and acoustic modelling, respectively. The paper describes and compares the most widely used feature extraction and acoustic modelling techniques, and discusses the challenges and recent advances in the field of ASR.
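
The front-end/back-end split can be illustrated with a deliberately small sketch. It assumes a mono signal given as a list of floats sampled at 8 kHz. The front end (`front_end`, a hypothetical name) computes two simple per-frame features, log energy and zero-crossing rate. These stand in for the richer features (such as MFCC or PLP) that a practical system would use. The back end (`back_end`, also hypothetical) compares the resulting feature sequence against stored word templates using dynamic time warping (DTW). This template-matching approach is used here in place of the statistical acoustic models, such as HMMs, that are common in practice. The sketch only shows how the two stages connect and is not a working recognizer.

```python
import math
from typing import Dict, List, Sequence

Frame = List[float]  # one feature vector per analysis frame


def frame_signal(signal: Sequence[float], frame_len: int, hop: int) -> List[Frame]:
    """Split a 1-D signal into overlapping frames; trailing samples that
    do not fill a whole frame are dropped (no padding)."""
    return [list(signal[i:i + frame_len])
            for i in range(0, len(signal) - frame_len + 1, hop)]


def front_end(signal: Sequence[float], frame_len: int = 160, hop: int = 80) -> List[Frame]:
    """Toy front end: [log energy, zero-crossing rate] for each frame.
    At an assumed 8 kHz rate, 160/80 samples give 20 ms frames with a 10 ms hop."""
    feats = []
    for frame in frame_signal(signal, frame_len, hop):
        energy = sum(x * x for x in frame)
        log_e = math.log(energy + 1e-10)  # small floor avoids log(0) on silence
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
        zcr = crossings / (len(frame) - 1)
        feats.append([log_e, zcr])
    return feats


def dtw_distance(a: List[Frame], b: List[Frame]) -> float:
    """Dynamic time warping cost between two feature sequences,
    using Euclidean distance between frames as the local cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(a[i - 1], b[j - 1])
            # extend the cheapest of the three allowed predecessor paths
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


def back_end(features: List[Frame], templates: Dict[str, List[Frame]]) -> str:
    """Toy back end: return the label of the template closest under DTW."""
    return min(templates, key=lambda label: dtw_distance(features, templates[label]))
```

As a usage example, `templates = {"yes": front_end(yes_audio), "no": front_end(no_audio)}` builds two templates from hypothetical recordings. Then `back_end(front_end(test_audio), templates)` returns the label of the closest template. DTW is used here because it aligns sequences of different lengths, so a word spoken faster or slower than its template can still be matched.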

Keywords


automatic speech recognition, acoustic models, back-end, feature extractors, front-end

All Rights Reserved © 2012 IJARCSEE

This work is licensed under a Creative Commons Attribution 3.0 Unported License.