Overview
The BAVIECA ASR toolkit was developed by Daniel Bolaños at Boulder Learning Inc.
The BAVIECA toolkit includes a set of command-line tools that can be used to build sophisticated large-vocabulary speech recognition systems from scratch.
According to Bolaños (2012), "BAVIECA is an open-source speech recognition toolkit intended for speech research and system development. The toolkit supports lattice-based discriminative training, wide phonetic-context, efficient acoustic scoring, large n-gram language models, and the most common feature and model transformations. BAVIECA is written entirely in C++ and presents a simple and modular design with an emphasis on scalability and reusability. BAVIECA achieves competitive results in standard benchmarks. The toolkit is distributed under the highly unrestricted Apache 2.0 license, and is freely available on SourceForge".
Moreover, as stated on the official BAVIECA website (http://www.bavieca.org/tools.html#), BAVIECA "offers an Application Programming Interface (API) that exposes speech processing features such as speech recognition, speech activity detection, forced alignment, etc. This API is provided as a C++ library that can be used to create stand-alone applications that exploit BAVIECA's speech recognition features".
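As a rough illustration of what a stand-alone application built on such a C++ API might look like, the following compilable sketch stubs out a forced-alignment call. All names here (SpeechEngine, init, alignUtterance, bavieca.cfg) are hypothetical placeholders chosen for this example, not BAVIECA's documented interface.

    // Compilable sketch of a stand-alone app on a Bavieca-style C++ speech API.
    // SpeechEngine, init and alignUtterance are hypothetical placeholder names,
    // NOT Bavieca's documented interface; the engine is stubbed so this builds.
    #include <iostream>
    #include <string>
    #include <vector>

    struct WordAlignment { std::string word; float tStart, tEnd; };

    class SpeechEngine {                    // stand-in for the real API object
    public:
        bool init(const std::string& configFile) {
            (void)configFile;               // a real engine would load models here
            return true;
        }
        // Forced alignment of a transcript against audio samples.
        std::vector<WordAlignment> alignUtterance(const std::vector<short>& samples,
                                                  const std::string& transcript) {
            (void)samples; (void)transcript;
            return {};                      // stubbed: a real engine returns word times
        }
    };

    int main() {
        SpeechEngine engine;
        if (!engine.init("bavieca.cfg")) { std::cerr << "init failed\n"; return 1; }
        std::vector<short> samples;         // audio samples would be read here
        for (const WordAlignment& wa : engine.alignUtterance(samples, "hello world"))
            std::cout << wa.word << " " << wa.tStart << "-" << wa.tEnd << "\n";
        return 0;
    }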
Compared to existing open-source automatic speech recognition (ASR) toolkits, such as HTK (Young et al., 2009), CMU-Sphinx (Lee et al., 1990; Walker et al., 2004), RWTH (Rybach et al., 2009), JULIUS (Lee et al., 2001) and the more recent Kaldi (Povey et al., 2011), BAVIECA is characterized by a simple and modular design that favors scalability and reusability, a small code base, a focus on real-time performance and a highly unrestricted license.
Main Features
As illustrated on the BAVIECA web page (www.bavieca.org), the list below summarizes the main features of the toolkit.
Large vocabulary continuous speech recognition
- Dynamic search decoder with support for cross-word triphone and pentaphone HMMs
- Weighted Finite State Acceptor (WFSA) based speech decoder and efficient WFSA network builder (cross-word triphones)
- Efficient computation of emission probabilities thanks to the use of the nearest neighbor approximation, partial distance elimination and support for Single Instruction Multiple Data (SIMD) parallel computation (x86 architecture only); the first two techniques are sketched after this list
- Lattice generation (both decoders)
- Hypothesis files in NIST formats (SCLITE can be used for scoring hypotheses)
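To make the pruning ideas concrete, here is a generic sketch (not BAVIECA code) of scoring a diagonal-covariance GMM with the nearest neighbor approximation, where the mixture likelihood is approximated by its best component, and partial distance elimination, where a component is abandoned as soon as its partial score can no longer beat the current best.

    // Generic sketch of GMM acoustic scoring with the nearest neighbor
    // approximation and partial distance elimination (PDE); not Bavieca code.
    #include <cfloat>
    #include <cstddef>
    #include <vector>

    struct DiagGaussian {
        std::vector<float> mean;     // per-dimension mean
        std::vector<float> invVar;   // per-dimension inverse variance
        float logConst;              // precomputed log(weight) + normalization term
    };

    // Approximate log-likelihood of feature vector x: the score of the single
    // best component is used instead of the sum over all components.
    float scoreGMM(const std::vector<DiagGaussian>& gmm, const std::vector<float>& x) {
        float best = -FLT_MAX;
        for (const DiagGaussian& g : gmm) {
            // score = logConst - 0.5 * sum_d invVar[d] * (x[d] - mean[d])^2
            float acc = g.logConst;
            bool pruned = false;
            for (std::size_t d = 0; d < x.size(); ++d) {
                const float diff = x[d] - g.mean[d];
                acc -= 0.5f * g.invVar[d] * diff * diff;
                // PDE: every remaining term only lowers the score, so once the
                // partial score drops to the current best this component loses.
                if (acc <= best) { pruned = true; break; }
            }
            if (!pruned) best = acc;
        }
        return best;
    }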
Acoustic modeling
- Acoustic models based on continuous density Hidden Markov Models (CD-HMMs) with emission probabilities modeled using mixtures of Gaussian distributions (GMMs)
- HMM topology fixed to three states, left-to-right
- Variable number of Gaussian components per HMM-state
- No explicit modeling of transition probabilities
- Diagonal and full covariance modeling
- Cross-word context dependency modeling using triphones, pentaphones, heptaphones, etc.
- Maximum Likelihood Estimation criterion
- Discriminative training using the boosted Maximum Mutual Information (bMMI) criterion with I-smoothing and cancellation of statistics (the bMMI objective is sketched after this list)
- Parallel accumulation of sufficient statistics for both Maximum Likelihood and Discriminative Training criteria
- Linear algebra support through template classes (Matrix, Vector, etc) wrapping third party libraries (BLAS and LAPACK)
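For reference, the bMMI criterion named above is commonly written as follows (a sketch following the standard formulation of boosted MMI; the symbols are conventional and not taken from the BAVIECA sources):

\[
\mathcal{F}_{\mathrm{bMMI}}(\lambda) \;=\; \sum_{r=1}^{R} \log \frac{p_{\lambda}(O_r \mid s_r)^{\kappa}\, P(s_r)}{\sum_{s} p_{\lambda}(O_r \mid s)^{\kappa}\, P(s)\, e^{-b\, A(s,\, s_r)}}
\]

Here O_r is the r-th training utterance, s_r its reference transcription, κ the acoustic scale, b ≥ 0 the boosting factor, and A(s, s_r) a raw phone or state accuracy of hypothesis s against the reference; b = 0 recovers standard MMI. In lattice-based training the denominator sum runs over the paths of a denominator lattice.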
Language modeling
- Support for n-gram language models in ARPA and binary formats (the standard ARPA back-off recursion is sketched after this list)
- Support for any n-gram order (zerogram, unigram, bigram, trigram, four-gram, etc.)
- Language models are internally represented as Finite State Machines
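As background for the ARPA format mentioned above, the sketch below shows the standard back-off recursion such files encode: each listed n-gram carries a log-probability and, for non-final orders, a back-off weight. This is a generic illustration, not BAVIECA's internal finite-state representation; the -99.0 floor for unseen unigrams is an assumption of this example.

    // Generic sketch of back-off n-gram lookup as encoded in an ARPA file;
    // not Bavieca's internal FSM representation.
    #include <map>
    #include <string>
    #include <vector>

    struct NgramEntry { double logProb; double backoff; };
    using NgramTable = std::map<std::vector<std::string>, NgramEntry>;

    // log10 P(w | hist): if (hist, w) is listed use its probability directly,
    // otherwise apply the back-off weight of hist and retry with a shorter history.
    double logProb(const NgramTable& lm, std::vector<std::string> hist,
                   const std::string& w) {
        std::vector<std::string> key = hist;
        key.push_back(w);
        const auto it = lm.find(key);
        if (it != lm.end()) return it->second.logProb;
        if (hist.empty()) return -99.0;          // unseen unigram: floor (assumption)
        double bo = 0.0;                         // back-off weight of hist, 0 if absent
        const auto hIt = lm.find(hist);
        if (hIt != lm.end()) bo = hIt->second.backoff;
        hist.erase(hist.begin());                // drop the oldest history word
        return bo + logProb(lm, hist, w);
    }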
Speaker adaptation
- Model space Maximum Likelihood Linear Regression (MLLR) using regression trees to automatically determine the number of transforms to be used and how adaptation data is shared among transforms (the transform equations are sketched after this list)
- Feature space Maximum Likelihood Linear Regression (fMLLR)
- Vocal Tract Length Normalization (VTLN)
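As a sketch of the adaptation transforms listed above (standard formulations, not equations taken from the BAVIECA sources): model-space MLLR applies an affine transform to the Gaussian means, with the transform A_{r(m)}, b_{r(m)} shared by all Gaussians m assigned to regression-tree class r(m),

\[
\hat{\mu}_m \;=\; A_{r(m)}\,\mu_m + b_{r(m)},
\]

while fMLLR transforms the feature vectors themselves, with a Jacobian term entering the likelihood:

\[
\hat{o}_t \;=\; A\,o_t + b, \qquad \log p(\hat{o}_t \mid m) \;=\; \log \mathcal{N}\!\big(A o_t + b;\, \mu_m, \Sigma_m\big) + \log \lvert \det A \rvert .
\]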
Feature extraction
- Mel Frequency Cepstral Coefficients (MFCC) features
- Cepstral Mean Normalization (CMN) and Cepstral Mean and Variance Normalization (CMVN) at either the utterance or the session level (a CMVN sketch follows this list)
- Feature decorrelation and dimensionality reduction using Heteroscedastic Linear Discriminant Analysis (HLDA)
- Support for spliced features and third order derivatives
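Utterance-level CMVN, mentioned above, is simple enough to show directly: subtract each dimension's mean over the utterance and divide by its standard deviation. A generic sketch (not BAVIECA code):

    // Generic sketch of utterance-level cepstral mean and variance normalization.
    #include <cmath>
    #include <cstddef>
    #include <vector>

    using Frame = std::vector<float>;

    void cmvn(std::vector<Frame>& frames) {
        if (frames.empty()) return;
        const std::size_t dim = frames[0].size();
        std::vector<double> mean(dim, 0.0), sd(dim, 0.0);
        for (const Frame& f : frames)                       // per-dimension mean
            for (std::size_t d = 0; d < dim; ++d) mean[d] += f[d];
        for (std::size_t d = 0; d < dim; ++d) mean[d] /= frames.size();
        for (const Frame& f : frames)                       // per-dimension variance
            for (std::size_t d = 0; d < dim; ++d) {
                const double c = f[d] - mean[d];
                sd[d] += c * c;
            }
        for (std::size_t d = 0; d < dim; ++d)
            sd[d] = std::sqrt(sd[d] / frames.size()) + 1e-10;  // avoid division by zero
        for (Frame& f : frames)                             // normalize in place
            for (std::size_t d = 0; d < dim; ++d)
                f[d] = static_cast<float>((f[d] - mean[d]) / sd[d]);
    }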
Lattice processing and n-best list generation
- Lattice rescoring using different criteria: maximum likelihood or posterior probabilities
- Lattice word error rate (WER) computation (oracle)
- Lattice alignment and HMM-state marking
- Attach LM-scores to lattice edges according to a given language model
- Lattice-based posterior probability computation (sketched after this list)
- Confidence annotation
- Lattice path-insertion (discriminative training)
- Lattices are processed in binary format, but a text format is available for readability
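As a sketch of the posterior computation referenced in this list (the standard lattice forward-backward formulation, not equations from the BAVIECA sources), the posterior of a lattice edge e is

\[
\gamma(e) \;=\; \frac{\alpha\big(\mathrm{start}(e)\big)\; w(e)\; \beta\big(\mathrm{end}(e)\big)}{\beta(n_0)},
\]

where w(e) combines the scaled acoustic and language-model scores of e, α(n) sums the scores of all partial paths from the initial node n_0 to node n (with α(n_0) = 1), β(n) sums the scores of all paths from n to the final node, and β(n_0) is therefore the total score of all lattice paths. Word-level posteriors for confidence annotation are then obtained by summing γ(e) over edges that carry the same word in the same time region.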
Speech activity detection
- HMM-based speech activity detection
References
Bolaños D. (2012), "The Bavieca Open-Source Speech Recognition Toolkit", in Proceedings of the IEEE Workshop on Spoken Language Technology (SLT), Miami, FL, USA, December 2-5, 2012.
Young S., Evermann G., Gales M., Hain T., Kershaw D., Liu X., Moore G., Odell J., Ollason D., Povey D., Valtchev V., and Woodland P. (2009), The HTK Book (for version 3.4). Cambridge Univ. Eng. Dept., 2009.
Lee K.F., Hon H.W., and Reddy R. (1990), "An overview of the SPHINX speech recognition system", IEEE Transactions on Acoustics, Speech and Signal Processing 38(1), 35-45.
Walker W., Lamere P., Kwok P., Raj B., Singh R., Gouvea E., Wolf P., and Woelfel J. (2004), "Sphinx-4: A Flexible Open Source Framework for Speech Recognition", Sun Microsystems Inc., Technical Report SMLI TR2004-0811.
Rybach D., Gollan C., Heigold G., Hoffmeister B., Lööf J., Schlüter R., and Ney H. (2009), "The RWTH Aachen University Open Source Speech Recognition System", in Proceedings of INTERSPEECH 2009, 2111-2114.
Lee A., Kawahara T., and Shikano K. (2001). "JULIUS - an open source real-time large vocabulary recognition engine". In Proceedings of INTERSPEECH 2001, 1691-1694.
Povey D., Ghoshal A., Boulianne G., Burget L., Glembek O., Goel N., Hannemann M., Motlíček P., Qian Y., Schwarz P., Silovský J., Stemmer G., and Veselý K. (2011), "The Kaldi Speech Recognition Toolkit", in Proceedings of ASRU 2011.

For more information please contact:
Piero Cosi, Istituto di Scienze e Tecnologie della Cognizione - Sede Secondaria di Padova "ex Istituto di Fonetica e Dialettologia", CNR di Padova (e-mail: piero.cosi@pd.istc.cnr.it).
