Overview
The BAVIECA ASR toolkit was developed by Daniel Bolaños at Boulder Learning Inc.
The BAVIECA toolkit includes a set of command-line tools that can be used to build sophisticated large-vocabulary speech recognition systems from scratch.
According to Bolaños (2012), "BAVIECA is an open-source speech recognition toolkit intended for speech research and system development. The toolkit supports lattice-based discriminative training, wide phonetic-context, efficient acoustic scoring, large n-gram language models, and the most common feature and model transformations. BAVIECA is written entirely in C++ and presents a simple and modular design with an emphasis on scalability and reusability. BAVIECA achieves competitive results in standard benchmarks. The toolkit is distributed under the highly unrestricted Apache 2.0 license, and is freely available on SourceForge".
Moreover, as stated on the official BAVIECA website (http://www.bavieca.org/tools.html#), BAVIECA "offers an Application Programming Interface (API) that exposes speech processing features such as speech recognition, speech activity detection, forced alignment, etc. This API is provided as a C++ library that can be used to create stand-alone applications that exploit BAVIECA's speech recognition features".
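As a rough illustration of what a stand-alone application built on such a C++ API might look like, the following compilable sketch stubs out a forced-alignment call. All names here (SpeechEngine, init, alignUtterance, bavieca.cfg) are hypothetical placeholders chosen for this example, not BAVIECA's documented interface.

    // Compilable sketch of a stand-alone app on a Bavieca-style C++ speech API.
    // SpeechEngine, init and alignUtterance are hypothetical placeholder names,
    // NOT Bavieca's documented interface; the engine is stubbed so this builds.
    #include <iostream>
    #include <string>
    #include <vector>

    struct WordAlignment { std::string word; float tStart, tEnd; };

    class SpeechEngine {                    // stand-in for the real API object
    public:
        bool init(const std::string& configFile) {
            (void)configFile;               // a real engine would load models here
            return true;
        }
        // Forced alignment of a transcript against audio samples.
        std::vector<WordAlignment> alignUtterance(const std::vector<short>& samples,
                                                  const std::string& transcript) {
            (void)samples; (void)transcript;
            return {};                      // stubbed: a real engine returns word times
        }
    };

    int main() {
        SpeechEngine engine;
        if (!engine.init("bavieca.cfg")) { std::cerr << "init failed\n"; return 1; }
        std::vector<short> samples;         // audio samples would be read here
        for (const WordAlignment& wa : engine.alignUtterance(samples, "hello world"))
            std::cout << wa.word << " " << wa.tStart << "-" << wa.tEnd << "\n";
        return 0;
    }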
Compared to existing open-source automatic speech recognition (ASR) toolkits, such as HTK (Young et al., 2009), CMU-Sphinx (Lee et al., 1990; Walker et al., 2004), RWTH (Rybach et al., 2009), JULIUS (Lee et al., 2001) and the more recent Kaldi (Povey et al., 2011), BAVIECA is characterized by a simple and modular design that favors scalability and reusability, a small code base, a focus on real-time performance and a highly unrestricted license.
Main Features
As illustrated on the BAVIECA web page (www.bavieca.org), the list below summarizes the main features of the toolkit.
Large vocabulary continuous speech recognition
- Dynamic search decoder with support for cross-word triphone and pentaphone HMMs
- Weighted Finite State Acceptor (WFSA) based speech decoder and efficient WFSA network builder (cross-word triphones)
- Efficient computation of emission probabilities thanks to the use of the nearest neighbor approximation, partial distance elimination and support for Single Instruction Multiple Data (SIMD) parallel computation (x86 architecture only); the first two techniques are sketched after this list
- Lattice generation (both decoders)
- Hypothesis files in NIST formats (SCLITE can be used for scoring hypotheses)
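To make the pruning ideas concrete, here is a generic sketch (not BAVIECA code) of scoring a diagonal-covariance GMM with the nearest neighbor approximation, where the mixture likelihood is approximated by its best component, and partial distance elimination, where a component is abandoned as soon as its partial score can no longer beat the current best.

    // Generic sketch of GMM acoustic scoring with the nearest neighbor
    // approximation and partial distance elimination (PDE); not Bavieca code.
    #include <cfloat>
    #include <cstddef>
    #include <vector>

    struct DiagGaussian {
        std::vector<float> mean;     // per-dimension mean
        std::vector<float> invVar;   // per-dimension inverse variance
        float logConst;              // precomputed log(weight) + normalization term
    };

    // Approximate log-likelihood of feature vector x: the score of the single
    // best component is used instead of the sum over all components.
    float scoreGMM(const std::vector<DiagGaussian>& gmm, const std::vector<float>& x) {
        float best = -FLT_MAX;
        for (const DiagGaussian& g : gmm) {
            // score = logConst - 0.5 * sum_d invVar[d] * (x[d] - mean[d])^2
            float acc = g.logConst;
            bool pruned = false;
            for (std::size_t d = 0; d < x.size(); ++d) {
                const float diff = x[d] - g.mean[d];
                acc -= 0.5f * g.invVar[d] * diff * diff;
                // PDE: every remaining term only lowers the score, so once the
                // partial score drops to the current best this component loses.
                if (acc <= best) { pruned = true; break; }
            }
            if (!pruned) best = acc;
        }
        return best;
    }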
Acoustic modeling
- Acoustic models based on continuous density Hidden Markov Models (CD-HMMs) with emission probabilities modeled using mixtures of Gaussian distributions (GMMs)
- HMM topology fixed to three states, left-to-right
- Variable number of Gaussian components per HMM-state
- No explicit modeling of transition probabilities
- Diagonal and full covariance modeling
- Cross-word context dependency modeling using triphones, pentaphones, heptaphones, etc.
- Maximum Likelihood Estimation criterion
- Discriminative training using the boosted Maximum Mutual Information (bMMI) criterion with I-smoothing and cancellation of statistics (the bMMI objective is sketched after this list)
- Parallel accumulation of sufficient statistics for both Maximum Likelihood and Discriminative Training criteria
- Linear algebra support through template classes (Matrix, Vector, etc) wrapping third party libraries (BLAS and LAPACK)
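For reference, the bMMI criterion named above is commonly written as follows (a sketch following the standard formulation of boosted MMI; the symbols are conventional and not taken from the BAVIECA sources):

\[
\mathcal{F}_{\mathrm{bMMI}}(\lambda) \;=\; \sum_{r=1}^{R} \log \frac{p_{\lambda}(O_r \mid s_r)^{\kappa}\, P(s_r)}{\sum_{s} p_{\lambda}(O_r \mid s)^{\kappa}\, P(s)\, e^{-b\, A(s,\, s_r)}}
\]

Here O_r is the r-th training utterance, s_r its reference transcription, κ the acoustic scale, b ≥ 0 the boosting factor, and A(s, s_r) a raw phone or state accuracy of hypothesis s against the reference; b = 0 recovers standard MMI. In lattice-based training the denominator sum runs over the paths of a denominator lattice.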
Language modeling
- Support for n-gram language models in ARPA and binary formats (the standard ARPA back-off recursion is sketched after this list)
- Support for any n-gram order (zerogram, unigram, bigram, trigram, four-gram, etc.)
- Language models are internally represented as Finite State Machines
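As background for the ARPA format mentioned above, the sketch below shows the standard back-off recursion such files encode: each listed n-gram carries a log-probability and, for non-final orders, a back-off weight. This is a generic illustration, not BAVIECA's internal finite-state representation; the -99.0 floor for unseen unigrams is an assumption of this example.

    // Generic sketch of back-off n-gram lookup as encoded in an ARPA file;
    // not Bavieca's internal FSM representation.
    #include <map>
    #include <string>
    #include <vector>

    struct NgramEntry { double logProb; double backoff; };
    using NgramTable = std::map<std::vector<std::string>, NgramEntry>;

    // log10 P(w | hist): if (hist, w) is listed use its probability directly,
    // otherwise apply the back-off weight of hist and retry with a shorter history.
    double logProb(const NgramTable& lm, std::vector<std::string> hist,
                   const std::string& w) {
        std::vector<std::string> key = hist;
        key.push_back(w);
        const auto it = lm.find(key);
        if (it != lm.end()) return it->second.logProb;
        if (hist.empty()) return -99.0;          // unseen unigram: floor (assumption)
        double bo = 0.0;                         // back-off weight of hist, 0 if absent
        const auto hIt = lm.find(hist);
        if (hIt != lm.end()) bo = hIt->second.backoff;
        hist.erase(hist.begin());                // drop the oldest history word
        return bo + logProb(lm, hist, w);
    }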
Speaker adaptation
- Model space Maximum Likelihood Linear Regression (MLLR) using regression trees to automatically determine the number of transforms to be used and how adaptation data is shared among transforms (the transform equations are sketched after this list)
- Feature space Maximum Likelihood Linear Regression (fMLLR)
- Vocal Tract Length Normalization (VTLN)
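As a sketch of the adaptation transforms listed above (standard formulations, not equations taken from the BAVIECA sources): model-space MLLR applies an affine transform to the Gaussian means, with the transform A_{r(m)}, b_{r(m)} shared by all Gaussians m assigned to regression-tree class r(m),

\[
\hat{\mu}_m \;=\; A_{r(m)}\,\mu_m + b_{r(m)},
\]

while fMLLR transforms the feature vectors themselves, with a Jacobian term entering the likelihood:

\[
\hat{o}_t \;=\; A\,o_t + b, \qquad \log p(\hat{o}_t \mid m) \;=\; \log \mathcal{N}\!\big(A o_t + b;\, \mu_m, \Sigma_m\big) + \log \lvert \det A \rvert .
\]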
Feature extraction
- Mel Frequency Cepstral Coefficients (MFCC) features
- Cepstral Mean Normalization (CMN) and Cepstral Mean and Variance Normalization (CMVN) at either the utterance or the session level (a CMVN sketch follows this list)
- Feature decorrelation and dimensionality reduction using Heteroscedastic Linear Discriminant Analysis (HLDA)
- Support for spliced features and third order derivatives
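Utterance-level CMVN, mentioned above, is simple enough to show directly: subtract each dimension's mean over the utterance and divide by its standard deviation. A generic sketch (not BAVIECA code):

    // Generic sketch of utterance-level cepstral mean and variance normalization.
    #include <cmath>
    #include <cstddef>
    #include <vector>

    using Frame = std::vector<float>;

    void cmvn(std::vector<Frame>& frames) {
        if (frames.empty()) return;
        const std::size_t dim = frames[0].size();
        std::vector<double> mean(dim, 0.0), sd(dim, 0.0);
        for (const Frame& f : frames)                       // per-dimension mean
            for (std::size_t d = 0; d < dim; ++d) mean[d] += f[d];
        for (std::size_t d = 0; d < dim; ++d) mean[d] /= frames.size();
        for (const Frame& f : frames)                       // per-dimension variance
            for (std::size_t d = 0; d < dim; ++d) {
                const double c = f[d] - mean[d];
                sd[d] += c * c;
            }
        for (std::size_t d = 0; d < dim; ++d)
            sd[d] = std::sqrt(sd[d] / frames.size()) + 1e-10;  // avoid division by zero
        for (Frame& f : frames)                             // normalize in place
            for (std::size_t d = 0; d < dim; ++d)
                f[d] = static_cast<float>((f[d] - mean[d]) / sd[d]);
    }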
Lattice processing and n-best list generation
- Lattice rescoring using different criteria: maximum likelihood or posterior probabilities
- Lattice word error rate (WER) computation (oracle)
- Lattice alignment and HMM-state marking
- Attach LM-scores to lattice edges according to a given language model
- Lattice-based posterior probability computation (sketched after this list)
- Confidence annotation
- Lattice path-insertion (discriminative training)
- Lattices are processed in binary format, but a text format is available for readability
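As a sketch of the posterior computation referenced in this list (the standard lattice forward-backward formulation, not equations from the BAVIECA sources), the posterior of a lattice edge e is

\[
\gamma(e) \;=\; \frac{\alpha\big(\mathrm{start}(e)\big)\; w(e)\; \beta\big(\mathrm{end}(e)\big)}{\beta(n_0)},
\]

where w(e) combines the scaled acoustic and language-model scores of e, α(n) sums the scores of all partial paths from the initial node n_0 to node n (with α(n_0) = 1), β(n) sums the scores of all paths from n to the final node, and β(n_0) is therefore the total score of all lattice paths. Word-level posteriors for confidence annotation are then obtained by summing γ(e) over edges that carry the same word in the same time region.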
Speech activity detection
- HMM-based speech activity detection
References
Bolaños D. (2012), "The Bavieca Open-Source Speech Recognition Toolkit", in Proceedings of the IEEE Workshop on Spoken Language Technology (SLT), Miami, FL, USA, December 2-5, 2012.
Young S., Evermann G., Gales M., Hain T., Kershaw D., Liu X., Moore G., Odell J., Ollason D., Povey D., Valtchev V., and Woodland P. (2009), The HTK Book (for version 3.4). Cambridge Univ. Eng. Dept., 2009.
Lee K.F., Hon H.W., and Reddy R. (1990), "An overview of the SPHINX speech recognition system", IEEE Transactions on Acoustics, Speech and Signal Processing 38(1), 35-45.
Walker W., Lamere P., Kwok P., Raj B., Singh R., Gouvea E., Wolf P., and Woelfel J. (2004), "Sphinx-4: A Flexible Open Source Framework for Speech Recognition", Sun Microsystems Inc., Technical Report SMLI TR2004-0811.
Rybach D., Gollan C., Heigold G., Hoffmeister B., Lööf J., Schlüter R., and Ney H. (2009), "The RWTH Aachen University Open Source Speech Recognition System", in Proceedings of INTERSPEECH 2009, 2111-2114.
Lee A., Kawahara T., and Shikano K. (2001). "JULIUS - an open source real-time large vocabulary recognition engine". In Proceedings of INTERSPEECH 2001, 1691-1694.
Povey D., Ghoshal A., Boulianne G., Burget L., Glembek O., Goel N., Hannemann M., Motlíček P., Qian Y., Schwarz P., Silovský J., Stemmer G., and Veselý K. (2011), "The Kaldi Speech Recognition Toolkit", in Proceedings of ASRU 2011.

For more information please contact:
Piero Cosi, Istituto di Scienze e Tecnologie della Cognizione - Sede Secondaria di Padova "ex Istituto di Fonetica e Dialettologia", CNR di Padova (e-mail: piero.cosi@pd.istc.cnr.it).
