ASR - Automatic Speech Recognition

Sphinx

Overview
Main Features
References

Overview

The SPHINX system is an open-source project that provides a complete set of functions for developing complex Automatic Speech Recognition systems. The software has been developed by Carnegie Mellon University in Pittsburgh. It includes both an acoustic trainer and various decoders for text recognition, phoneme recognition, N-best list generation, and more.

Main Features

SPHINX training is an iterative sequence of alignments and acoustic model (AM) estimations. It starts from an audio segmentation aligned to the training-data transcriptions and estimates a rough initial AM from it. This model is the starting point for the subsequent loops of Baum-Welch estimation of the probability density functions and transcription alignment.
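The alternation between alignment and model estimation can be illustrated with a deliberately simplified sketch. It is not the SPHINX trainer: it assumes one-dimensional features, a single Gaussian per transcript unit, and hard (Viterbi-style) forced alignment in place of the soft counts that Baum-Welch would accumulate. All function names are hypothetical.

```python
import math

NEG_INF = float("-inf")

def gaussian_logpdf(x, mean, var):
    """Log-density of a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def uniform_segmentation(n_frames, n_units):
    """Initial alignment: split the frames evenly among the transcript units."""
    return [t * n_units // n_frames for t in range(n_frames)]

def estimate(frames, alignment, n_units, var_floor=1e-3):
    """Estimate one (mean, variance) pair per unit from its aligned frames."""
    models = []
    for u in range(n_units):
        xs = [x for x, a in zip(frames, alignment) if a == u]
        mean = sum(xs) / len(xs)
        var = max(sum((x - mean) ** 2 for x in xs) / len(xs), var_floor)
        models.append((mean, var))
    return models

def realign(frames, models):
    """Forced alignment: best monotonic assignment of frames to units.

    Assumes there are at least as many frames as units.
    """
    n, m = len(frames), len(models)
    score = [[NEG_INF] * m for _ in range(n)]
    back = [[0] * m for _ in range(n)]
    score[0][0] = gaussian_logpdf(frames[0], *models[0])
    for t in range(1, n):
        for u in range(m):
            stay = score[t - 1][u]
            move = score[t - 1][u - 1] if u > 0 else NEG_INF
            best = max(stay, move)
            if best > NEG_INF:
                score[t][u] = best + gaussian_logpdf(frames[t], *models[u])
                back[t][u] = u if stay >= move else u - 1
    # Trace back from the last unit at the last frame.
    u = m - 1
    path = [u]
    for t in range(n - 1, 0, -1):
        u = back[t][u]
        path.append(u)
    path.reverse()
    return path

def train(frames, n_units, iterations=5):
    """Rough first model from a uniform segmentation, then align/estimate loops."""
    alignment = uniform_segmentation(len(frames), n_units)
    models = estimate(frames, alignment, n_units)
    for _ in range(iterations):
        alignment = realign(frames, models)
        models = estimate(frames, alignment, n_units)
    return models, alignment
```

For example, `train([0.1, -0.1, 0.0, 0.2, 5.0, 5.1, 4.9, 5.2, 5.0, 4.8], n_units=2)` starts from an even 5/5 split and, after realignment, moves the boundary so that the fifth frame joins the second unit.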

Models can be computed either for each phoneme on its own (Context Independent, CI) or taking the phoneme's context into account (Context Dependent, CD). SPHINX acoustic models are trained on MFCC + Δ + Δ² feature vectors.
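The Δ terms are first- and second-order time derivatives of the MFCC stream, appended to each frame. The sketch below uses the common regression formula over a window of ±2 frames; the window size, the edge handling, and the function names are assumptions made for illustration, and the exact computation inside SPHINX may differ.

```python
def deltas(frames, window=2):
    """Regression-based delta coefficients over a +/- `window` frame context."""
    n = len(frames)
    dim = len(frames[0])
    denom = 2 * sum(k * k for k in range(1, window + 1))
    out = []
    for t in range(n):
        row = []
        for d in range(dim):
            acc = 0.0
            for k in range(1, window + 1):
                # Clamp indices at the utterance edges (edge frames are repeated).
                nxt = frames[min(t + k, n - 1)][d]
                prv = frames[max(t - k, 0)][d]
                acc += k * (nxt - prv)
            row.append(acc / denom)
        out.append(row)
    return out

def add_dynamic_features(mfcc):
    """Append delta and delta-delta coefficients to every static frame."""
    d1 = deltas(mfcc)
    d2 = deltas(d1)
    return [m + a + b for m, a, b in zip(mfcc, d1, d2)]
```

With 13 static coefficients per frame, `add_dynamic_features` returns 39-dimensional vectors (13 static + 13 Δ + 13 Δ²).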

There is a single training process, but different versions of the recognizer can be used in the decoding step.

Among the available versions, SPHINX-3 in particular is a C-based, state-of-the-art, large-vocabulary, continuous-model ASR system. It is limited to 3- or 5-state left-to-right HMM topologies and to a bigram or trigram language model. The decoder is based on the conventional Viterbi search algorithm with beam search heuristics. It also uses a lexical-tree search structure in order to prune the state transitions. Like the other systems, it produces a single best recognition result (or hypothesis) for each utterance processed, in the form of a linear word sequence.
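As a rough illustration of Viterbi search with beam pruning over a left-to-right HMM, the sketch below decodes a single toy model. It is not the SPHINX-3 decoder: there is no lexical tree or language model, the emission log-likelihoods are supplied as a precomputed table, and the function and parameter names are hypothetical.

```python
import math

NEG_INF = float("-inf")

def viterbi_beam(log_emit, log_self, log_next, beam):
    """Viterbi decoding over a left-to-right HMM with beam pruning.

    log_emit[t][s]: log-likelihood of frame t in state s.
    log_self[s], log_next[s]: log-probability of staying in s or moving to s + 1.
    beam: states scoring more than `beam` below the frame's best are pruned.
    Returns (log score, state path); the path starts in the first state and
    must end in the last one. Returns (-inf, []) if pruning removes every path.
    """
    n_states = len(log_self)
    score = [NEG_INF] * n_states
    score[0] = log_emit[0][0]
    backptrs = []
    for t in range(1, len(log_emit)):
        best = max(score)
        new = [NEG_INF] * n_states
        bp = [-1] * n_states
        for s in range(n_states):
            if score[s] < best - beam:
                continue  # outside the beam: do not expand this state
            for nxt, trans in ((s, log_self[s]), (s + 1, log_next[s])):
                if nxt >= n_states:
                    continue  # the last state has no successor
                cand = score[s] + trans + log_emit[t][nxt]
                if cand > new[nxt]:
                    new[nxt] = cand
                    bp[nxt] = s
        score = new
        backptrs.append(bp)
    state = n_states - 1
    if score[state] == NEG_INF:
        return NEG_INF, []
    # Follow the back-pointers from the final state to recover the path.
    path = [state]
    for bp in reversed(backptrs):
        state = bp[state]
        path.append(state)
    path.reverse()
    return score[n_states - 1], path
```

A narrow beam expands fewer states per frame at the risk of discarding the path that would have turned out best; a wide beam approaches exhaustive Viterbi search.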

The SPHINX system is available at https://cmusphinx.github.io/.

References

Lee, K.F., Hon, H.W., Reddy, R. (1990). "An overview of the SPHINX speech recognition system". IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. 38, no. 1, pp. 35-45.

Cosi, P., Hosom, J.P. (2000). "High Performance 'General Purpose' Phonetic Recognition for Italian". Proc. of ICSLP 2000, International Conference on Spoken Language Processing, vol. II, pp. 527-530. Beijing, China.

For more information, please contact:

Piero Cosi

Istituto di Scienze e Tecnologie della Cognizione - Sede Secondaria di Padova "ex Istituto di Fonetica e Dialettologia";
CNR di Padova (e-mail: piero.cosi@pd.istc.cnr.it).