Overview
Julius ASR has been developed as research software for Japanese
LVCSR since 1997. The work was continued under the IPA Japanese dictation toolkit project (1997-2000),
the Continuous Speech Recognition Consortium, Japan (CSRC, 2000-2003),
and currently the Interactive Speech Technology Consortium (ISTC).
The Open-Source Large Vocabulary Continuous Speech Recognition Engine Julius (Lee et al., 2001)
is a high-performance ASR decoder for researchers and developers, designed for
real-time decoding and modularity.
Julius has proven to be very easy to incorporate into integrated systems,
because its decoder API is well designed and the core engine is a separate C library. Moreover,
Julius has low system requirements, a small memory footprint, and high-speed decoding, and it
can swap language models at run-time: all these features are crucial in a real-time integrated
system handling several components. Its configuration is modular (i.e., each configuration
file can embed another one covering only one particular aspect of the configuration).
Most of the features available in other state-of-the-art decoders are also available for
Julius, including major search techniques such as tree lexicon, N-gram factoring, cross-word
context dependency handling, enveloped beam search, Gaussian pruning, Gaussian selection,
etc. Finally, Julius integrates a GMM-based and an energy-based VAD.
Main Features
Recognition output
Julius can produce speech recognition output as an n-best list
(i.e., the set of the n most probable sentences) or a lattice. A lattice (or Word Graph in Julius
terminology) is an acyclic ordered graph in which nodes represent words and edges represent
transitions weighted by acoustic and language model scores. ASR hypotheses
can be expressed as a Word Graph, which is a more powerful tool than an n-best list for Spoken
Language Understanding (SLU), since lattices provide a wider set of hypotheses from which
to choose and a more accurate representation of the hypothesis space.
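To make the idea concrete, the following is a minimal sketch (not the Julius API or its lattice file format) of a word graph as a directed acyclic graph whose edges carry words and combined acoustic + language model log-scores, with a dynamic-programming extraction of the single best path; all node numbers, words, and scores are invented for demonstration:

```python
# Illustrative word-graph sketch: edges map node -> [(next_node, word, log_score)].
# Node numbering is assumed to be a topological order of the DAG.

def best_path(edges, start, end):
    """Return (best_log_score, word_sequence) from start to end."""
    best = {start: (0.0, [])}          # node -> (best score so far, words)
    for node in sorted(edges):         # topological order by construction
        if node not in best:
            continue
        score, words = best[node]
        for nxt, word, s in edges[node]:
            cand = (score + s, words + [word])
            if nxt not in best or cand[0] > best[nxt][0]:
                best[nxt] = cand
    return best[end]

# A tiny lattice: two competing word hypotheses at each position.
lattice = {
    0: [(1, "turn", -1.0), (1, "burn", -2.5)],
    1: [(2, "left", -0.8), (2, "lift", -1.9)],
    2: [],
}
score, words = best_path(lattice, 0, 2)
print(words)   # -> ['turn', 'left']
```

An n-best list can be read off the same graph by keeping the k best partial hypotheses per node instead of one, which is why the lattice is the more general representation.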
Multi-model recognition
Julius supports multi-model recognition, as explained in (Lee, 2010). This
means that n > 1 configuration instances can be loaded, and the Julius engine will output n
results for a single audio input at the same time.
To enable multi-model recognition, multiple search instances must be declared. A search
instance is defined within a Julius configuration file and links to an AM (acoustic model) and
an LM (language model), with custom recognition parameters. Every ASR result comes from a
single search instance.
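Concretely, such declarations might look like the following schematic jconf fragment, modeled on the Julius-4 `-AM`/`-LM`/`-SR` declaration style described in the Juliusbook; all instance names and model file names below are placeholders, and the exact option set depends on the models actually used:

```
# Schematic Julius configuration (jconf) with two search instances.

-AM am_main             # declare an acoustic model instance
-h hmmdefs              # HMM definitions (placeholder file name)
-hlist tiedlist         # HMM name mapping (placeholder file name)

-LM lm_commands         # first language model: a command grammar
-gram commands          # grammar prefix (commands.dfa / commands.dict)

-LM lm_greetings        # second language model: a "greetings" grammar
-gram greetings

-SR sr_commands  am_main lm_commands    # search instance 1
-SR sr_greetings am_main lm_greetings   # search instance 2
```

With this setup, each audio input yields one result per `-SR` instance, which the integrated system can then compare.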
Multi-model recognition can be used to improve the accuracy of the NLU module.
Using multiple search instances at a time, one can keep the result
from the search instance that has the greatest likelihood, or the one that is related to a specific
modality. For example, the result from an instance associated with an LM built on a "greetings"
set of sentences can be given higher confidence when another component in the integrated
system (such as the Dialogue Manager) expects that kind of communication from the user at
that particular moment.
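This selection strategy can be sketched as follows (again not the Julius API: instance names, hypotheses, scores, and the bias value are invented for demonstration):

```python
# Illustrative sketch: selecting among the parallel results that multiple
# search instances return for one audio input, optionally biasing the
# instance the Dialogue Manager currently expects.

def select_result(results, preferred=None, bias=0.0):
    """Pick the winning (instance, hypothesis, log_likelihood) triple.

    results:   dict mapping instance name -> (hypothesis, log_likelihood)
    preferred: instance the Dialogue Manager currently expects, if any
    bias:      log-likelihood bonus granted to the preferred instance
    """
    def biased_score(item):
        name, (_, loglik) = item
        return loglik + (bias if name == preferred else 0.0)

    name, (hyp, loglik) = max(results.items(), key=biased_score)
    return name, hyp, loglik

results = {
    "sr_commands":  ("turn left", -120.0),
    "sr_greetings": ("good morning", -123.0),
}

# Pure maximum likelihood: the commands instance wins.
print(select_result(results)[0])                       # -> sr_commands

# The Dialogue Manager expects a greeting, so that instance gets a bonus.
print(select_result(results, "sr_greetings", 5.0)[0])  # -> sr_greetings
```

The bias here plays the role of the "higher confidence" mentioned above: context from the Dialogue Manager shifts the choice without discarding the other instances' results.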
Moreover, thanks to this feature, it will also be possible in the future to create models for
input rejection, so that unwanted speech events can be detected and discarded as needed.
References
Lee, A., Kawahara, T., & Shikano, K. (2001), Julius - an open source real-time large vocabulary recognition engine, in Proceedings of European Conference on Speech Communication and Technology (EUROSPEECH), 1691-1694.
Lee, A. (2010, May), Juliusbook. Retrieved from http://sourceforge.jp/frs/redir.php?m=jaist&f=%2Fjulius%2F47534%2FJuliusbook-4.1.5.pdf
For more information please contact:
Piero Cosi, Istituto di Scienze e Tecnologie della Cognizione - Sede Secondaria di Padova "ex Istituto di Fonetica e Dialettologia", CNR di Padova (e-mail: piero.cosi@pd.istc.cnr.it).
