Overview
Julius ASR has been developed as research software for Japanese
LVCSR since 1997. The work was continued under the IPA Japanese dictation toolkit project (1997-2000),
the Continuous Speech Recognition Consortium, Japan (CSRC, 2000-2003),
and currently the Interactive Speech Technology Consortium (ISTC).
The Open-Source Large Vocabulary Continuous Speech Recognition Engine Julius (Lee et al., 2001)
is a high-performance ASR decoder for researchers and developers, designed for
real-time decoding and modularity.
Julius has proven to be very easy to incorporate into integrated systems,
because its decoder API is well designed and the core engine is a separate C library. Moreover,
Julius has low system requirements, a small memory footprint, and high-speed decoding, and it
can swap language models at run-time: all these features are crucial in a real-time integrated
system handling several components. Its configuration is modular (i.e., each configuration
file can embed another one covering only one particular aspect of the configuration).
Most of the features available in other state-of-the-art decoders are also available for
Julius, including major search techniques such as tree lexicon, N-gram factoring, cross-word
context dependency handling, enveloped beam search, Gaussian pruning, Gaussian selection,
etc. Finally, Julius integrates a GMM-based and an energy-based VAD.
Main Features
Recognition output
Julius can produce speech recognition output as an n-best list
(i.e., the set of the n most probable sentences) or a lattice. A lattice (or Word Graph in Julius
terminology) is an acyclic ordered graph in which nodes represent words and edges represent
transitions weighted by acoustic and language model scores. ASR hypotheses
can be expressed as a Word Graph, which is a more powerful tool than an n-best list for Spoken
Language Understanding (SLU), since lattices provide a wider set of hypotheses from which
to choose and a more accurate representation of the hypothesis space.
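To make the idea concrete, the following is a minimal sketch (not the Julius API or its lattice file format) of a word graph as a directed acyclic graph whose edges carry words and combined acoustic + language model log-scores, with a dynamic-programming extraction of the single best path; all node numbers, words, and scores are invented for demonstration:

```python
# Illustrative word-graph sketch: edges map node -> [(next_node, word, log_score)].
# Node numbering is assumed to be a topological order of the DAG.

def best_path(edges, start, end):
    """Return (best_log_score, word_sequence) from start to end."""
    best = {start: (0.0, [])}          # node -> (best score so far, words)
    for node in sorted(edges):         # topological order by construction
        if node not in best:
            continue
        score, words = best[node]
        for nxt, word, s in edges[node]:
            cand = (score + s, words + [word])
            if nxt not in best or cand[0] > best[nxt][0]:
                best[nxt] = cand
    return best[end]

# A tiny lattice: two competing word hypotheses at each position.
lattice = {
    0: [(1, "turn", -1.0), (1, "burn", -2.5)],
    1: [(2, "left", -0.8), (2, "lift", -1.9)],
    2: [],
}
score, words = best_path(lattice, 0, 2)
print(words)   # -> ['turn', 'left']
```

An n-best list can be read off the same graph by keeping the k best partial hypotheses per node instead of one, which is why the lattice is the more general representation.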
Multi-model recognition
Julius supports multi-model recognition, as explained in (Lee, 2010). This
means that n > 1 configuration instances can be loaded, and the Julius engine will output n
results for a single audio input at the same time.
To enable multi-model recognition, multiple search instances must be declared. A search
instance is defined within a Julius configuration file and links to an AM (acoustic model) and
an LM (language model), with custom recognition parameters. Every ASR result comes from a
single search instance.
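Concretely, such declarations might look like the following schematic jconf fragment, modeled on the Julius-4 `-AM`/`-LM`/`-SR` declaration style described in the Juliusbook; all instance names and model file names below are placeholders, and the exact option set depends on the models actually used:

```
# Schematic Julius configuration (jconf) with two search instances.

-AM am_main             # declare an acoustic model instance
-h hmmdefs              # HMM definitions (placeholder file name)
-hlist tiedlist         # HMM name mapping (placeholder file name)

-LM lm_commands         # first language model: a command grammar
-gram commands          # grammar prefix (commands.dfa / commands.dict)

-LM lm_greetings        # second language model: a "greetings" grammar
-gram greetings

-SR sr_commands  am_main lm_commands    # search instance 1
-SR sr_greetings am_main lm_greetings   # search instance 2
```

With this setup, each audio input yields one result per `-SR` instance, which the integrated system can then compare.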
Multi-model recognition can be used to improve the accuracy of the NLU module.
Using multiple search instances at a time, one can keep the result
from the search instance that has the greatest likelihood, or the one that is related to a specific
modality. For example, the result from an instance associated with an LM built on a "greetings"
set of sentences can be given higher confidence when another component in the integrated
system (such as the Dialogue Manager) expects that kind of communication from the user at
that particular moment.
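This selection strategy can be sketched as follows (again not the Julius API: instance names, hypotheses, scores, and the bias value are invented for demonstration):

```python
# Illustrative sketch: selecting among the parallel results that multiple
# search instances return for one audio input, optionally biasing the
# instance the Dialogue Manager currently expects.

def select_result(results, preferred=None, bias=0.0):
    """Pick the winning (instance, hypothesis, log_likelihood) triple.

    results:   dict mapping instance name -> (hypothesis, log_likelihood)
    preferred: instance the Dialogue Manager currently expects, if any
    bias:      log-likelihood bonus granted to the preferred instance
    """
    def biased_score(item):
        name, (_, loglik) = item
        return loglik + (bias if name == preferred else 0.0)

    name, (hyp, loglik) = max(results.items(), key=biased_score)
    return name, hyp, loglik

results = {
    "sr_commands":  ("turn left", -120.0),
    "sr_greetings": ("good morning", -123.0),
}

# Pure maximum likelihood: the commands instance wins.
print(select_result(results)[0])                       # -> sr_commands

# The Dialogue Manager expects a greeting, so that instance gets a bonus.
print(select_result(results, "sr_greetings", 5.0)[0])  # -> sr_greetings
```

The bias here plays the role of the "higher confidence" mentioned above: context from the Dialogue Manager shifts the choice without discarding the other instances' results.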
Moreover, thanks to this feature, it will also be possible in the future to create models for
input rejection, so that unwanted speech events can be detected and discarded as needed.
References
Lee, A., Kawahara, T., & Shikano, K. (2001), Julius - an open source real-time large vocabulary recognition engine, in Proceedings of European Conference on Speech Communication and Technology (EUROSPEECH), 1691-1694.
Lee, A. (2010, May), Juliusbook. Retrieved from http://sourceforge.jp/frs/redir.php?m=jaist&f=%2Fjulius%2F47534%2FJuliusbook-4.1.5.pdf
For more information please contact:
Piero Cosi, Istituto di Scienze e Tecnologie della Cognizione - Sede Secondaria di Padova "ex Istituto di Fonetica e Dialettologia", CNR di Padova (e-mail: piero.cosi@pd.istc.cnr.it).
