Speech Synthesis Research
ISTC-SSPD CNR has recently increased its activities in the area of speech generation and is now focusing on two key areas of research and development: small-footprint speech synthesis and very high quality application-specific synthesis. By way of introduction, speech generation is generally accomplished by one of three methods: general-purpose concatenative synthesis, corpus-based synthesis, or phrase splicing.
The strengths and weaknesses of these methods are complementary. In terms of speech quality and scope, general-purpose concatenative synthesis can handle any input sentence but generally produces mediocre quality. Corpus-based synthesis can produce very high quality, but only if its speech corpus contains the right phoneme sequences with the right prosody for a given input sentence. If the corpus contains the right phonemes but with the wrong prosody, the result may sound quite good locally (i.e., within the range of a phoneme sequence that was available in the corpus), yet the utterance as a whole may have a bizarre sing-song quality with confusing accelerations and decelerations. Finally, phrase splicing methods produce completely natural speech, but can only say the pre-stored phrases or combinations of sentence frames and slot items; naturalness can still be a problem if the slot items are not carefully matched to the sentence frames in terms of prosody.
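The scope limitation of phrase splicing can be illustrated with a minimal sketch. It assumes a hypothetical inventory of pre-recorded clips (the phrases, file names, and the `splice` helper are all invented for illustration and are not part of any ISTC-SSPD software). A sentence frame is a list of fixed phrases and named slots; the sentence can be spoken only if every fixed phrase and every slot item has a recording.

```python
from typing import Dict, List

# Hypothetical inventory: each pre-recorded phrase maps to an audio file name.
RECORDED_CLIPS: Dict[str, str] = {
    "The train to": "frame_train_to.wav",
    "leaves at": "frame_leaves_at.wav",
    "Padova": "slot_padova.wav",
    "Venezia": "slot_venezia.wav",
    "nine o'clock": "slot_nine.wav",
}


def splice(frame: List[str], slots: Dict[str, str]) -> List[str]:
    """Return the ordered list of clip files for a filled sentence frame.

    Frame items written as "{name}" are slots; all other items are fixed text.
    Raises KeyError if a slot is unfilled or a phrase was never recorded.
    """
    playlist = []
    for item in frame:
        if item.startswith("{") and item.endswith("}"):
            phrase = slots[item[1:-1]]  # look up the slot item by slot name
        else:
            phrase = item
        if phrase not in RECORDED_CLIPS:
            raise KeyError(f"no recording for {phrase!r}")
        playlist.append(RECORDED_CLIPS[phrase])
    return playlist


FRAME = ["The train to", "{city}", "leaves at", "{time}"]
playlist = splice(FRAME, {"city": "Padova", "time": "nine o'clock"})
```

Asking for a city that is not in the inventory raises an error rather than producing speech, which is the scope restriction described above. The sketch does not model prosody at all: in a real system each slot item would also have to be recorded in a prosodic context that matches its frame.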
An additional issue to consider is the amount of work required to build a system. The cost of generating a corpus or an acoustic unit inventory is significant because, besides being recorded, each utterance has to be analyzed in fine detail by hand to determine phoneme boundaries, phoneme labels, and other tags. Such time-consuming analysis is not necessary for phrase splicing methods. On the other hand, phrase splicing may be prohibitively expensive for applications involving names (in Italy, there are perhaps 1.5 million distinct last names!).
A final consideration is size. Although the prices of memory and disk space are continually dropping, fitting more channels onto a given hardware platform translates directly into increased profits, and there is also growing interest in using speech synthesis on handheld devices. Thus, size still matters. Concatenative synthesis has the edge on size. Moreover, its quality limitations are less of a problem on handheld devices because the acoustic capabilities of those devices are themselves limited.
In other words, each of these methods has problems with quality, scope, the amount of resources required, or size. ISTC-SSPD CNR is focusing on the following projects:
Our software is integrated into an existing TTS engine that has a
sufficiently rich internal data structure, such as Festival,
together with OGI Re-LPC or MBROLA for Festival.
As a preview of things to come, here are some sentences produced
using new acoustic inventories and signal processing components
developed at ISTC-SSPD CNR, coupled with prosodic models from a commercial TTS
engine (these are TTS, not copy synthesis):
For more information please contact:
Piero Cosi
- Istituto di Scienze e Tecnologie della Cognizione- Sezione di
Padova "Fonetica e Dialettologia" del CNR (e-mail:
piero.cosi@pd.istc.cnr.it)
P.Cosi, R.Gretter and F.Tesser, "Festival parla italiano", in Proceedings of GFS2000, XI Giornate del Gruppo di Fonetica Sperimentale, Padova 29-30 Novembre - 1 Dicembre, 2000, (in press).
FESTIVAL: Alan W. Black (awb@cs.cmu.edu), Paul Taylor (Paul.Taylor@ed.ac.uk), Richard Caley, Rob Clark (robert@cstr.ed.ac.uk) CSTR - Centre for Speech Technology - University of Edinburgh. WWW page: http://www.cstr.ed.ac.uk/projects/festival/.
M.Macon, A.Cronk, J.Wouters and A.Kain, "OGIresLPC: Diphone synthesiser using residual-excited linear prediction", no. CSE-97-007, Department of Computer Science, Oregon Graduate Institute of Science and Technology, Portland, OR, September 1997 (macon@ece.ogi.edu).
T.Dutoit, H.Leich, "MBR-PSOLA : Text-To-Speech Synthesis based on an MBE Re-Synthesis of the Segments Database", Speech Communication, Elsevier Publisher, November 1993, vol. 13, n°3-4.
T.Dutoit, An Introduction to Text-To-Speech Synthesis, Kluwer Academic Publishers, 1996, 326 pp.
FESTVOX: Alan W Black (awb@cs.cmu.edu), Kevin A. Lenzo (lenzo@cs.cmu.edu) Speech Group at Carnegie Mellon University. WWW page: http://www.festvox.org/.
MPIRO: Multilingual Personalized Information Objects. European Project IST-1999-10982 Version : 5. WWW page: http://www.ltg.ed.ac.uk/mpiro/