Italian Text-to-Speech Research (by Piero Cosi)


Speech Synthesis Research
Italian TTS: Italian synthesis with a linguistic processor

Research

Overview - Speech Generation Projects

ISTC-SSPD CNR has recently increased its activities in the area of speech generation, and is now focusing on two key areas of research and development: small-footprint speech synthesis and very high quality application-specific synthesis. By way of introduction, speech generation is generally accomplished by one of the following three methods:

General-purpose concatenative synthesis. The system translates incoming text into phoneme labels, stress and emphasis tags, and phrase break tags. This information is used to compute a target prosodic pattern (i.e., phoneme durations and pitch contour). Finally, signal processing methods retrieve acoustic units (fragments of speech corresponding to short phoneme sequences such as diphones) from a stored inventory, modify the units so that they match the target prosody, and concatenate them (glue them together and smooth the joins) to form an output utterance.
Corpus-based synthesis. Similar to general-purpose concatenative synthesis, except that the inventory consists of a large corpus of labeled speech and that, instead of modifying the stored speech to match the target prosody, the system searches the corpus for phoneme sequences whose prosodic patterns already match the target prosody.
Phrase splicing. Stored prompts, sentence frames, and stored items that fill the slots of these frames are glued together.
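As a rough illustration of the first method, the sketch below turns a phoneme sequence into the diphone names that would be looked up in a stored inventory, and computes the time-scaling factor needed to bring a stored unit to its target duration. The function names, the "_" silence symbol, and the "a-b" naming scheme are assumptions made for this example; they are not taken from any particular TTS engine.

```python
def to_diphones(phonemes, silence="_"):
    """Return diphone names for a phoneme sequence, padded with silence.

    A diphone spans the transition between two adjacent phonemes, so a
    sequence of n phonemes yields n + 1 diphones once silence is added
    at both ends.
    """
    padded = [silence, *phonemes, silence]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


def duration_scale(stored_seconds, target_seconds):
    """Factor by which a stored unit must be stretched (>1) or shortened (<1)."""
    if stored_seconds <= 0:
        raise ValueError("stored duration must be positive")
    return target_seconds / stored_seconds


if __name__ == "__main__":
    # Hypothetical phoneme labels for a short word.
    print(to_diphones(["k", "a", "z", "a"]))
    print(duration_scale(0.1, 0.2))
```

In a corpus-based system the scaling step would ideally be unnecessary, because a unit with a suitable duration is chosen in the first place.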

The strengths and weaknesses of these methods are complementary. As for speech quality and scope, general-purpose concatenative synthesis can handle any input sentence but generally produces mediocre quality. Corpus-based synthesis can produce very high quality, but only if its speech corpus contains the right phoneme sequences with the right prosody for a given input sentence. If the corpus contains the right phonemes but with the wrong prosody, the result may sound quite good locally (i.e., within the range of a phoneme sequence that was available in the corpus), but the utterance as a whole may have a bizarre sing-song quality with confusing accelerations and decelerations. Phrase splicing methods, of course, produce completely natural speech, but can only say the pre-stored phrases or combinations of sentence frames and slot items; naturalness can be a problem if the slot items are not carefully matched to the sentence frames in terms of prosody.
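The corpus search can be pictured as a minimum-cost path problem. The following is a minimal sketch, not the method used at ISTC-SSPD: each candidate unit pays a target cost for deviating from the desired duration and pitch, and a join cost for pitch jumps against its predecessor, and dynamic programming picks the cheapest sequence. The Unit fields, both cost formulas, and the join_weight parameter are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    label: str       # phoneme-sequence label, e.g. a diphone name
    duration: float  # seconds
    pitch: float     # mean F0 in Hz


def target_cost(cand, target):
    """Relative duration mismatch plus relative pitch mismatch."""
    return (abs(cand.duration - target.duration) / target.duration
            + abs(cand.pitch - target.pitch) / target.pitch)


def join_cost(prev, cand):
    """Relative pitch jump at the concatenation point."""
    return abs(prev.pitch - cand.pitch) / max(prev.pitch, cand.pitch)


def select_units(targets, corpus, join_weight=1.0):
    """Return (units, total_cost) for the cheapest candidate sequence."""
    candidates = [[u for u in corpus if u.label == t.label] for t in targets]
    if any(not c for c in candidates):
        raise ValueError("corpus lacks a required unit")

    # best[i][j] = (cheapest cost ending in candidates[i][j], back-pointer)
    best = [[(target_cost(u, targets[0]), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + join_weight * join_cost(p, u), k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(u, targets[i]), back))
        best.append(row)

    # Trace the back-pointers from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total


if __name__ == "__main__":
    targets = [Unit("a-b", 0.1, 120.0), Unit("b-c", 0.1, 120.0)]
    corpus = [Unit("a-b", 0.1, 120.0), Unit("a-b", 0.2, 200.0),
              Unit("b-c", 0.1, 118.0), Unit("b-c", 0.1, 220.0)]
    units, cost = select_units(targets, corpus)
    print([u.pitch for u in units], round(cost, 4))
```

When no candidate lies close to the target, the cheapest path still contains large mismatches, which is the situation that produces the sing-song effect described above.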

An additional issue to consider is the amount of work required to build a system. The cost of generating a corpus or an acoustic unit inventory is significant because, besides making the speech recordings, each recording has to be analyzed in fine detail by hand to determine phoneme boundaries, phoneme labels, and other tags. Such time-consuming analysis is not necessary for phrase splicing methods. On the other hand, phrase splicing may be prohibitively expensive for applications involving names, since every name would have to be recorded (in Italy, there are ??1.5?? million distinct last names!).

A final consideration is size. Although the prices of memory and disk space are continually dropping, being able to have more channels on a given hardware platform translates directly into increased profits, and there is also an increasing interest in using speech synthesis on handheld devices. Thus, size still matters. Concatenative synthesis has the edge on size. Moreover, its quality limitations are less of a problem because the acoustic capabilities of handheld devices are themselves limited.

In other words, each of these methods has problems with quality, scope, the amount of resources required, or size. ISTC-SSPD CNR is focusing on the following projects:

Use of different algorithms for generating utterances that seamlessly combine synthetic speech, stored prompts, stored sentence frames, and stored items used in the slots of these frames.
Domain dependent intonation rules to generate substantially more natural intonation than can be obtained with general-purpose rules.
Acoustic inventory compression.
Prosodic speech modification beyond pitch and timing.
Emotive/expressive concatenative waveform synthesizer.
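The first project in the list, combining stored material with synthetic speech, can be sketched as a planning step that decides, piece by piece, whether a recording exists or whether the synthesizer has to fill in. This is only an illustrative sketch: the frame representation (a list of literal pieces and "{slot}" placeholders), the plan_utterance function, and the example phrases are hypothetical.

```python
def plan_utterance(frame, slots, stored_prompts):
    """Tag each piece of a sentence frame as 'prompt' or 'tts'.

    frame          -- list of literal strings and "{name}" placeholders
    slots          -- mapping from placeholder name to the text filling it
    stored_prompts -- set of texts for which a recording is available
    """
    plan = []
    for piece in frame:
        if piece.startswith("{") and piece.endswith("}"):
            text = slots[piece[1:-1]]  # fill the slot
        else:
            text = piece               # literal part of the frame
        source = "prompt" if text in stored_prompts else "tts"
        plan.append((source, text))
    return plan


if __name__ == "__main__":
    frame = ["Il treno per", "{city}", "parte alle", "{time}"]
    stored = {"Il treno per", "parte alle", "Roma"}
    slots = {"city": "Roma", "time": "otto e dieci"}
    for source, text in plan_utterance(frame, slots, stored):
        print(source, text)
```

A real system would also have to match the prosody of the synthesized pieces to the neighbouring recordings so that the joins are not audible, which is where the seamless combination becomes difficult.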

Our software is integrated into an existing TTS engine that has a sufficiently rich internal data structure, such as Festival, used together with the OGIresLPC or MBROLA synthesizers.

As a preview of things to come, here are some sentences produced using new acoustic inventories and signal processing components developed at ISTC-SSPD CNR, coupled with prosodic models from a commercial TTS engine (these are TTS, not copy synthesis):

Italian
- Sentence 1
- Sentence 2
- sample of speaker's original voice: A, B.

For more information please contact:

Piero Cosi - Istituto di Scienze e Tecnologie della Cognizione - Sezione di Padova "Fonetica e Dialettologia" del CNR (e-mail: piero.cosi@pd.istc.cnr.it)
Fabio Tesser - ITC-IRST, Istituto Trentino di Cultura - Istituto per la Ricerca Scientifica e Tecnologica (e-mail: tesser@itc.it)
Roberto Gretter - ITC-IRST, Istituto Trentino di Cultura - Istituto per la Ricerca Scientifica e Tecnologica (e-mail: gretter@itc.it)
Carlo Drioli - Istituto di Scienze e Tecnologie della Cognizione - Sezione di Padova "Fonetica e Dialettologia" del CNR (e-mail: drioli@pd.istc.cnr.it)
Graziano Tisato - Centro di Calcolo dell'Università di Padova - Istituto di Scienze e Tecnologie della Cognizione - Sezione di Padova "Fonetica e Dialettologia" del CNR (e-mail: tisato@pd.istc.cnr.it)

References

P. Cosi, R. Gretter and F. Tesser, "Festival parla italiano", in Proceedings of GFS2000, XI Giornate del Gruppo di Fonetica Sperimentale, Padova, 29-30 November - 1 December 2000 (in press).

FESTIVAL: Alan W. Black (awb@cs.cmu.edu), Paul Taylor (Paul.Taylor@ed.ac.uk), Richard Caley, Rob Clark (robert@cstr.ed.ac.uk) CSTR - Centre for Speech Technology - University of Edinburgh. WWW page: http://www.cstr.ed.ac.uk/projects/festival/.

M. Macon, A. Cronk, J. Wouters and A. Kain, "OGIresLPC: Diphone synthesiser using residual-excited linear prediction", Technical Report CSE-97-007, Department of Computer Science, Oregon Graduate Institute of Science and Technology, Portland, OR, September 1997 (macon@ece.ogi.edu).

T. Dutoit and H. Leich, "MBR-PSOLA: Text-To-Speech Synthesis based on an MBE Re-Synthesis of the Segments Database", Speech Communication, Elsevier, vol. 13, n° 3-4, November 1993.

T. Dutoit, An Introduction to Text-To-Speech Synthesis, Kluwer Academic Publishers, 1996, 326 pp.

FESTVOX: Alan W Black (awb@cs.cmu.edu), Kevin A. Lenzo (lenzo@cs.cmu.edu) Speech Group at Carnegie Mellon University. WWW page: http://www.festvox.org/.

MPIRO: Multilingual Personalized Information Objects. European Project IST-1999-10982, Version 5. WWW page: http://www.ltg.ed.ac.uk/mpiro/

