Italian Text-to-Speech (by Piero Cosi)

FESTIVAL

Italian Diphone Database
Text Linguistic Analysis
Prosodic Analysis
Waveform Synthesiser
RE-LPC Festival
OGI RE-LPC for Festival
MBROLA
References (FESTIVAL)
References (ISTC-SPFD CNR)

FESTIVAL is a general multi-lingual speech synthesis system developed at the Centre for Speech Technology Research (CSTR) in Edinburgh, Scotland, UK.

It offers a full text-to-speech system with various APIs, as well as an environment for development and research of speech synthesis techniques. It is written in C++ with a Scheme-based command interpreter for general control.

FESTIVAL is a text-to-speech (TTS) system, whereby unrestricted text is transformed into speech. It is a diphone-based synthesis system utilizing the Residual-Excited LPC synthesis technique. In diphone synthesis, speech is created by the recombination of previously stored samples of speech, called diphones. The challenges of diphone synthesis include producing a natural-sounding set of diphones, ensuring they can be joined smoothly, and manipulating the pitch and duration of the sounds.
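As an illustration of the recombination step, the following minimal Python sketch lists the diphones needed for a phone sequence. A diphone spans the transition from one phone to the next, so an utterance padded with silence on both sides needs one diphone per adjacent pair. The function name, the `#` silence label and the phone labels are assumptions made for this example; they are not the labels used in the Italian database.

```python
def phones_to_diphones(phones, silence="#"):
    """Return the diphone names needed to synthesise a phone sequence.

    The sequence is padded with a silence label on both sides, and every
    adjacent pair of labels becomes one diphone name "left-right".
    """
    padded = [silence] + list(phones) + [silence]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


# Hypothetical phone labels for the Italian word "casa".
print(phones_to_diphones(["k", "a", "z", "a"]))
# prints ['#-k', 'k-a', 'a-z', 'z-a', 'a-#']
```

A real synthesiser would then look up each name in the stored diphone inventory and join the retrieved units.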

FESTIVAL is a multi-lingual system suitable for research, development and general use, and it is freely available for research and educational use. CSTR has expanded its range of languages: there are TTS systems for English, Spanish and Welsh and, finally, FESTIVAL speaks Italian!

Italian Diphone Database

A new Italian synthesis database was recorded with a male speaker (P.C.) at ISTC-SPFD CNR, and a similar database was recorded with a female speaker (L.P.) at ITC-irst. The laryngograph signal (electroglottograph, EGG) was also recorded to allow better pitch extraction. The speaker reads a set of carefully designed nonsense or true Italian words embedded in syntactically correct but semantically incorrect sentences, which have been constructed to elicit particular phonetic effects. This technique ensures that the collected database contains only the required variability. Various scripts for automatic segmentation, diphone extraction and LPC analysis have been developed to speed up the creation of a new voice.

The database has been formatted in the FESTIVAL, OGI Residual LPC and MBROLA synthesis formats.
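To make the idea behind LPC analysis and residual excitation concrete, here is a small pure-Python sketch. It is not the project's analysis scripts: it estimates predictor coefficients with the textbook autocorrelation method and Levinson-Durbin recursion, computes the prediction residual, and then drives the all-pole filter with that residual to recover the signal. The test signal is a synthetic second-order autoregressive process, and all function names are assumptions made for this example.

```python
import random


def autocorrelation(x, max_lag):
    """Autocorrelation r[0..max_lag] of the sequence x."""
    n = len(x)
    return [sum(x[i] * x[i - lag] for i in range(lag, n))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Predictor coefficients a[1..order] such that x[n] ~ sum(a[k] * x[n-k])."""
    a = [0.0] * (order + 1)  # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        prev = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= 1.0 - k * k
    return a[1:]


def lpc_residual(x, coeffs):
    """Inverse filtering: the part of x the linear predictor cannot explain."""
    out = []
    for n in range(len(x)):
        pred = sum(c * x[n - k] for k, c in enumerate(coeffs, start=1) if n - k >= 0)
        out.append(x[n] - pred)
    return out


def lpc_synthesise(excitation, coeffs):
    """All-pole synthesis filter driven by an excitation signal."""
    y = []
    for n in range(len(excitation)):
        pred = sum(c * y[n - k] for k, c in enumerate(coeffs, start=1) if n - k >= 0)
        y.append(excitation[n] + pred)
    return y


# Synthetic test signal (an assumption, not speech): a stable AR(2) process.
rng = random.Random(0)
x = [0.0, 0.0]
for _ in range(400):
    x.append(1.3 * x[-1] - 0.6 * x[-2] + rng.gauss(0.0, 1.0))

coeffs = levinson_durbin(autocorrelation(x, 2), 2)
e = lpc_residual(x, coeffs)
y = lpc_synthesise(e, coeffs)
max_error = max(abs(u - v) for u, v in zip(x, y))
print("coefficients:", coeffs)
print("max reconstruction error:", max_error)
```

Exciting the filter with the true residual reproduces the original frame up to rounding error, which is why storing the residual alongside the LPC coefficients preserves more of the speaker's voice quality than a purely parametric excitation would.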

Text/Linguistic Analysis

Various modules have been constructed for:

input text-string processing;
equivalent characters mapping;
distinction and processing of numerical data and function words;
letter to sound module: phonetic transcription;
syllabification;
compilation of a lexicon: it contains approximately 500000 word-forms with their part-of-speech (POS) specified.
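The first few steps in the list above can be sketched in Python as follows. This is only an illustration under stated assumptions: the character mapping, the tiny function-word set and the token categories are hypothetical and are not taken from the actual Italian modules, whose lexicon is far larger.

```python
import re

# Hypothetical mapping of typographic variants onto plain equivalents.
EQUIVALENT_CHARS = str.maketrans({"’": "'", "‘": "'", "“": '"', "”": '"'})

# Hypothetical, deliberately tiny function-word list.
FUNCTION_WORDS = {
    "il", "lo", "la", "l", "i", "gli", "le", "un", "una",
    "di", "a", "da", "in", "con", "su", "per", "e", "che", "alle",
}

# A number (optionally with a decimal part), a run of letters,
# or a single punctuation character.
TOKEN_RE = re.compile(r"\d+(?:[.,]\d+)?|[^\W\d_]+|[^\w\s]")


def classify_tokens(text):
    """Split text into tokens and tag each as number, function, content or punct."""
    result = []
    for token in TOKEN_RE.findall(text.translate(EQUIVALENT_CHARS)):
        if token[0].isdigit():
            kind = "number"
        elif token.lower() in FUNCTION_WORDS:
            kind = "function"
        elif token[0].isalpha():
            kind = "content"
        else:
            kind = "punct"
        result.append((token, kind))
    return result


for token, kind in classify_tokens("Il treno parte alle 10,30 da Padova."):
    print(f"{token}\t{kind}")
```

In a full front end, the tokens tagged as numbers would be expanded into words, and the remaining tokens would be passed on to the letter-to-sound and syllabification modules.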


Prosodic Analysis

The control of prosody has a central role in TTS synthesis; in fact, one of the most pressing problems in TTS is that of intonation. This divides into two areas: deciding what intonation the system should use for an utterance, and the realisation of that intonation as a fundamental frequency contour. Traditionally, two approaches have been used for the front end (that is, "text" or "linguistic" analysis) of speech synthesizers. The first uses sophisticated rules to parse and tag the text. Although theoretically justified, the algorithms developed to date have been so unreliable and unwieldy that many have tried the second approach, whereby a front end is hacked together and very simple (sometimes statistical) rules are used to detect where phrasing should be placed, etc.

Up to now this is the weak part of the Italian system; in fact, it is still "under construction". A prosodic duration module has been designed to assign a specific duration to each diphone. A standard phone duration has been determined for each diphone from a fluent-speech database kindly provided by ITC-IRST, and these durations are modified on the basis of the phone's position inside the phrase and the word. Two simple prosodic intonation modules, one for declarative sentences and the other for interrogative sentences, have been built making use of the stress cue and of the function-word cue previously obtained.
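A minimal Python sketch of this kind of rule-based prosody follows. It assumes a table of standard durations scaled by position-dependent factors, and a falling pitch line whose last target is raised for questions. Every number, name and rule here is hypothetical; in particular, the sketch ignores the stress and function-word cues that the actual intonation modules use.

```python
from dataclasses import dataclass


@dataclass
class Phone:
    name: str
    word_final: bool = False
    phrase_final: bool = False


# Hypothetical standard durations in milliseconds.
STANDARD_MS = {"k": 80, "a": 100, "z": 70}

# Hypothetical lengthening factors for final positions.
WORD_FINAL_FACTOR = 1.1
PHRASE_FINAL_FACTOR = 1.4


def phone_duration(phone):
    """Scale the standard duration according to the phone's position."""
    duration = STANDARD_MS[phone.name]
    if phone.phrase_final:
        duration *= PHRASE_FINAL_FACTOR
    elif phone.word_final:
        duration *= WORD_FINAL_FACTOR
    return round(duration)


def f0_targets(n_points, question=False, start_hz=130.0, end_hz=90.0, rise_hz=150.0):
    """Linear declination; for a question the last target is replaced by a rise."""
    if n_points < 2:
        raise ValueError("need at least two pitch targets")
    step = (end_hz - start_hz) / (n_points - 1)
    targets = [start_hz + i * step for i in range(n_points)]
    if question:
        targets[-1] = rise_hz
    return targets


casa = [Phone("k"), Phone("a"), Phone("z"), Phone("a", word_final=True, phrase_final=True)]
print([phone_duration(p) for p in casa])
print(f0_targets(5))
print(f0_targets(5, question=True))
```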


Waveform Synthesizer

Various waveform synthesizers have been utilized:
FESTIVAL diphone based residual excited LPC
	FESTIVAL is a diphone-based synthesis system utilizing the Residual-Excited LPC synthesis technique. (A new Italian FESTIVAL database, developed by ISTC-SPFD, is now available for download.)
OGI diphone based residual excited LPC for FESTIVAL
	OGI RE-LPC is a diphone-based synthesis system utilizing a new OGI-specific Residual-Excited LPC synthesis engine. (A new Italian OGI RE-LPC for FESTIVAL database, developed by ISTC-SPFD, is now available for download.)
MBROLA is a diphone based PCM 16bit/16kHz synthesizer.
	MBROLA is a speech synthesizer based on the concatenation of diphones coded as PCM 16-bit linear signals. It takes a list of phonemes as input, together with prosodic information (duration of phonemes and a piecewise linear description of pitch), and produces 16-bit linear speech samples at the sampling frequency of the diphone database used. It is therefore NOT a Text-To-Speech (TTS) synthesizer, since it does not accept raw text as input. This synthesizer is provided for free, for non-commercial, non-military applications only. (A new Italian MBROLA database, developed by ISTC-SPFD CNR, is now available for download.)
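The MBROLA input described above is a plain-text phoneme file: each line carries a phoneme name, its duration in milliseconds, and optional pairs of pitch targets given as a position (percent of the phoneme's duration) and a frequency in Hz. The Python sketch below writes such lines. The helper name, the phoneme labels and the prosodic values are assumptions for illustration; actual phoneme names depend on the diphone database in use.

```python
def to_pho(segments):
    """Format (phoneme, duration_ms, pitch_points) triples as MBROLA input lines.

    pitch_points is a list of (position_percent, f0_hz) pairs and may be empty.
    """
    lines = []
    for phoneme, duration_ms, pitch_points in segments:
        fields = [phoneme, str(int(duration_ms))]
        for position_pct, f0_hz in pitch_points:
            fields += [str(int(position_pct)), str(int(f0_hz))]
        lines.append(" ".join(fields))
    return "\n".join(lines) + "\n"


# Hypothetical labels and prosody for "casa"; "_" denotes silence.
segments = [
    ("_", 100, []),
    ("k", 80, []),
    ("a", 100, [(0, 130), (100, 120)]),
    ("z", 70, []),
    ("a", 140, [(0, 110), (100, 90)]),
    ("_", 100, []),
]
print(to_pho(segments), end="")
```

The pitch targets are interpolated linearly between the given points, which is the "piecewise linear description of pitch" mentioned above.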

References

"Festival Speaks Italian!", P.Cosi, F. Tesser, R. Gretter and C. Avesani, Proceedings Eurospeech 2001, Aalborg, Denmark, September 3-7, 2001, pp. 509-512. (pdf)
"FESTIVAL parla italiano!", P.Cosi, R.Gretter and F.Tesser, Atti XI Giornate di Studio del G.F.S. - Multimodalità e Multimedialità nella Comunicazione, Padova, Italy, November 29-30, December 1, 2000, UNIPRESS, Padova, 2001, pp. 235-242. (pdf)
"Recenti sviluppi di FESTIVAL per l'italiano", P.Cosi, R.Gretter and F.Tesser, Proceedings XII Giornate di Studio del G.F.S., Macerata, Italy, December 13-15, 2001, pp. 251-256. (pdf)
"On the Use of Cart-Tree for Prosodic Predictions in the Italian Festival TTS", P.Cosi, C. Avesani, F.Tesser, R.Gretter, F.Pianesi, in Voce, Canto, Parlato - Studi in onore di Franco Ferrero, E. Caldognetto Magno, P. Cosi, A. Zamboni editori, UNIPRESS, Padova, 2002, pp. 73-81. (pdf)
“Prosodic Data-Driven Modelling of Narrative Style in FESTIVAL TTS”, Tesser F., Cosi P., Drioli C., Tisato G., in CDRom Proceedings of 5th ISCA Speech Synthesis Workshop, 14th-16th June 2004, Carnegie Mellon University, Pittsburgh USA, (CDRom). (pdf)
“Modello prosodico “data-driven” di festival per l’italiano”, Tesser F., Cosi P., Mana N., Avesani C., Gretter R., Pianesi F., in Proceedings of XIV Giornate di Studio del G.F.S., Viterbo, Italy, December 4-6, 2003, Volume XXXI Collana degli Atti dell’Associazione Italiana di Acustica, Settembre 2004, pp. 273-278. (pdf)
“Emotional Festival-Mbrola TTS Synthesis”, Tesser F., Cosi P., Drioli C., Tisato G., in CD Proceedings INTERSPEECH 2005, Lisbon, Portugal, 2005, pp. 505-508. (pdf)
“Modelli prosodici emotivi di festival in italiano”, Tesser F., Cosi P., Drioli C., Tisato G., in CD Rom Proceedings of AISV 2004, 1st Conference of Associazione Italiana di Scienze della Voce, Padova, Italy, December 2-4, 2004, EDK Editore s.r.l., Padova, 2005, pp. 799-806. (pdf)
“GMM modelling of voice quality for FESTIVAL/MBROLA emotive TTS synthesis”, Nicolao M., Drioli C., Cosi P., in Proceedings of INTERSPEECH 2006, Pittsburgh, Pennsylvania, USA, 17-21 September, 2006, pp. 1794-1797. (pdf)
“Sintesi Vocale Concatenativa per l’italiano Tramite Modello Sinusoidale”, Sommavilla G., Drioli C., Cosi P., in CD-Rom Proceedings of AISV 2005, 2nd Conference of Associazione Italiana di Scienze della Voce, Salerno, Italy, November 30 - December 2, 2005, EDK Editore s.r.l., Padova, 2006, pp. 761-772. (pdf)
“SMS-FESTIVAL: un nuovo ambiente di lavoro per la sintesi vocale da testo scritto”, Sommavilla G., Cosi P., Drioli C., Paci G. , in CD-Rom Proceedings of AISV 2006, 3rd Conference of Associazione Italiana di Scienze della Voce, "Scienze Vocali e del Linguaggio Metodologie di Valutazione e Risorse Linguistiche", Pantè di Povo TRENTO, 29-30 Novembre - 1 Dicembre 2006, EDK Editore s.r.l., Padova, 2007, pp. 347-352. (pdf)
“SMS-FESTIVAL: a New TTS Framework”, Sommavilla G., Cosi P., Drioli C., Paci G., in Manfredi C. (editor), Proceedings of MAVEBA 2007, 5th International Workshop on Models and Analysis of Vocal Emissions for Biomedical Applications, December 13-15, 2007, Firenze, Italy, pp. 89-92. (pdf)
“FESTIVAL E LUCIA: TTS (Text-To-Speech) e IVA (Intelligent Virtual Agent) al servizio della didattica dei disabili”, Cosi P., Magno Caldognetto E., Proceedings of 3rd Convegno Internazionale “Progresso e Innovazioni Tecnologiche nella riabilitazione dell'età evolutiva”, Napoli, 22 Giugno 2007, 2008, (to be printed). (pdf)

FESTIVAL: Alan W. Black (awb@cs.cmu.edu), Paul Taylor (Paul.Taylor@ed.ac.uk), Richard Caley, Rob Clark (robert@cstr.ed.ac.uk) CSTR - Centre for Speech Technology - University of Edinburgh. WWW page: http://www.cstr.ed.ac.uk/projects/festival/.
M.Macon, A.Cronk and J.Wouters and A.Kain, "OGIresLPC: Diphone synthesiser using residual-excited linear prediction", num. CSE-97-007, Department of Computer Science, Oregon Graduate Institute of Science and Technology, Portland, OR, Sep, 1997, (macon@ece.ogi.edu)
T.Dutoit, H.Leich, "MBR-PSOLA : Text-To-Speech Synthesis based on an MBE Re-Synthesis of the Segments Database", Speech Communication, Elsevier Publisher, November 1993, vol. 13, n°3-4.
T. Dutoit, An Introduction to Text-To-Speech Synthesis, Kluwer Academic Publishers, 1996, 326 pp.
FESTVOX: Alan W Black (awb@cs.cmu.edu), Kevin A. Lenzo (lenzo@cs.cmu.edu) Speech Group at Carnegie Mellon University. WWW page: http://www.festvox.org/.
MPIRO: Multilingual Personalized Information Objects. European Project IST-1999-10982, Version 5. WWW page: http://www.ltg.ed.ac.uk/mpiro/.

Now FESTIVAL in Italian is also emotive/expressive!

Part of this work has been sponsored by:

		MPIRO: Multilingual Personalized Information Objects European Project IST-1999-10982
		TICCA: Technologies for Interactive Cognitive and Communicative Agents A joint project between ITC-irst and CNR-ISTC
		PF-STAR: Preparing Future multiSensorial inTeraction reseARch European Project IST-2001-37599

Authors

Piero Cosi	ISTC-SPFD CNR Istituto di Scienze e Tecnologie della Cognizione Sezione di Padova "Fonetica e Dialettologia" Consiglio Nazionale delle Ricerche e-mail: cosi@pd.istc.cnr.it	Project Leader
Fabio Tesser	ITC-IRST, Istituto Trentino di Cultura Istituto per la Ricerca Scientifica e Tecnologica e-mail: tesser@itc.it
with the collaboration of
Carlo Drioli	ISTC-SPFD CNR e-mail: drioli@pd.istc.cnr.it
Graziano Tisato	ISTC-SPFD CNR e-mail: tisato@pd.istc.cnr.it
Roberto Gretter	ITC-IRST, Istituto Trentino di Cultura Istituto per la Ricerca Scientifica e Tecnologica e-mail: gretter@itc.it