Emotional Synthesis 



Introduction

The transmission of emotions in speech communication is a topic that has recently received considerable attention. Automatic speech recognition (ASR) and text-to-speech (TTS) synthesis are examples of popular fields in which the processing of emotions can have a substantial impact and can improve the effectiveness and naturalness of the man-machine interaction.

 

Speech synthesis is by now a stable technology that enables computers to talk. While existing synthesis techniques produce speech that is intelligible, few people would claim that listening to computer speech is natural or expressive. Therefore, in recent years, research in speech synthesis has focused strongly on producing speech that sounds more natural or human-like, mainly with the aim of emulating human behavior in man-machine communication interfaces. Meanwhile, emotions and their role in human-to-human and human-to-machine (and vice versa) communication have become an interesting research topic. Recently, new expressive/emotive human-machine interfaces that try to simulate human behavior in man-machine dialogues have been studied, and various attempts to incorporate the expression of emotions into synthetic speech have been made.

 

The topic of this work is an extension of our previous research on the development of a general data-driven procedure for creating a neutral “narrative-style” prosodic module for the Italian FESTIVAL Text-To-Speech (TTS) synthesizer. It focuses on investigating and implementing new strategies for building a new emotional FESTIVAL TTS that produces emotionally adequate speech starting from an APML/VSML tagged text. The new emotional prosodic modules, like the neutral one, are based on “Classification And Regression Tree” (CART) theory. The extension to emotional speech synthesis is obtained using a differential approach: the emotional prosodic module tries to learn the differences between the neutral (without emotions) and the emotional prosodic data. Moreover, since Voice Quality (VQ) is known to play an important role in emotive speech, a rule-based FESTIVAL-MBROLA VQ-modification module, which provides a simple way to modify the temporal and spectral characteristics of the synthesized speech, has also been implemented. Even if emotional synthesis remains an attractive open issue, our preliminary evaluation results underline the effectiveness of the proposed solution.



Goals

The goal of this work is to investigate and implement strategies that allow a synthesizer to produce emotional speech. This goal is relevant to both Text-to-Speech (TTS) and Concept-to-Speech (CTS) synthesis, and there are many possible application scenarios, ranging from human-machine interaction in general to speaking interfaces for impaired users, electronic games, virtual agents, and story-telling.


CART-based emotional prosodic modules

The topic of this work is an extension of our previous research on the development of a neutral TTS.

A general statistical CART-based data-driven procedure was developed for creating a “narrative-style” (neutral) prosodic module and various “emotion-style” prosodic modules.

This work focuses on investigating and implementing new strategies for building a new emotional FESTIVAL TTS that produces emotionally adequate speech starting from an APML/VSML tagged text.

The new emotional prosodic modules, like the neutral one, are based on “Classification And Regression Tree” (CART) theory. The extension to emotional speech synthesis is obtained using a differential approach: the emotional prosodic module tries to learn the differences between the neutral (without emotions) and the emotional prosodic data (see the following Figure).



Block diagram of the emotional differential approach
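
To make the differential approach concrete, the following minimal sketch (in Python, using scikit-learn's CART implementation; the feature set and data are synthetic placeholders, not the actual FESTIVAL module) trains one regression tree on neutral phoneme durations and a second tree on the emotional-minus-neutral differences, and sums the two predictions at synthesis time.

# Minimal sketch of the differential approach: one CART predicts neutral
# durations, a second CART predicts the emotional-minus-neutral difference.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical linguistic feature vectors for each phoneme
# (e.g., phone class, stress, position in syllable/word/phrase).
rng = np.random.default_rng(0)
X = rng.random((1000, 8))

# Phoneme durations (z-scores) from the neutral and the emotional corpus,
# aligned on the same text (synthetic values here).
dur_neutral = rng.standard_normal(1000)
dur_emotion = dur_neutral + 0.3 * X[:, 0] + 0.1 * rng.standard_normal(1000)

# Neutral CART, as in the narrative-style module.
neutral_cart = DecisionTreeRegressor(max_depth=8).fit(X, dur_neutral)

# Differential CART: learns the emotional-minus-neutral differences.
diff_cart = DecisionTreeRegressor(max_depth=8).fit(X, dur_emotion - dur_neutral)

def predict_emotional_duration(features):
    # Neutral prediction plus the learned emotional difference.
    features = np.atleast_2d(features)
    return neutral_cart.predict(features) + diff_cart.predict(features)

print(predict_emotional_duration(X[:3]))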


Voice Quality

Much research in the field has emphasized the importance of prosodic features (e.g., speech rate, intensity contour, F0, F0 range) and of voice quality in the rendering of different emotions in verbal communication. In TTS technologies, voice processing algorithms for emotional speech synthesis have mainly focused on the control of phoneme duration and pitch, which are the principal parameters conveying prosodic information; however, Voice Quality (VQ) is also known to play an important role in emotive speech.

On the side of voice quality transformations for speech synthesis, some recent studies have addressed the exploitation of source models within the framework of articulatory synthesis to control the characteristics of voice phonation. More sophisticated transformations have also been proposed recently, such as the transformation of spectral features for speaker conversion.

Speech production in general, and emotional speech in particular, is characterized by a wide variety of phonation modalities. Voice quality, which is the term commonly used in the field, has an important role in the communication of emotions through speech, and nonmodal phonation modalities (for example soft, breathy, whispery, and creaky) are commonly found in emotional speech corpora. We discuss here a voice synthesis framework that makes it possible to control a set of acoustic parameters relevant to the simulation of nonmodal voice qualities. The set of controls of the synthesizer includes standard controls for the duration and pitch of the phonemes, and additional controls for intensity, spectral emphasis, fast and slow variations of the duration and amplitude of the waveform periods (for voiced frames), frequency-axis warping for changing the formant positions, and aspiration noise level. The following set of cues, which are among those most commonly found in investigations of emotive speech, has been selected for our analysis as voice quality correlates of emotions:

  • Shimmer and Jitter: the cycle-to-cycle variations of waveform amplitude and fundamental period, respectively
  • Harmonic-to-Noise Ratio (HNR): the ratio of the energy of the harmonic part to the energy of the remaining part of the signal
  • Hammarberg Index (HammI): the difference between the energy maximum in the 0-2000 Hz frequency band and in the 2000-5000 Hz band
  • Drop-off of spectral energy above 1000 Hz (Do1000): the gradient of the least-squares approximation of the spectral slope above 1000 Hz
  • Relative amount of energy in the high- (above 1000 Hz) versus the low-frequency range (up to 1000 Hz) of the voiced spectrum (Pe1000)
  • Spectral flatness measure (SFM): the ratio of the geometric to the arithmetic mean of the spectral energy distribution

Some guidelines are given on how to combine these signal transformations with the aim of reproducing some nonmodal voice qualities, including soft, loud, breathy, whispery, hoarse, and tremulous voice. These voice qualities characterize emotional speech in different ways.
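
As a reference for how some of the above correlates can be measured, the following sketch (illustrative Python/NumPy, assuming that pitch period boundaries and peak amplitudes have already been extracted, e.g. with PRAAT) computes jitter, shimmer, the spectral flatness measure, and a Hammarberg-style index on a single frame; it is a simplified reading of the definitions above, not the analysis code used in this work.

import numpy as np

def jitter_shimmer(period_lengths, period_amplitudes):
    # Mean cycle-to-cycle variation of period length (jitter) and of peak
    # amplitude (shimmer), relative to their mean values.
    T = np.asarray(period_lengths, dtype=float)
    A = np.asarray(period_amplitudes, dtype=float)
    return (np.mean(np.abs(np.diff(T))) / np.mean(T),
            np.mean(np.abs(np.diff(A))) / np.mean(A))

def spectral_flatness(frame):
    # SFM: ratio of the geometric to the arithmetic mean of the power spectrum.
    power = np.abs(np.fft.rfft(frame)) ** 2 + 1e-12
    return np.exp(np.mean(np.log(power))) / np.mean(power)

def hammarberg_index(frame, fs):
    # Difference (in dB) between the energy maxima of the 0-2000 Hz
    # and 2000-5000 Hz bands.
    power = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), 1.0 / fs)
    low = power[(freqs >= 0) & (freqs < 2000)].max()
    high = power[(freqs >= 2000) & (freqs < 5000)].max()
    return 10.0 * np.log10(low / (high + 1e-12))

# Example with synthetic per-period measurements and a noise frame.
print(jitter_shimmer([100, 102, 99, 101, 103], [0.80, 0.78, 0.82, 0.79, 0.81]))
frame = np.random.randn(1024)
print(spectral_flatness(frame), hammarberg_index(frame, fs=16000))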


FESTIVAL & MBROLA implementation issues

For the analysis of emotive speech and the extraction of the emotional VQ indexes, the speech signal was manually segmented and analysed by means of PRAAT and of Matlab routines.

 

A rule-based FESTIVAL-MBROLA VQ-modification module, which provides a simple way to modify the temporal and spectral characteristics of the TTS-synthesized speech, has been implemented. Our system is in fact based on the FESTIVAL speech synthesis framework and on the MBROLA diphone concatenation acoustic back-end.

 

Processing of voice units

A graphic tool, displayed in the following Figure, was developed in Matlab for the off-line processing of short phonetic units (diphones). The tool is intended as a prototyping utility that allows new signal processing algorithms to be tested on each single diphone in a voice database. Some of the processing functions provided are: spectral emphasis/de-emphasis for spectral-slope control, spectral warping, and aspiration noise modelling. No support was provided for pitch-related cues, such as jitter and F0 modulations, since diphone concatenation synthesizers rely on pitch control algorithms such as OLA-based processing routines. To date, the tool is compatible with MBROLA diphone databases.


An interactive tool for the design of signal processing of diphones
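
As a rough indication of the kind of frequency-domain processing the tool prototypes, the sketch below (Python/NumPy, not the actual Matlab code; the 500 Hz reference frequency is an arbitrary assumption) applies spectral emphasis or de-emphasis to one frame by reshaping its DFT magnitude with a dB-per-octave tilt and resynthesizing with the original phase.

import numpy as np

def spectral_tilt(frame, fs, db_per_octave, f_ref=500.0):
    # Reshape the DFT magnitude with a gain that changes by db_per_octave
    # per octave above f_ref (positive = emphasis, negative = de-emphasis),
    # keeping the original phase, then return the resynthesized frame.
    spec = np.fft.rfft(frame)
    freqs = np.fft.rfftfreq(len(frame), 1.0 / fs)
    octaves = np.log2(np.maximum(freqs, 1.0) / f_ref)
    gain_db = db_per_octave * np.maximum(octaves, 0.0)
    spec *= 10.0 ** (gain_db / 20.0)
    return np.fft.irfft(spec, n=len(frame))

# De-emphasize the high frequencies of a synthetic frame by 6 dB/octave.
frame = np.random.randn(512)
softer = spectral_tilt(frame, fs=16000, db_per_octave=-6.0)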

 

Extensions to the Mbrola synthesizer

The diphone processing approach to voice quality control has been implemented by embedding the effects into the synthesizer we adopted for Italian speech synthesis, namely the MBROLA diphone concatenation synthesizer.



FESTIVAL-MBROLA Emotional Synthesizer

 

We faced this task by allowing the online processing of the diphones as an intermediate step of the concatenation procedure (see the following Figure). This step has been implemented using both spectral processing, based on DFT and inverse-DFT transforms, and time-domain processing for pitch-related effects. The MBROLA speech synthesizer, which originally provides controls for pitch and phoneme duration, has been further extended to allow the control of a set of low-level acoustic parameters that can be combined to produce the desired voice quality effects. The time evolution of the parameters can be controlled over each single phoneme by means of control curves. The extended set includes gain ("Vol"), spectral tilt ("SpTilt"), shimmer ("Shim"), jitter ("Jit"), aspiration noise ("AspN"), F0 flutter ("F0Flut"), amplitude flutter ("AmpFlut"), and spectral warping ("SpWarp"). A study on how these low-level effects combine to obtain the principal non-modal phonation types encountered in emotive speech is in progress, and more details are reported in the following section on mark-up language extensions. Here we give a rough description of how these low-level acoustic controls were implemented:

  • Gain ("Vol"): gain control is obtained by a simple rescaling of the spectrum modulus.

  • Spectral tilt ("SpTilt"): the spectral balance is changed by a reshaping function in the frequency domain that enhances or attenuates the low- and mid-frequency regions, thus changing the overall spectral tilt.

  • Shimmer ("Shim"): the difference between the amplitudes of consecutive periods; it is reproduced by introducing random amplitude modulations into consecutive periods of the voiced parts of phonemes.

  • Jitter ("Jit"): the period-length difference between consecutive periods; it is reproduced by adding random pitch deviations to the pitch control curves computed by MBROLA's prosody matching module.

  • Aspiration noise ("AspN"): for voiced frames, aspiration noise is generated from the frame DFT transform, by inverse transformation of a high-pass filtered version of the spectral magnitude with a random spectral phase.

  • F0 flutter ("F0Flut"): random low-frequency fluctuations of the pitch are reproduced as for jitter; the low-frequency fluctuations are obtained by band-pass filtering random noise with a second-order band-pass filter tuned in the 4-10 Hz range.

  • Amplitude flutter ("AmpFlut"): random low-frequency amplitude fluctuations are obtained as for shimmer; the low-frequency fluctuations are obtained by band-pass filtering random noise with a second-order band-pass filter tuned in the 4-10 Hz range.

  • Spectral warping ("SpWarp"): the raising or lowering of the upper formants is obtained by warping the frequency axis of the spectrum (through a bilinear transformation) and by interpolating the resulting spectrum magnitude with respect to the DFT frequency bins.
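
As an illustration of the last control, the following sketch (an illustrative Python fragment, not the code embedded in MBROLA; here the warping coefficient is a signed value around 0 rather than the 0-1 SpWarp control) warps the frequency axis of a frame's DFT magnitude with a first-order bilinear (allpass) map and re-samples the warped magnitude on the original DFT bins, so that positive values raise the formants and negative values lower them.

import numpy as np

def warp_spectrum(frame, alpha):
    # Warp the frequency axis of the DFT magnitude with a first-order
    # allpass (bilinear) map and interpolate the warped magnitude back
    # onto the original DFT bins; alpha in (-1, 1), alpha > 0 raises the
    # formants, alpha < 0 lowers them. The original phase is kept.
    spec = np.fft.rfft(frame)
    mag, phase = np.abs(spec), np.angle(spec)
    omega = np.linspace(0.0, np.pi, len(mag))      # original bin frequencies
    warped = omega + 2.0 * np.arctan2(alpha * np.sin(omega),
                                      1.0 - alpha * np.cos(omega))
    new_mag = np.interp(omega, warped, mag)        # resample on original bins
    return np.fft.irfft(new_mag * np.exp(1j * phase), n=len(frame))

# Slightly raise the upper formants of a synthetic frame.
frame = np.random.randn(512)
brighter = warp_spectrum(frame, alpha=0.1)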

The Mbrola parser has been modified in order to permit the use of the low-level acoustic controls as general commands or as curves specified at the phoneme level (see the following example of an extended phonetic ".pho" file):

Vol=0 ;
SpTilt=0.0 ;
Shim=0.0 ;
Jit=0.0 ;
AspN=0.0 ;
F0Flut=0.0 ;
AmpFlut=0.0 ;
SpWarp=0.3
_       25 100 143
a1     309 5 151 20 142 40 150 60 141 80 126 100 116 Shim 0 0.1 100 0.2
v       85.3333 0 112 50 118 100 127 Shim 0 0.3 100 0.2
a       334 0 127 20 126 40 118.1250 60 113 80 106 100 148 Vol 0 -3 100 -5 Shim 0 0.2 100 0.4 Jit 0 0.06 100 0.06
_       10

The spectral warping command affects all phonemes with the constant value 0.3, whereas different gain, shimmer, and jitter control curves are specified for the different phonemes.
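
A minimal parser for this extended format might look as follows (an illustrative Python sketch based only on the example above: per-phoneme duration in ms, optional (position %, F0) pairs, and optional named VQ curves given as (position %, value) pairs; the real MBROLA parser is written in C and may differ in detail).

# Controls recognized by the extended synthesizer (from the list above).
PARAMS = {"Vol", "SpTilt", "Shim", "Jit", "AspN", "F0Flut", "AmpFlut", "SpWarp"}

def parse_extended_pho(lines):
    # Returns (global_settings, phones), where each phone is a tuple
    # (phoneme, duration_ms, [(pos %, f0)], {param: [(pos %, value)]}).
    global_settings, phones = {}, []
    for line in lines:
        line = line.strip().rstrip(";").strip()
        if not line:
            continue
        if "=" in line and line.split("=")[0].strip() in PARAMS:
            name, value = line.split("=")
            global_settings[name.strip()] = float(value)
            continue
        tokens = line.split()
        phoneme, duration = tokens[0], float(tokens[1])
        f0_points, curves, current, i = [], {}, None, 2
        while i < len(tokens):
            if tokens[i] in PARAMS:                # start of a named VQ curve
                current = curves.setdefault(tokens[i], [])
                i += 1
            else:                                  # (position %, value) pair
                point = (float(tokens[i]), float(tokens[i + 1]))
                (f0_points if current is None else current).append(point)
                i += 2
        phones.append((phoneme, duration, f0_points, curves))
    return global_settings, phones

example = """SpWarp=0.3
_       25 100 143
a1     309 5 151 20 142 40 150 60 141 80 126 100 116 Shim 0 0.1 100 0.2
_       10"""
print(parse_extended_pho(example.splitlines()))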


Affective/Expressive/Emotional mark-up languages (APML/VSML)

Affective tags can be included in the input text to be converted. To this aim, FESTIVAL was provided with support for affective tags through ad-hoc mark-up languages (APML/VSML), and for driving the extended MBROLA synthesis engine through the generation of voice quality controls. The control of the acoustic characteristics of the voice signal is based on signal processing routines applied to the diphones before the concatenation step. Time-domain algorithms are used for the cues related to pitch control, whereas frequency-domain algorithms, based on FFT and inverse-FFT, are used for the cues related to the short-term spectral envelope of the signal.

The APML markup language for behavior specification makes it possible to mark up the verbal part of a dialog so as to add the "meanings" that the graphical and speech generation components of an animated agent need in order to produce the required expressions. So far, the language defines the components that may be useful to drive a face animation through the facial animation parameters (FAP) and facial display functions. A scheme for the extension of the previously developed affective presentation mark-up language (APML) has been studied; the extension is intended to support voice-specific controls. An extended version of the APML language has been included in the FESTIVAL speech synthesis environment, allowing the automatic generation of the extended .pho file from an APML text with emotive tags. This module implements a three-level hierarchy in which the affective high-level attributes (e.g., <anger>, <joy>, <fear>, etc.) are described in terms of medium-level voice quality attributes defining the phonation type (e.g., <modal>, <soft>, <pressed>, etc.). These medium-level attributes are in turn described by a set of low-level acoustic attributes defining the perceptual correlates of the sound (e.g., <spectral tilt>, <shimmer>, <jitter>, etc.). The low-level acoustic attributes correspond to the acoustic controls that the extended MBROLA synthesizer can render through the sound processing procedure described above. In the following Figure, an example of a qualitative description of high-level attributes through medium- and low-level attributes is shown.

Qualitative description of voice quality for "fear" in terms of acoustic features


This descriptive scheme has been implemented within FESTIVAL as a set of mappings between high-level and low-level descriptors. The implementation includes the use of envelope generators to produce the time curves of each parameter. This APML extension allows the generation of emotive facial animation and speech synthesis, as schematically represented in the following Figure. The systematic generation of a set of audiovisual stimuli, which will be followed by a set of perceptual assessment tests, is work in progress.


Detailed representation of the "language processing" block implemented through
the APML extensions and the statistical CARTs for prosody
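
The envelope generators mentioned above can be thought of as simple interpolators over the (position %, value) pairs of each control curve; the sketch below (hypothetical Python, with an assumed 10 ms frame step) shows one way to turn such a curve into a per-frame parameter trajectory for a single phoneme.

import numpy as np

def envelope(control_points, duration_ms, frame_ms=10.0):
    # Turn a control curve given as (position %, value) pairs into a
    # per-frame parameter trajectory over one phoneme, by linear interpolation.
    positions = np.array([p for p, _ in control_points]) * duration_ms / 100.0
    values = np.array([v for _, v in control_points])
    frame_times = np.arange(0.0, duration_ms, frame_ms)
    return np.interp(frame_times, positions, values)

# Shimmer rising from 0.2 to 0.4 over a 334 ms phoneme
# (cf. the "a" phoneme in the extended .pho example above).
print(envelope([(0, 0.2), (100, 0.4)], duration_ms=334))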

Given the hierarchical structure of the acoustic description of emotive voice, we performed preliminary experiments focused on the definition of speaker-independent rules to control voice quality within a text-to-speech synthesizer. Different sets of rules describing the high- and medium-level attributes in terms of low-level acoustic cues were used to generate the phonetic files that drive the extended MBROLA synthesizer. The following Table shows the low-level components used to describe the given set of medium-level descriptors: soft, loud, whispery, breathy, tremulous, and hoarse. The Table reports the control parameters and the activation level of each parameter. Values are in the range [0,1] and have different meanings for the different parameters. For example, SpTilt=0 means maximal de-emphasis of the higher frequency range, whereas SpTilt=1 means maximal emphasis; AspNoise=0 means absence of the noise component, whereas AspNoise=1 means absence of the voiced component, leaving the aspiration noise component alone; for F0Flut, Shimmer, and Jitter, a value of 0 means the effect is off, whereas a value of 1 means the effect is maximal; SpWarp=0 means maximal spectrum shrinking, and SpWarp=1 means maximal spectrum stretching.

Medium level voice quality description in terms of low-level acoustic components

  • soft: (SpTilt, 0.3)
  • loud: (SpTilt, 0.7)
  • whispery: (AspNoise, 1.0)
  • breathy: (AspNoise, 0.2), (SpTilt, 0.05)
  • tremulous: (F0Flut, 0.9), (SpWarp, 0.3)
  • hoarse: (Jitter, 0.3), (Shimmer, 0.1), (AspNoise, 0.2)
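
These rules translate directly into a mapping from medium-level phonation types to low-level control settings. The sketch below (hypothetical Python; the values are those of the Table, while the linear scaling by the tag's intensity level is only an assumption) expands a <voqual> type into the corresponding low-level parameters; note that the extended .pho file uses the short names Shim, Jit and AspN for the same controls.

# Medium-level phonation types expressed as low-level acoustic controls
# (values from the Table above; the scaling by 'level' is an assumption).
VOICE_QUALITY_RULES = {
    "soft":      {"SpTilt": 0.3},
    "loud":      {"SpTilt": 0.7},
    "whispery":  {"AspNoise": 1.0},
    "breathy":   {"AspNoise": 0.2, "SpTilt": 0.05},
    "tremulous": {"F0Flut": 0.9, "SpWarp": 0.3},
    "hoarse":    {"Jitter": 0.3, "Shimmer": 0.1, "AspNoise": 0.2},
}

def expand_voqual(vq_type, level=1.0):
    # Low-level control settings for a <voqual type=... level=...> tag.
    return {param: value * level
            for param, value in VOICE_QUALITY_RULES[vq_type].items()}

print(expand_voqual("tremulous", level=1.0))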

The following Figure compares an example of synthesis obtained with tremulous voice with a similar sentence obtained with modal voice. The tagged text used to generate the synthesis was as follows:

<vsml>
<performative type="inform">
<voqual type="modal" level="1.0">Questa e' la mia voce modale.</voqual>
<voqual type="tremulous" level="1.0">Questa e' la mia voce tremante.</voqual>
</performative>
</vsml>


Spectrograms of the utterance "Questa è la mia voce modale" ("This is my modal voice") in the left panel, and of the utterance "Questa è la mia voce tremante" ("This is my tremulous voice") in the right panel. Both utterances were obtained by the modified FESTIVAL/MBROLA TTS system using a VSML input text.


Audio/Visual Integration

The FAP stream generation component of an MPEG-4 Talking Head such as LUCIA and the audio synthesis components have been integrated into a single system able to produce facial animation, including emotive audio and video cues, from tagged text. The facial animation framework relies on previous studies for the realization of Italian talking heads. A schematic view of the whole system is shown in the following Figure. The modules used to produce the FAP control stream (AVENGINE) and the speech synthesis phonetic control stream (FESTIVAL) are synchronized through the phoneme duration information. The output control streams are in turn used to drive the audio and video rendering engines (i.e., the MBROLA speech synthesizer and the face model player).


Block diagram of the system designed to produce the facial animation with emotive audio and video cues, from tagged text


Emotional DBs

In order to collect the necessary amount of emotional speech data to train the TTS prosodic models, a professional actor was asked to produce vocal expressions of emotion (often using standard verbal content) based on emotion labels and/or typical scenarios. The Emotional-CARINI (E-Carini) database recorded for this study contains the recording of a novel (“Il Colombre” by Dino Buzzati) read and acted by a professional Italian actor with different elicited emotions. According to Ekman's theory, six basic emotions, plus a neutral one, have been taken into consideration: anger, disgust, fear, happiness, sadness, and surprise. The duration of the database is about 15 minutes for each emotion.


Evaluation

Objective Evaluation
An objective evaluation of the prosodic modules was performed by splitting both the Carini and the E-Carini databases into a training set (90%) and a test set (10%), and measuring the differences between the synthetic prosody and the actual prosody in the test set. A first numerical indication of how good a prosodic module is can be given by the RMSE and the correlation r between the original prosodic signal and the predicted one; a less significant index is the absolute error |e| between the two prosodic patterns.
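
For reference, the three measures can be computed as in the following sketch (illustrative Python, not the actual evaluation script), given the original and predicted prosodic values of the test set.

import numpy as np

def prosody_errors(original, predicted):
    # RMSE, Pearson correlation r, and mean/variance of the absolute error |e|
    # between the original and the predicted prosodic values.
    original = np.asarray(original, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    e = original - predicted
    rmse = np.sqrt(np.mean(e ** 2))
    r = np.corrcoef(original, predicted)[0, 1]
    return rmse, r, np.abs(e).mean(), np.abs(e).var()

# Example on synthetic z-score durations.
rng = np.random.default_rng(0)
orig = rng.standard_normal(200)
pred = orig + 0.3 * rng.standard_normal(200)
print(prosody_errors(orig, pred))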

The following Table shows the RMSE and the correlation r between the original and the predicted values computed by the duration module for the different emotions. The mean and the variance of the absolute error |e| are also given. The values in the first three columns are expressed in z-score units, while the values in the last three columns are expressed in seconds.

Duration prediction results for the different emotions

Looking at the z-score RMSE and correlation columns, the best performance is obtained for the neutral duration module, and the worst result is obtained for anger. The surprise and joy modules have a high correlation, and their CARTs were the most complex in terms of number of leaves. Sadness also performs well, while surprise, disgust, and fear have mid-low scores. The following Table shows the results for the different emotions in the objective evaluation test for the intonation module. Also for intonation, the best performance has been obtained for neutral. As for the emotions, the best RMSE performance is obtained for disgust, and the worst result has been obtained for surprise.

Intonation prediction results for the different emotions on the test set
(the values on the first three columns are expressed in pitch normalized units)

Subjective Evaluation
The effectiveness of the prosodic modules and of the voice quality modifications was also assessed with perceptual tests aimed at evaluating: a) the individual contribution to emotional expressiveness made separately by the emotional prosodic modules and by the emotive voice quality modifications, and b) the synergistic contribution given by the combination of these two correlates of emotive speech. Four types of test sentences were generated:
     (A)    neutral prosody without emotive VQ modifications;
     (B)    emotive prosody without emotive VQ modifications;
     (C)    neutral prosody with emotive VQ modifications;
     (D)    emotive prosody with emotive VQ modifications.
For each emotion and for each of these four conditions, two utterances were produced by the new emotional FESTIVAL-MBROLA TTS, for a total of 48 sentences. These were presented in randomized order to 40 listeners, native to different regions of Italy, who judged, knowing the target emotion, the level of acceptability with which the given emotion was expressed in the utterances, on a MOS scale (5=excellent, 4=good, 3=fair, 2=poor, 1=bad). The results are summarized in the following Figure. In the B, C, and D cases the results are always better than those obtained in the A case, which indicates that the emotive modules were quite successful. The D case always shows better MOS values, which indicates that the created emotive prosodic modules clearly improve the acceptability of the emotional TTS. The emotive VQ modifications alone were superior to the neutral case, apart from fear and sadness, indicating that for these emotional moods the chosen VQ acoustic modifications should be revised.

 

Subjective Evaluation results (for A,B,C,D)


References

Cosi, P., Fusaro, A. & Tisato, G. (2003), LUCIA: a New Italian Talking-Head Based on a Modified Cohen-Massaro's Labial Coarticulation Model, in Proceedings of Eurospeech 2003, Geneva, Switzerland, September 1-4, 127-132.

d’Alessandro, C. & Doval, B. (1998), Experiments in voice quality modification of natural speech signals: the spectral approach, in Proceedings of the 3rd ESCA/COCOSDA Int. Workshop on Speech Synthesis, 277–282.

De Carolis, B., Pelachaud, C., Poggi, I. & Steedman, M. (2004), APML, a Markup Language for Believable Behavior Generation, in Life-Like Characters: Tools, Affective Functions, and Applications, H. Prendinger and M. Ishizuka, Eds., Springer.

Drioli, C. & Avanzini, F. (2003), Non-modal voice synthesis by low-dimensional physical models, in Proc. of the 3rd International Workshop on Models and Analysis of Vocal Emissions for Biomedical Applications (MAVEBA), Florence, Italy, December 10-12.

Drioli, C., Tisato, G., Cosi, P. & Tesser, F. (2003), Emotions and voice quality: experiments with sinusoidal modeling, in Proc. of Voice Quality: Functions Analysis and Synthesis (VOQUAL) Workshop, Geneva, Switzerland, August 27-29, 127-132.

Gobl, C. & Chasaide, A. N. (2003), The role of the voice quality in communicating emotions, mood and attitude, Speech Communication, vol. 40, 189–212.

Johnstone, T. & Scherer, K. R. (1999), The effects of emotions on voice quality, in Proceedings of the XIV Int. Congress of Phonetic Sciences, 2029–2032.

Ladd, D. R., Silverman, K. E. A., Tolkmitt, F., Bergmann, G. & Scherer, K. R. (1985), Evidence for the independent function of intonation contour type, voice quality, and F0 range in signaling speaker affect, Journal of the Acoustical Society of America, vol. 78, n. 2, 435–444.

Magno Caldognetto, E., Cosi, P., Drioli, C., Tisato G., and Cavicchio, F. (2004), Modifications of phonetic labial targets in emotive speech: effects of the co-production of speech and emotions, Speech Communication, vol. 44, n. 1-4 , 173-185.

Marchetto, E. (2004), Sistema per il controllo della voice quality nella sintesi del parlato emotivo (A system for voice quality control in emotional speech synthesis), Master's Thesis, University of Padova, Italy.

Schröder, M. & Grice, M. (2003), Expressing vocal effort in concatenative speech, in Proceedings of the 15th ICPhS, Barcelona, Spain, 2589–2592.

Tesser, F., Cosi, P., Drioli, C. & Tisato, G. (2004), Prosodic data driven modelling of a narrative style in FESTIVAL TTS, in Proc. of the 5th ISCA Speech Synthesis Workshop, Pittsburgh, USA, June 14-16, 185-190.


For more information please contact :

Piero Cosi

Istituto di Scienze e Tecnologie della Cognizione - Sezione di Padova "Fonetica e Dialettologia"
CNR di Padova (e-mail: cosi@pd.istc.cnr.it).

 

