Cosi: Bimodal Phonetic Recognition

Bimodal Phonetic Recognition

The system described here takes advantage of jaw and lip reading capability. It uses the ELITE system (Magno Caldognetto et al., 1989) in conjunction with an auditory model of speech processing (Seneff, 1988), which has shown great robustness in noisy conditions (Cosi, 1992). The speech signal, acquired in synchrony with the articulatory data, is prefiltered and sampled at 16 kHz. A joint synchrony/mean-rate auditory model of speech processing (Seneff, 1988) is then applied, producing 80 spectral-like parameters at a frame rate of 500 Hz. In the experiments described, the number of spectral-like parameters and the frame rate were reduced to 40 and 250 Hz respectively, in order to shorten the system training time. Input stimuli are segmented by SLAM, a recently developed semi-automatic segmentation and labeling tool (Cosi, 1993) that works on the auditory model parameters. Audio and visual parameters, used either separately or jointly, train an artificial Recurrent Neural Network (RNN) to recognize the input stimuli by means of the Back Propagation for Sequences (BPS) algorithm (Gori, Bengio and De Mori, 1989).
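The text does not say how the reduction from 80 parameters at 500 Hz to 40 parameters at 250 Hz was carried out. As a minimal sketch, assuming simple averaging of adjacent channels and of consecutive frames (an illustrative choice, not necessarily the authors' method), the step could look like the following. The function names `average_pairs` and `reduce_frames` and the dummy input data are hypothetical.

```python
def average_pairs(values):
    """Average adjacent pairs: [a, b, c, d] -> [(a + b) / 2, (c + d) / 2]."""
    return [(values[i] + values[i + 1]) / 2.0
            for i in range(0, len(values) - 1, 2)]


def reduce_frames(frames):
    """Halve both the channel count and the frame rate.

    frames: list of frames, each a list of spectral-like parameters.
    Assumption: the reduction is done by averaging; the source does not
    specify the method actually used.
    """
    # Halve the number of parameters inside each frame (80 -> 40).
    narrowed = [average_pairs(frame) for frame in frames]
    # Halve the frame rate by averaging consecutive frames (500 Hz -> 250 Hz).
    reduced = []
    for t in range(0, len(narrowed) - 1, 2):
        first, second = narrowed[t], narrowed[t + 1]
        reduced.append([(x + y) / 2.0 for x, y in zip(first, second)])
    return reduced


SAMPLE_RATE_HZ = 16000
# Audio samples between successive frames at each frame rate.
samples_per_frame_full = SAMPLE_RATE_HZ // 500
samples_per_frame_reduced = SAMPLE_RATE_HZ // 250

# One second of dummy auditory-model output: 500 frames of 80 parameters.
one_second = [[float(c) for c in range(80)] for _ in range(500)]
reduced = reduce_frames(one_second)
print(len(reduced), len(reduced[0]))
```

At a 16 kHz sampling rate, a 500 Hz frame rate corresponds to one frame every 32 samples and a 250 Hz frame rate to one frame every 64 samples, so the reduced representation carries a quarter of the original values per second of speech.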

Images

ELITE system

Block Diagram of the Bimodal Recognition System

RNN Structures

References

Magno Caldognetto E., Vagges K., Borghese N.A. and Ferrigno G. (1989), "Automatic Analysis of Lips and Jaw Kinematics in VCV Sequences", Proc. of Eurospeech 1989, Vol. 2, pp. 453-456.

Seneff S. (1988), "A Joint Synchrony/Mean Rate Model of Auditory Speech Processing", Journal of Phonetics, 16, pp. 55-76.

Cosi P. (1992), "Auditory Modelling for Speech Analysis and Recognition", in M. Cooke, S. Beet and M. Crawford (Eds.), Visual Representation of Speech Signals, John Wiley and Sons, pp. 205-212.

Cosi P. (1993), "SLAM: Segmentation and Labelling Automatic Module", Proc. Eurospeech-93, Berlin, 21-23 September, 1993, pp. 665-668.

Gori M., Bengio Y. and De Mori R. (1989), "BPS: A Learning Algorithm for Capturing the Dynamical Nature of Speech", Proc. IEEE IJCNN89, Washington, June 18-22, 1989, Vol. II, pp. 417-432.

P. Cosi, E. Magno Caldognetto, K. Vagges, G.A. Mian and M. Contolini (1994), "Bimodal Recognition Experiments with Recurrent Neural Networks", in Proceedings of IEEE ICASSP-94, International Conference on Acoustics, Speech and Signal Processing, Adelaide, Australia, 19-22 April, 1994, paper 20.8.

P. Cosi, G.A. Mian and M. Contolini (1994), "Speaker Independent Phonetic Recognition Using Auditory Modelling and Recurrent Neural Networks", in Proceedings of ICANN-94, International Conference on Artificial Neural Networks, Sorrento, Italy, 26-29 May, 1994, pp. 925-928.

P. Cosi, M. Dugatto, F.E. Ferrero, E. Magno Caldognetto and K. Vagges (1995), "Bimodal Recognition of Italian Plosives", in Proceedings of XIII International Congress of Phonetic Sciences, ICPhS-95, Stockholm, 14-18 August, 1995, Vol. 4, pp. 260-263.

P. Cosi and E. Magno Caldognetto (1995), "Spatio-Temporal Characteristics of Lips and Jaw Movements: Experimental Data and Bimodal Phonetic Recognition Applications", (to be published) in Proceedings of NATO Summer School, Bonas, France, 1995.

P. Cosi, M. Dugatto, F.E. Ferrero, E. Magno Caldognetto and K. Vagges, "Phonetic Recognition by Recurrent Neural Networks Working on Audio and Visual Information", (submitted to Speech Communication Journal, North Holland).
