Mingus


Introduction

Mingus stands for "Modular Intonation Generation Using Syntax".

Intonation generation is a step in speech synthesis which provides the intonation contour of the sentence or text to be synthesized. A correct or at least acceptable intonation contour is essential for high-quality synthetic speech.

Mingus is a new approach to intonation generation which relies heavily on syntactic analysis. It is a partial implementation of the tonal model of French intonation by Piet Mertens (1987, 1997, 1998). Tonal models represent sentence intonation as a sequence of tones associated with the syllables in the utterance.
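As a rough illustration of what "a sequence of tones associated with the syllables" can look like as data, here is a minimal sketch. The `Syllable` class and `tone_sequence` helper are hypothetical names introduced for this example only, not part of Mingus. The only tone label taken from this page is "HL"; the convention that a syllable may carry no tone is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Syllable:
    """One syllable with an optional tone label (hypothetical structure)."""
    text: str
    tone: Optional[str] = None  # None: no tone assigned (assumed convention)


def tone_sequence(syllables: List[Syllable]) -> List[str]:
    """Return the tones of the utterance in order, skipping toneless syllables."""
    return [s.tone for s in syllables if s.tone is not None]


# Illustrative utterance: only the second syllable carries a tone.
utterance = [Syllable("bon"), Syllable("jour", "HL")]
```

Calling `tone_sequence(utterance)` yields the tonal representation of the utterance as a plain list of labels.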

An important claim of the model is that the generation of intonation requires a preliminary analysis identifying syntactic structure (constituent structure, dependency relations), as well as particular syntactic constructions. Audio examples illustrating this point are available.

The algorithm for intonation generation was first described here. A detailed description can be found here, where it is combined with the syntactic analysis provided by FIPS.

System architecture

Intonation generation requires syntactic analysis and grapheme-to-phoneme conversion as preliminary steps, so a complete analysis system for French had to be assembled in order to test the approach on "random text". "Mingus TTS" is therefore also a text-to-speech system for French.

In text-to-speech synthesis, a computer program produces speech starting from plain text. This is a highly complex application consisting of several building blocks.
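The idea of chaining building blocks can be sketched as a simple pipeline. This is not the Mingus architecture: the stage names follow the steps mentioned on this page (syntactic analysis, grapheme-to-phoneme conversion, intonation generation), but every function body below is a placeholder, and the dictionary passed between stages is an assumption made for the example.

```python
from typing import Callable, Dict, List

Stage = Callable[[Dict], Dict]


def syntactic_analysis(data: Dict) -> Dict:
    # Placeholder: a real parser would return constituent structure.
    return {**data, "syntax": data["text"].split()}


def grapheme_to_phoneme(data: Dict) -> Dict:
    # Placeholder: a real converter would return phoneme strings.
    return {**data, "phonemes": [w.lower() for w in data["syntax"]]}


def intonation_generation(data: Dict) -> Dict:
    # Placeholder: one (empty) tone slot per unit produced upstream.
    return {**data, "tones": [None] * len(data["phonemes"])}


def run_pipeline(text: str, stages: List[Stage]) -> Dict:
    """Feed the output of each stage into the next one."""
    data: Dict = {"text": text}
    for stage in stages:
        data = stage(data)
    return data


PIPELINE = [syntactic_analysis, grapheme_to_phoneme, intonation_generation]
result = run_pipeline("Il pleut", PIPELINE)
```

Each stage only adds keys to the shared dictionary, so a later stage such as intonation generation can read both the syntactic and the phonemic information.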

The Mingus TTS system uses the following components:

Notes.

Audio samples

The following are utterances generated by the Mingus system starting from the written text only.

The synthetic speech occasionally contains errors at the level of the phoneme sequence, which should be disregarded when evaluating the pitch contour.

Punctuation is taken into account: an exclamation mark produces a focus (i.e. a "HL" tone) on the appropriate intonation group in the sentence.
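The punctuation rule can be sketched as follows. This is a simplified illustration, not the Mingus rule itself: the page does not say how the "appropriate" intonation group is chosen, so the sketch assumes it is the last group of the sentence. The function name `apply_punctuation` and the dictionary layout of a group are hypothetical.

```python
from typing import Dict, List


def apply_punctuation(groups: List[Dict], final_punct: str) -> List[Dict]:
    """Return a copy of the intonation groups with punctuation-driven tones.

    Assumption: the focus triggered by "!" lands on the last group.
    """
    result = [dict(g) for g in groups]  # copy, leave the input untouched
    if final_punct == "!" and result:
        result[-1]["tone"] = "HL"  # focus tone, as named on this page
    return result


groups = [
    {"words": ["quelle"], "tone": None},
    {"words": ["surprise"], "tone": None},
]
focused = apply_punctuation(groups, "!")
```

Other punctuation marks leave the groups unchanged in this sketch; how Mingus treats them is not modelled here.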

The pitch model: tone to contour mapping

The pitch model maps the tone sequence to a contour, i.e. a sequence of pitch targets (fundamental frequency values) associated with positions within a particular sound. The model allows for an independent control of

Here are some examples to illustrate this.
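A tone-to-contour mapping of this kind could be sketched as below. The pitch levels, the 6-semitone distance between them, the 100 Hz baseline, and the placement of targets at relative positions 0.0 to 1.0 within the sound are all assumptions chosen for illustration; they are not the parameters of the Mingus pitch model. The only standard fact used is the conversion from a semitone offset to a frequency ratio, `2 ** (semitones / 12)`.

```python
from typing import List, Tuple

# Hypothetical pitch levels, in semitones above the speaker's baseline.
LEVEL_ST = {"L": 0.0, "H": 6.0}


def tone_to_targets(tone: str, base_hz: float = 100.0) -> List[Tuple[float, float]]:
    """Map a tone label such as "HL" to (relative position, f0 in Hz) targets.

    A one-letter tone gets a single target in the middle of the sound;
    longer tones spread their targets evenly from start (0.0) to end (1.0).
    """
    n = len(tone)
    targets = []
    for i, level in enumerate(tone):
        position = 0.5 if n == 1 else i / (n - 1)
        f0 = base_hz * 2 ** (LEVEL_ST[level] / 12)
        targets.append((position, round(f0, 1)))
    return targets


targets = tone_to_targets("HL")
```

With these assumed values, "HL" becomes a fall from roughly half an octave above the baseline down to the baseline itself.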

Specifying natural intonation contours

Mingus uses syntactic structure to arrange words into intonation groups, to assign boundary levels (and the corresponding tones) to these groups, and to decide which parts receive a background intonation, on the basis of their position in particular constructions. This results in neutral, unmarked intonation contours. Mingus also uses punctuation (full stop, exclamation mark, question mark) to select particular tones.
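To make the link between constituent structure and boundary levels concrete, here is a deliberately simple stand-in heuristic, not the algorithm used by Mingus: the boundary level after a word is taken to be the number of constituents that end at that word. The nested-list tree format and the function name `boundary_levels` are hypothetical.

```python
from typing import List, Tuple, Union

Tree = Union[str, List["Tree"]]


def boundary_levels(tree: Tree) -> List[Tuple[str, int]]:
    """Return (word, level) pairs; level counts constituents closing at the word."""
    out: List[List] = []

    def walk(node: Tree) -> None:
        if isinstance(node, str):
            out.append([node, 0])
            return
        start = len(out)
        for child in node:
            walk(child)
        if len(out) > start:  # only non-empty constituents close a boundary
            out[-1][1] += 1

    walk(tree)
    return [(word, level) for word, level in out]


# Illustrative bracketing: [[le chat] [dort]]
levels = boundary_levels([["le", "chat"], ["dort"]])
```

In this example the strongest boundary falls after "dort", where both the verb group and the whole sentence end, and a weaker one after "chat", at the end of the noun phrase.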

Spontaneous speech, however, shows richer contours expressing a particular meaning. This information is not provided by syntax, but it can be added explicitly in the input using a particular annotation scheme. The following samples illustrate this.


Evaluation

Audio examples demonstrating the importance of syntax for intonation generation are given here.

Other projects on intonation generation for synthetic speech include:


© 1999-2003, P. Mertens
Last updated: December 29, 2008.