Introduction
Mingus stands for "Modular Intonation Generation Using Syntax".
Intonation generation is a step in speech synthesis which provides
the intonation contour of the sentence or text to be synthesized.
A correct or at least acceptable intonation contour is essential for high-quality synthetic speech.
Mingus is a new approach to intonation generation which relies heavily on syntactic analysis.
It is a partial implementation of the tonal model of French intonation by Piet Mertens (1987, 1997, 1998).
Tonal models represent sentence intonation as a sequence of tones associated with the syllables in the utterance.
An important claim of the model is that the generation of intonation requires a
preliminary analysis identifying syntactic structure (constituent structure, dependency relations),
as well as particular syntactic constructions.
Follow this link to listen to examples illustrating this point.
The algorithm for intonation generation was first described here.
A detailed description can be found here, where it is combined with the syntactic analysis provided by FIPS.
System architecture
Intonation generation requires syntactic analysis and grapheme-to-phoneme conversion
as preliminary steps.
To test the approach on arbitrary text, a complete analysis system for French
therefore had to be assembled.
As a result, "Mingus TTS" is also a text-to-speech system for French.
In text-to-speech synthesis, a computer program produces speech
starting from plain text.
This is a highly complex application which consists of several building blocks.
Click on the picture to see a full-scale block diagram.
The Mingus TTS system uses the following components:
- the Morlex lexical database and software for stemming
- a parser for unification-based grammars, and a small grammar for a subset of French
- the grapheme-to-phoneme conversion of LiPSS
- the Mingus intonation generation based on syntax
- a pitch model, for tone-to-contour mapping
- a duration model, to determine the duration of speech sounds
- the MBROLA speech synthesizer
Notes.
- The FIPS syntactic analysis for French has also been used as an alternative to Morlex lemmatisation, Vertex analysis and LiPSS grapheme-to-phoneme conversion.
- Mingus is programmed entirely in Prolog, using SWI-Prolog.
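The chain of components listed above can be pictured as a sequence of stages applied to one utterance. The following is only an illustrative sketch in Python, not the actual Prolog implementation: the `Utterance` class, `make_stage`, the stage names and the assumption that the components run in the order in which they are listed are all hypothetical. Each stage is a placeholder that merely records its name.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Utterance:
    """Text travelling through the pipeline, plus a log of visited stages."""
    text: str
    trace: List[str] = field(default_factory=list)


Stage = Callable[[Utterance], Utterance]


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that only records that it has run."""
    def stage(utt: Utterance) -> Utterance:
        utt.trace.append(name)
        return utt
    return stage


# Assumed processing order: the order of the component list above.
STAGE_NAMES = [
    "lexical lookup (Morlex)",
    "parsing",
    "grapheme-to-phoneme conversion (LiPSS)",
    "intonation generation (Mingus)",
    "pitch model",
    "duration model",
    "speech synthesis (MBROLA)",
]
STAGES: List[Stage] = [make_stage(name) for name in STAGE_NAMES]


def run_pipeline(text: str) -> Utterance:
    """Pass the text through every stage in order."""
    utt = Utterance(text)
    for stage in STAGES:
        utt = stage(utt)
    return utt
```

Calling `run_pipeline("Si tu partais, je serais triste à mourir.")` returns an `Utterance` whose `trace` lists the seven placeholder stages in order. In a real system each stage would enrich the utterance with its own analysis rather than just log its name.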
Audio samples
The following are utterances generated by the Mingus system starting from the written text only.
Click on the icon to listen to them.
The synthetic speech occasionally contains errors in the phoneme sequence;
these should be disregarded when evaluating the pitch contour.
- "Si tu partais, je serais triste à mourir." (speaker fr1)
- "Si tu partais, je serais triste à mourir." (speaker fr3)
- "Les policiers ont retrouvé les traces que les ravisseurs avaient laissées."
Punctuation is taken into account: an exclamation mark
produces a focus (i.e. a "HL" tone) on the appropriate intonation group
in the sentence.
- "L'herbe du jardin pousse lentement, décidément!"
The pitch model: tone to contour mapping
The pitch model maps the tone sequence to a contour, i.e. a sequence of pitch targets (fundamental frequency values) associated with positions within a particular sound.
The model allows independent control of:
- pitch range: the melodic interval between the "low" and "high" levels.
- floor: the pitch at the bottom of the speaker's tonal range,
which is reached at the end of a declarative sentence.
- slope: the slope of the "low" and "high" pitch levels as they evolve in time. This accounts for the declination effect.
- ceiling: the top of the speaker's tonal range.
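As a rough illustration of how the range and slope parameters could interact, the sketch below computes the "low" and "high" levels in Hz at a relative position in the utterance. It is written in Python for illustration only and is not the Mingus pitch model. The names `Grid`, `shift_semitones` and `level_hz` are hypothetical. The sketch assumes that the "low" level is given in Hz at the start of the utterance, and that the slope is the total drift in semitones (ST) from start to end, applied linearly on the semitone scale. Floor and ceiling are left out. The only fixed fact used is that one semitone corresponds to a frequency ratio of 2^(1/12).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grid:
    """Hypothetical container for the pitch parameters (all values assumed)."""
    low_hz: float = 120.0    # "low" level at the start of the utterance, in Hz
    range_st: float = 7.0    # interval between "low" and "high", in semitones
    slope_st: float = -3.0   # total drift of both levels over the utterance, in ST


def shift_semitones(freq_hz: float, semitones: float) -> float:
    """Transpose a frequency by a number of semitones (12 ST = one octave)."""
    return freq_hz * 2.0 ** (semitones / 12.0)


def level_hz(grid: Grid, level: str, position: float) -> float:
    """Return the "low" or "high" level in Hz at a relative position in [0, 1]."""
    if level not in ("low", "high"):
        raise ValueError("level must be 'low' or 'high'")
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must lie between 0 and 1")
    # Both levels drift together, so the range stays constant in semitones.
    low_now = shift_semitones(grid.low_hz, grid.slope_st * position)
    if level == "low":
        return low_now
    return shift_semitones(low_now, grid.range_st)
```

With a flat slope and a range of 12 ST, the "high" level sits exactly one octave above the "low" level throughout the utterance. A negative slope lowers both levels progressively, which is one simple way to picture declination.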
The following examples illustrate these parameters.
- The sentence is repeated 3 times with an increasing value for the "low" level.
  This property can be specified in the input:
  Input: "<grid low=120> Depuis quelques jours, il est parti en vacances."
- The sentence is repeated 3 times with a pitch range of 5, 7, and 9 ST.
  Input: "<grid range=7> La synthèse de la parole à partir du texte."
- The sentence is repeated 3 times with a flat, falling or rising slope (0, -3 and 3 ST, resp.).
  Input: "<grid slope=0> Depuis quelques jours, il est parti en vacances."
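The inputs above place a `<grid ...>` annotation in front of the sentence. As a minimal sketch of how such a line could be split into settings and text, here is a small Python function. It is not the Mingus parser: the function name `split_grid_tag` and the regular expressions are hypothetical, and the handling of several attributes inside one tag is an assumption, since the examples show only one attribute per tag.

```python
import re
from typing import Dict, Tuple

# A leading tag such as <grid low=120>, followed by the sentence itself.
GRID_TAG = re.compile(r"^\s*<grid\b([^>]*)>\s*")
# name=value pairs with a (possibly negative, possibly decimal) number.
ATTRIBUTE = re.compile(r"(\w+)\s*=\s*(-?\d+(?:\.\d+)?)")


def split_grid_tag(line: str) -> Tuple[Dict[str, float], str]:
    """Split an input line into its grid settings and the remaining text."""
    match = GRID_TAG.match(line)
    if match is None:
        # No annotation: return no settings and the text unchanged.
        return {}, line.strip()
    settings = {
        name: float(value)
        for name, value in ATTRIBUTE.findall(match.group(1))
    }
    return settings, line[match.end():].strip()
```

For the first example above, `split_grid_tag` yields the setting `low` with value 120.0 and the bare sentence "Depuis quelques jours, il est parti en vacances.". A line without an annotation is returned unchanged with an empty settings dictionary.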
Specifying natural intonation contours
Mingus uses syntactic structure to arrange words into intonation groups,
to assign boundary levels (and the corresponding tones) to these groups,
and to decide which parts receive a background intonation, on the basis
of their position in particular constructions.
This results in neutral, unmarked intonation contours.
Mingus also uses punctuation (full stop, exclamation mark, question mark)
to select particular tones.
Spontaneous speech, however, shows richer contours expressing a particular
meaning. This information is not provided by syntax, but it can be added
explicitly in the input using a particular annotation scheme.
The following samples illustrate this.
Evaluation
Audio examples demonstrating the importance of syntax for intonation generation are given here.
Other projects on intonation generation for synthetic speech include: