Prosogram, v2.13

Pitch contour stylization based on a tonal perception model

by Piet Mertens

Prosogram is a tool for the analysis and transcription of pitch variations in speech. Its stylization simulates the auditory perception of pitch by the listener. A key element in tonal perception is the segmentation of speech into syllable-sized elements, resulting from changes in the spectrum (sound timbre) and intensity. The tool also provides measurements of prosodic features for individual syllables (such as duration, pitch, pitch movement direction and size), as well as prosodic properties of longer stretches of speech (such as speech rate, proportion of silent pauses, pitch range, and pitch trajectory). The tool can easily interact with other software tools. It serves as the first step in the automatic phonological transcription of intonation and in the detection of sentence stress and intonation boundaries.

The first illustration shows a light Prosogram with the stylization (black lines) and the pitch range (red horizontal lines indicating top, median and bottom). The annotations of sounds, syllables and words are provided by the corpus.

Figure: Wide, light Prosogram, with pitch range

The next illustration shows a rich Prosogram, which adds the parameters of F0 (blue line), intensity (green line), and voicing, as well as the segmentation, and the calibration of X and Y axes (in ST relative to 1 Hz, and in Hz). The vertical dotted lines correspond to the segmentation boundaries in the annotation.

Figure: Wide, rich Prosogram
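The semitone (ST) scale used for the Y axis can be related to Hz with the standard formula ST = 12 · log2(f / f_ref), here with f_ref = 1 Hz. The Python sketch below illustrates this conversion only; the function names are hypothetical and are not taken from Prosogram itself.

```python
import math


def hz_to_st(f_hz: float, ref_hz: float = 1.0) -> float:
    """Convert a frequency in Hz to semitones relative to ref_hz (default 1 Hz)."""
    if f_hz <= 0 or ref_hz <= 0:
        raise ValueError("frequencies must be positive")
    return 12.0 * math.log2(f_hz / ref_hz)


def st_to_hz(st: float, ref_hz: float = 1.0) -> float:
    """Inverse conversion: semitones relative to ref_hz back to Hz."""
    return ref_hz * 2.0 ** (st / 12.0)


if __name__ == "__main__":
    # Doubling the frequency (one octave) always adds 12 ST.
    for f in (100.0, 200.0, 400.0):
        print(f"{f:.0f} Hz = {hz_to_st(f):.2f} ST re 1 Hz")
```

On this scale equal musical intervals correspond to equal distances, whatever the speaker's register, which is why pitch intervals are usually expressed in ST rather than in Hz.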

The third illustration shows a light Prosogram, in a more compact size.

Figure: Compact, light Prosogram

The next figure shows a Prosogram using automatic segmentation into syllable-sized units. The magenta curve shows the intensity of the band-pass filtered speech signal, on which this segmentation is based.

Figure: Wide, rich Prosogram, using automatic segmentation

The last figure shows the screen of the interactive Prosogram. Here the user can interactively browse the speech signal and its stylization, play back parts (syllables, words...), and resynthesize the signal with the stylized pitch. (The tonal annotation in tier "polytonia" is obtained using the Polytonia script, which is not part of Prosogram.)

Figure: Interactive Prosogram window (interactive mode)




Rationale

Many phoneticians use the fundamental frequency (F0) curve to represent pitch contours in speech. F0 is an acoustic parameter; it provides useful information about the acoustic properties of the speech signal. But it certainly is not the most accurate representation of the intonation contour as it is perceived by human listeners.

In the 1970s, pitch contour stylization was introduced as a way to reduce the F0 curve to those aspects which are potentially relevant for speech communication. The approach originates from work by J. 't Hart, R. Collier, and A. Cohen at the I.P.O. (Institute for Perception Research) in Eindhoven, and was further improved by D. Hermes in the 1980s and 1990s. Other types of stylization have been proposed, such as the Momel system by D. Hirst and R. Espesser (1993) from Aix-en-Provence. However, most of these stylization approaches are based on statistical or mathematical properties of the F0 data and ignore the facts of pitch perception.

It is well known that the auditory perception of pitch variations depends on many factors other than F0 variation itself. In 1995 a stylization based on the simulation of tonal perception was proposed by Ch. d'Alessandro and P. Mertens (Mertens & d'Alessandro, 1995; d'Alessandro & Mertens, 1995). The purpose of this stylization is to provide a representation which approximates the image in the listener's auditory memory. This tonal perception model was validated in listening experiments with stimuli resynthesized from the stylized contour (Mertens et al., 1997).

This approach may be used to obtain a low-level transcription of pitch level and pitch movement. It requires a segmentation of the speech signal into syllable-sized units, motivated by phonetic, acoustic or perceptual properties. Prosogram supports several types of segmentation.

The stylization is applied to the F0 curve of those segmented units, which are approximations of the more sonorous part of the syllable.

How does it work?

The analysis includes several processing steps.

How is it implemented?

The system is implemented as a Praat script. Praat is a tool for acoustic and phonetic research, written by Paul Boersma and David Weenink, of the Institute of Phonetic Sciences in Amsterdam. Praat was chosen because it is powerful, user-friendly, programmable, freely available, actively maintained, and runs on many platforms.

How to obtain the phonetic segmentation?

A suitable segmentation can be obtained in various ways.

Illustrations

A small corpus of spoken French was processed to illustrate the results obtained with the transcription tool. The corpus consists of about 4 minutes of an interview between Fayard and Benoîte Groult broadcast on Radio de la Suisse Romande.

Audio files

Transcriptions (Prosograms) (In Acrobat PDF format. When printing, use "Page Scaling: None")

PSOLA resynthesis from the stylized pitch contour.

  • Original samples: 1, 2
  • Using glissando threshold G=.16:
    Short samples: 1, 2
    Long sample: part 1 (2.166 MB)
  • Using glissando threshold G=.32:
    Short samples: 1, 2
    Long sample: part 1 (2.166 MB)

A closer look at tonal perception and stylization

Some F0 variations are clearly perceived as rises or falls; others go unnoticed except after repeated listening; still others are simply not perceived at all. Indeed, tonal perception depends upon several factors.

Our approach to stylization takes into account

The stylization shows the effect of a change of the model parameters on the estimated perceived pitch contour. This is shown in the next sample, which compares the F0 curve and two stylization variants: the first with G=0.16/T^2, the second with G=0.32/T^2, i.e. a glissando threshold twice as high. The (intravocalic) pitch movements found on "chefs" and "gieux", in the case of G=0.16/T^2, no longer appear in the stylization with G=0.32/T^2.
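The effect of the two settings can be illustrated with a small Python sketch. It assumes, following the usual psychoacoustic definition rather than anything stated on this page, that the glissando threshold is a rate of pitch change in ST/s and that T is the duration of the movement in seconds; a movement counts as an audible glide only when its rate exceeds G. The function names and the example values are hypothetical.

```python
def glissando_threshold(duration_s: float, g_coeff: float = 0.16) -> float:
    """Threshold G = g_coeff / T^2, assumed to be in ST/s with T in seconds."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return g_coeff / duration_s ** 2


def is_perceived_as_glide(interval_st: float, duration_s: float,
                          g_coeff: float = 0.16) -> bool:
    """True if the rate of pitch change (ST/s) exceeds the glissando threshold."""
    rate_st_per_s = abs(interval_st) / duration_s
    return rate_st_per_s > glissando_threshold(duration_s, g_coeff)


if __name__ == "__main__":
    # Hypothetical intravocalic rise of 2.5 ST over 100 ms (rate: 25 ST/s).
    for g in (0.16, 0.32):
        verdict = is_perceived_as_glide(2.5, 0.1, g_coeff=g)
        print(f"G={g}/T^2 -> threshold {glissando_threshold(0.1, g):.1f} ST/s, "
              f"glide kept: {verdict}")
```

With these illustrative values the movement lies between the two thresholds, so it would be kept as a glide under the lower setting and flattened to a level tone under the higher one, which is the kind of difference described for "chefs" and "gieux".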

In speech communication, utterances are heard only once. The listener has no time to reflect on the auditory properties of the signal. This differs from the situation of a listening experiment, where a stimulus is usually repeated several times and separated by silent pauses. How then should the glissando parameter be chosen in order to obtain a correct representation of pitch perception in continuous speech? By using (TD-PSOLA) resynthesized utterances in combination with stylizations for alternative parameter settings and presenting them to listeners together with a resynthesis of the original utterance, one can determine the glissando threshold for which listeners are unable to distinguish the stylized pitch contour from the original one. The setting with G=0.32/T^2 matches the performance of the listeners in continuous speech. To take into account the impact of a silent pause on the perception of the preceding pitch movement, the glissando threshold may be adjusted dynamically, depending on the presence of a pause.
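A dynamic adjustment of this kind could be sketched as follows in Python. The rule used here (the lower coefficient 0.16 for a syllable followed by a silent pause, 0.32 elsewhere) is an assumption made for illustration, as are the `Syllable` type and all names and values; the page does not specify how Prosogram implements the adjustment.

```python
from dataclasses import dataclass


@dataclass
class Syllable:
    label: str               # hypothetical syllable label
    duration_s: float        # duration of the syllabic nucleus, in seconds
    followed_by_pause: bool  # True if a silent pause follows the syllable


def adaptive_threshold(syl: Syllable,
                       g_before_pause: float = 0.16,
                       g_elsewhere: float = 0.32) -> float:
    """Glissando threshold g / T^2, with g chosen from the pause context.

    Assumption: a following pause makes the preceding pitch movement
    easier to perceive, so the lower coefficient is used there.
    """
    if syl.duration_s <= 0:
        raise ValueError("duration must be positive")
    g = g_before_pause if syl.followed_by_pause else g_elsewhere
    return g / syl.duration_s ** 2


if __name__ == "__main__":
    syllables = [
        Syllable("syl1", 0.20, followed_by_pause=False),
        Syllable("syl2", 0.20, followed_by_pause=True),
    ]
    for syl in syllables:
        print(f"{syl.label}: threshold {adaptive_threshold(syl):.1f}")
```

Keeping the two coefficients as parameters makes it easy to compare this adaptive rule with either fixed setting.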

Application to automatic transcription of intonation

The stylization by Prosogram has been used for automatic transcription of pitch contours and intonation.

A first type, called Polytonia (Mertens, 2014), indicates the pitch level and pitch movement of each syllable. Pitch levels are determined on the basis of the speaker's pitch range and of pitch intervals in the local context of a syllable and within the syllable.

A second type identifies positions in prosodic structure, such as stressed syllable, pre-stress syllable, and prosodic boundary, and reinterprets Polytonia's pitch levels and movements in terms of such positions. This approach is called ToPPos (for Tones on Prosodic Positions) (Mertens, to appear).

References

Publications on the Prosogram

Other references

Applications


Page created: 2002-06-20. Last updated: 2016-07-06.