Prosogram, v2.12b

Transcription of prosody
using pitch contour stylization based on a tonal perception model
and automatic or annotation-based segmentation

by Piet Mertens


Compact Prosogram
X-axis: time (in s). Y-axis: perceived pitch (in semitones, ST); dotted grid lines are 2 ST apart.
The vertical dotted lines show segmentation boundaries.
Large Rich Prosogram
The large format adds the calibration of X and Y axes (in ST, relative to 1 Hz).
Rich format adds intensity (thin green line) and F0 (thick blue line) on a ST scale, for validating the results.
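The semitone scale used on the Y-axis can be illustrated with a small conversion routine. This is a minimal sketch, not code taken from the Prosogram script: the function names are hypothetical, and the only thing taken from the text above is the reference frequency of 1 Hz. It relies on the standard definition of the semitone as one twelfth of an octave.

```python
import math


def hz_to_semitones(freq_hz, ref_hz=1.0):
    """Convert a frequency in Hz to semitones (ST) relative to ref_hz.

    One octave (a doubling of frequency) equals 12 ST, so the
    conversion is 12 * log2(freq / ref). The default reference of
    1 Hz follows the axis calibration described in the caption.
    """
    if freq_hz <= 0 or ref_hz <= 0:
        raise ValueError("frequencies must be positive")
    return 12.0 * math.log2(freq_hz / ref_hz)


def semitones_to_hz(st, ref_hz=1.0):
    """Inverse conversion: semitones relative to ref_hz back to Hz."""
    return ref_hz * 2.0 ** (st / 12.0)


if __name__ == "__main__":
    # Illustrative values only; they are not measurements from the corpus.
    for f in (100.0, 200.0):
        print(f, "Hz =", round(hz_to_semitones(f), 2), "ST re 1 Hz")
```

On this scale a doubling of F0 always spans 12 ST, whatever the starting frequency, which is why grid lines placed 2 ST apart mark equal musical intervals across the whole plot.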
Automatic segmentation
Large Rich Prosogram
This prosogram uses automatic segmentation based on intensity of band-pass filtered signal.


Since the early days of intonation research, automatic transcription of the intonation in speech corpora has been on the wish list of many a researcher in phonetics, linguistics, and discourse analysis.

Most phoneticians use the fundamental frequency (F0) to represent pitch contours in speech. F0 is an acoustic parameter; it provides useful information about the acoustic properties of the speech signal. But it certainly is not the most accurate representation of the intonation contour as it is perceived by human listeners.

In the 1970s, pitch contour stylization was introduced as a way to reduce the F0 curve to those aspects which are relevant for speech communication. The approach originates from work by J. 't Hart, R. Collier, and A. Cohen at IPO, Eindhoven, and was further improved by D. Hermes in the 1980s and 1990s. Other types of stylization have been proposed, such as the Momel system by D. Hirst and R. Espesser of Aix-en-Provence. However, most of these stylization approaches are based on statistical or mathematical properties of the F0 data and largely ignore the facts of pitch perception.

It is well known that the auditory perception of pitch variations depends on many factors other than F0 variation itself. In 1995 a stylization based on the simulation of tonal perception was proposed by Ch. d'Alessandro & P. Mertens (Mertens & d'Alessandro, 1995; d'Alessandro & Mertens, 1995). The purpose of this stylization is to provide a curve which approximates the auditory image in the listener's mind. This tonal perception model was validated in listening experiments with stimuli resynthesized from the stylized contour (Mertens et al., 1997).

The same approach may be used to obtain a transcription of intonation. This requires a segmentation of the speech signal into syllable-sized units that are motivated by phonetic, acoustic or perceptual properties. The prosogram may be used in conjunction with various types of segmentation:

The stylization is applied to the F0 curve of those segmented units, which are approximations of the syllabic nuclei.

How does it work?

Processing steps.

How is it implemented?

The system is implemented as a Praat script. Praat is a tool for acoustic and phonetic research, written by Paul Boersma and David Weenink of the Institute of Phonetic Sciences in Amsterdam. Praat was chosen because it is powerful, user-friendly, programmable, freely available, actively maintained, and runs on many platforms.

How to obtain the phonetic segmentation?


A small corpus of spoken French was processed to illustrate the results obtained with the transcription tool. The corpus consists of about 4 minutes of an interview between Fayard and Benoîte Groult, broadcast on Radio de la Suisse Romande.

Audio files

Transcriptions (Prosograms) (In Acrobat PDF format. When printing, use "Page Scaling: None")

PSOLA resynthesis from the stylized pitch contour.

  • Original samples: 1, 2
  • Using glissando threshold G=.16:
    Short samples: 1, 2
    Long sample: part 1 (2.166 MB)
  • Using glissando threshold G=.32:
    Short samples: 1, 2
    Long sample: part 1 (2.166 MB)

A closer look at tonal perception and stylization

Some F0 variations are clearly perceived as rises or falls; others go unnoticed except after repeated listening; still others are simply not perceived at all. Indeed, tonal perception depends upon several factors.

Our approach to stylization takes into account:

The two latter thresholds are model parameters, which can be adjusted. The stylization will show the effect on the estimated perceived pitch contour. This is shown in the next sample, which compares the F0 curve and two stylization variants: the first with G=0.16/T^2, the second with G=0.32/T^2, i.e. a glissando threshold twice as high. The (intravocalic) pitch movements found on "les", "chefs" and "gieux" with G=0.16/T^2 no longer appear in the stylization with G=0.32/T^2.
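The effect of doubling the glissando threshold can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the Prosogram implementation: it assumes that T is the duration of the pitch movement in seconds, that the threshold G/T^2 is a rate of pitch change in ST/s, and that a movement counts as a perceived glide when its rate exceeds that threshold. The function names and the example values are hypothetical.

```python
import math


def glissando_threshold(duration_s, g=0.16):
    """Threshold rate of pitch change (ST/s) for a movement lasting duration_s.

    Assumption: the threshold has the form g / T^2, with T in seconds,
    as in the settings G=0.16/T^2 and G=0.32/T^2 quoted above.
    """
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return g / duration_s ** 2


def is_perceived_glide(start_hz, end_hz, duration_s, g=0.16):
    """Return True if the pitch movement exceeds the glissando threshold."""
    # Size of the movement in semitones (12 ST per octave).
    interval_st = 12.0 * math.log2(end_hz / start_hz)
    # Rate of change in ST/s, regardless of direction (rise or fall).
    rate_st_per_s = abs(interval_st) / duration_s
    return rate_st_per_s > glissando_threshold(duration_s, g)


if __name__ == "__main__":
    # Hypothetical movement: 120 Hz to 135 Hz within a 100 ms vowel.
    for g in (0.16, 0.32):
        print("G =", g, "->", is_perceived_glide(120.0, 135.0, 0.100, g))
```

With these hypothetical values the rise is about 2 ST in 0.1 s, i.e. roughly 20 ST/s. That exceeds the threshold of 16 ST/s obtained with G=0.16 but not the 32 ST/s obtained with G=0.32, so the same movement would be stylized as a glide in the first case and as a level tone in the second.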

In normal conversation, utterances are heard only once. Given the continuous flow of speech, the listener has no time to reflect on the auditory properties of the signal. How then should the glissando parameter be chosen in order to obtain a correct representation of pitch perception in continuous speech? The intonation of the Groult corpus was transcribed by ear by two expert transcribers. This auditory transcription was compared with various stylizations obtained with different settings of the stylization parameters. The setting with G=0.32/T^2 matches the performance of the human transcribers more closely.


Publications on the Prosogram

Other references


Page created: 2002-06-20. Updated: 2002-09-19, 2002-10-02, 2003-07-06, 2004-02-29, 2005-02-08, 2007-04-21, 2009-01-06, 2009-11-08, 2011-05-01.