Fonilex pronunciation database: Introduction

The Fonilex pronunciation database




Introduction

Fonilex is a pronunciation database containing the phonetic transcription of the most frequent word forms of Dutch as spoken in Flanders. The database was compiled between January 1995 and July 1997 within the Fonilex research project, which was funded by the Flemish government through IWT.

Participants

Three academic research groups participated in the project:

Content

The Fonilex database is a list of over 200,000 Dutch word forms together with information about how they are pronounced in the Flemish-speaking part of Belgium. As such it is the first database of the pronunciation of standard Flemish.

The pronunciation information consists of two parts: first, an abstract representation of the pronunciation of a given word form, and second, the concrete pronunciation of this word form in three speaking styles.

The database is accompanied by a set of phonological rewriting rules, which were used to derive the phonetic (concrete) transcriptions from the abstract form. The rule system therefore accounts for most pronunciation variants. This rule-based approach also enables users to derive a particular pronunciation style (or a particular representation) to suit their own requirements, simply by adapting the rule system.
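The idea of deriving concrete transcriptions from an abstract form by ordered rewriting rules can be sketched as follows. This is only an illustration: the rule notation, the SAMPA-like symbols, the style labels "careful" and "casual", and the two rules (final devoicing and deletion of word-final n after schwa) are assumptions chosen for the example, not the actual Fonilex rule set or format.

```python
import re

# Hypothetical rules, written as (regex pattern, replacement) pairs.
# They are NOT the Fonilex rules; they only illustrate the mechanism.
FINAL_DEVOICING = (r"d$", "t")    # word-final d is realised as t
FINAL_N_DELETION = (r"@n$", "@")  # word-final n is dropped after schwa (@)

# Each (hypothetical) speaking style is an ordered list of rules.
RULES_BY_STYLE = {
    "careful": [FINAL_DEVOICING],
    "casual": [FINAL_DEVOICING, FINAL_N_DELETION],
}


def derive(abstract_form, style):
    """Apply the rules of one style, in order, to an abstract form."""
    form = abstract_form
    for pattern, replacement in RULES_BY_STYLE[style]:
        form = re.sub(pattern, replacement, form)
    return form


if __name__ == "__main__":
    for abstract in ("hOnd", "lo:p@n"):
        for style in RULES_BY_STYLE:
            print(abstract, style, derive(abstract, style))
```

In such a setup, adapting the output to other requirements amounts to editing the rule lists rather than the word entries themselves.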

A characteristic of Fonilex that distinguishes it from most other pronunciation databases is that it contains most pronunciation variants and has the capacity to generate additional ones. This is because, in Fonilex, pronunciation variants are not treated as (unwelcome) exceptions; on the contrary, they result from the application of phonological rules that describe general phonological processes.

Fonilex was compiled semi-automatically and verified manually. The initial form of the entries was obtained using grapheme-to-phoneme conversion, i.e. using a computer program that derives a representation of the pronunciation of a word or a sentence from its orthographic form. Each individual phonological transcription was then checked and, where necessary, corrected by hand. Because of the level of detail of the transcription (needed to account for pronunciation variants), the corrections were numerous. As the abstract representation evolved during the project and the notation changed accordingly, many of the entries had to be verified several times. This enormous task was performed mainly by one and the same person, which ensures consistency of transcription. We strove for a reliable database rather than a large one.

Applications

The major applications of phonetic databases such as Fonilex are in language and speech technology. Speech technology involves speech synthesis, speech recognition, or both. For instance, in text-to-speech systems, text in the memory of the computer is converted to a speech signal, so the user can listen to the computer instead of looking at the screen. In speech recognition, an audio signal, such as the user's voice, is identified as a sequence of words; this is used, for instance, in dictation systems, to type out sentences the user reads to the computer.

All sophisticated speech applications require a pronunciation database. In speech recognition, the database is needed to map a sequence of sounds onto a particular word form. In speech synthesis, it may be used to obtain the position of word stress, or to obtain the sequence of sound symbols. The latter task, known as grapheme-to-phoneme conversion, is generally handled without a complete pronunciation database, but the use of a phonetic database will significantly reduce the error rate.
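The role of the database in speech synthesis can be sketched as a lookup with a fallback: the transcription is taken from the lexicon when the word form is listed, and a grapheme-to-phoneme converter is used only for unknown words. The toy lexicon, the letter-to-sound table, and the function names below are all assumptions made for the illustration; a real converter is far more elaborate than a letter-by-letter mapping.

```python
# Toy lexicon in a SAMPA-like notation (illustrative entries only,
# not taken from Fonilex).
LEXICON = {
    "hond": "hOnt",
    "lopen": "lo:p@",
}

# Deliberately naive letter-to-sound table (an assumption for this sketch).
LETTER_TO_SOUND = {"a": "A", "o": "O", "e": "E", "i": "I"}


def naive_g2p(word):
    """Map each letter to a sound symbol; unknown letters are kept as-is."""
    return "".join(LETTER_TO_SOUND.get(letter, letter) for letter in word)


def transcribe(word):
    """Return (transcription, source), preferring the lexicon over G2P."""
    key = word.lower()
    if key in LEXICON:
        return LEXICON[key], "lexicon"
    return naive_g2p(key), "g2p"


if __name__ == "__main__":
    for word in ("Hond", "lopen", "kat"):
        print(word, transcribe(word))
```

Here "lopen" shows why the lexicon matters: the naive converter alone would produce "lOpEn", whereas the lexicon entry supplies the listed form.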

Speech technology has already proved its usefulness in reading and speaking aids for the visually handicapped, as well as in tools for hands-free computer interaction. It has now become clear that its use in general-purpose applications, such as dictation systems and messaging systems, will soon be widespread and that speech will become a primary means of computer interaction. Other applications will then emerge, such as the use of speech synthesis in language teaching and in education in general. In all these applications, it is crucial to have a phonetic database for those programs that require it.

Because Fonilex records also contain a cross-reference to the corresponding Celex database records, which among other things include morphological information, the phonetic database can be integrated into natural language processing systems. (As a matter of fact, it can be integrated into any language processing system for Dutch, provided a morphological analyser is applied to the spelled word form.)
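As a rough illustration of how such a cross-reference might be used, the sketch below joins a pronunciation record to a table of morphological information through a shared record identifier. The field names, the identifiers, and the sample entries are hypothetical; they do not reflect the actual Fonilex or Celex record layouts.

```python
from dataclasses import dataclass


@dataclass
class PronunciationRecord:
    word_form: str
    transcription: str  # SAMPA-like notation, illustrative only
    celex_id: int       # hypothetical cross-reference field


# Hypothetical morphological table keyed by the same identifier.
MORPHOLOGY = {
    1: {"lemma": "hond", "features": "noun, singular"},
    2: {"lemma": "hond", "features": "noun, plural"},
}


def with_morphology(record, morphology):
    """Combine a pronunciation record with its morphological information.

    The "morphology" value is None when the identifier is not in the table.
    """
    return {
        "word_form": record.word_form,
        "transcription": record.transcription,
        "morphology": morphology.get(record.celex_id),
    }


if __name__ == "__main__":
    record = PronunciationRecord("honden", "hOnd@", 2)
    print(with_morphology(record, MORPHOLOGY))
```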

Finally, the Fonilex database will be useful for basic research on the phonology (and morpho-phonology) of Dutch, where it provides a systematic list of the language facts that phonological theories have to account for. The digital form of the database enables the linguist to verify hypotheses quickly and systematically, rather than manually, as used to be the case. As a matter of fact, this approach was already applied in the construction of Fonilex itself, and to some extent determined the shape of the abstract representation.

Availability

The Fonilex database is available for scientific or commercial use. Please print and complete the licence agreement, add your signature, and return it to the project coordinator by surface mail. You will receive a password for downloading the database.

Acknowledgements

We are grateful to Max Piepenbrock from the Max-Planck-Institut (Nijmegen) for making available the list of word forms of Celex, and for help with the changes needed for the spelling reform.
This file is maintained by Piet Mertens
Last updated: November 21, 2001