Tempo modulations in English




Kirkham, Sandra Patricia




The goal of synthetic speech is to provide speech that is both comprehensible and natural sounding. While synthetic speech is drawing nearer to its goal, it has not yet attained a truly natural quality. Naturalness can be improved by incorporating prosodic rules for duration and intonation that are representative of natural speech. While duration models are widely used, they fail to replicate the variations evident in the tempo of natural speech. This project proposes a model of tempo modulations in English based upon phrasal foci. In order to replicate this pattern, the potential phonetic locations for altering the speech rate of English synthetic speech are explored. The results of a pilot study based on the readings of one speaker suggested that tempo modulations are predictable and not random, and that they are not expressed as equal expansions and compressions across all syllable constituents. Vowels, onsets, and codas exhibited varying degrees of change. These results motivated a study of the same phenomena in data derived from the readings of multiple speakers. The data for the main study were derived from two readings of each of five Canadian English sentences. The first reading varied the position of a focused word in the sentence and the second, only the tempo. Sentences that were neutral in terms of focus and tempo were included in both readings to create experimental controls. The readings were recorded and digitized to provide waveforms for duration measurement. Comparisons of average durations of focused syllables to the respective controls revealed significant differences given an alpha level of .05, providing evidence that a pattern of tempo modulations can be predicted. This pattern involved expansion and compression within the sentence. The pattern can be replicated using the results of the investigation of sites for tempo changes. 
The results reveal that at a fast tempo and a slow tempo, the durations of syllable constituents change significantly from the control at an alpha level of .01. The vowel, particularly one that comprises a syllable, is the primary site for expansion and compression. Stressed vowels have the largest compression, while unstressed vowels have the largest expansion. The degree of segmental change varies depending on the position of the syllable constituent. In stressed CVC syllables, codas and then onsets exhibit lessening degrees of compression. The reverse is true for expansion, and the degree of change for these constituents is less than that for compression. However, only stops in these positions show a significant change from the control. It appears that expansions and compressions of segments are ranked according to syllable constituency. These ranked expansions and compressions of syllable constituents can be incorporated into an existing duration model for synthetic speech in order to replicate the observed pattern of tempo modulations in English. This tempo pattern provides variation at a sentential level and is an improvement over rules for emphasis that are specific to the emphasized word or part thereof. The pattern is expressed by duration rules, and the addition of the criterion for syllable constituency increases the natural distribution of changes in tempo, providing a model to bring synthetic speech closer to its goal of naturalness.
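The ranking described above can be pictured as a small duration rule. The following is only a minimal sketch for a stressed CVC syllable: the function name `scale_syllable` and every scaling factor are hypothetical, chosen solely to reproduce the reported ordering (vowel, then coda, then onset for compression; vowel, then onset, then coda for expansion). They are not measurements from the study, and the sketch ignores the finding that only stops changed significantly.

```python
# Hypothetical scaling factors for a stressed CVC syllable.
# They only illustrate the ranking in the abstract and are NOT study data.
COMPRESSION = {"vowel": 0.80, "coda": 0.88, "onset": 0.94}  # fast tempo
EXPANSION = {"vowel": 1.15, "onset": 1.05, "coda": 1.03}    # slow tempo


def scale_syllable(durations_ms, tempo):
    """Scale onset, vowel, and coda durations for a given tempo.

    durations_ms maps "onset", "vowel", and "coda" to durations in
    milliseconds; tempo is "fast", "slow", or "neutral".
    """
    if tempo == "neutral":
        return dict(durations_ms)
    if tempo == "fast":
        factors = COMPRESSION
    elif tempo == "slow":
        factors = EXPANSION
    else:
        raise ValueError(f"unknown tempo: {tempo!r}")
    # Each constituent gets its own factor, so the syllable is not
    # stretched or squeezed uniformly.
    return {part: dur * factors[part] for part, dur in durations_ms.items()}


if __name__ == "__main__":
    control = {"onset": 70.0, "vowel": 120.0, "coda": 90.0}
    for tempo in ("fast", "neutral", "slow"):
        print(tempo, scale_syllable(control, tempo))
```

Keeping a separate factor per constituent, rather than one factor per syllable, is what lets such a rule express the unequal expansions and compressions that the study reports.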



English language, Intonation, Speech synthesis, Tempo (Phonetics), English language, Prosodic analysis