The voice and its timbre: individual, gender and age features
These quick notes are about voice timbre from the point of view of
signal processing. For the physiological and specifically acoustic
aspects of the human voice, please refer to the Psychoacoustics
classes and to general books on the subject (for instance,
Chapter 11, by Mario Uberti, in "Acustica Musicale e
Architettonica" - UTET).
The human voice
Physically, vocal emission is the result of the vibration of the
vocal cords, which produce an excitation that propagates through the
pharynx and the oral and nasal cavities. These cavities have their own
resonance modes, which spectrally "mould" the emitted sound and
give it a specific timbre.
From the point of view of signal processing, the excitation produced
by the vocal cords can be schematized as a sawtooth waveform (the
vocal cords operate roughly like a reed), that is, as a harmonic signal
whose partials decrease in amplitude as frequency increases.
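The sawtooth schematization above can be tried numerically. The following is only a minimal sketch, assuming an ideal sawtooth whose n-th partial has amplitude 1/n; the fundamental (110 Hz), the number of partials and the sample rate are arbitrary choices made for illustration, and the function names are hypothetical.

```python
import math

def sawtooth_partials(f0, n_partials):
    """(frequency, amplitude) pairs of an ideal sawtooth: amplitude 1/n for partial n."""
    return [(n * f0, 1.0 / n) for n in range(1, n_partials + 1)]

def synthesize(partials, sample_rate, n_samples):
    """Additive synthesis: sum one sine wave per partial."""
    out = []
    for i in range(n_samples):
        t = i / sample_rate
        out.append(sum(a * math.sin(2 * math.pi * f * t) for f, a in partials))
    return out

# Assumed values: 110 Hz fundamental, 20 partials (all below the Nyquist limit)
partials = sawtooth_partials(f0=110.0, n_partials=20)
signal = synthesize(partials, sample_rate=8000, n_samples=400)

# Level of each partial relative to the fundamental, in dB
levels_db = [20 * math.log10(a) for _, a in partials]
```

With amplitudes proportional to 1/n, each doubling of frequency lowers the level by about 6 dB, which is the "amplitudes decreasing with frequency" behaviour of this idealized source.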
If we restrict ourselves to vowels (which contribute most to the
specific timbre and individuality of a voice), the filtering effect of
the cavities (the shaping of the spectral profile) can be schematized
as a rather selective enhancement of some specific frequencies. It is
usual to consider at most 6 of these. These prominent frequencies are
called "formants". Every vowel has its specific sequence of formants,
and identifying a vowel actually consists in the unconscious
identification of its formants.
The formant frequencies are denoted by the letter F followed by a
number, in order of increasing frequency: F1, F2, F3, F4, F5 and F6. The
importance of the formants decreases as the order grows, and in fact
vowels can be identified, to a zeroth-order approximation, by the first
two formants alone (F1 and F2), or better, by their ratio
(i.e. F2 / F1).
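As an illustration of identifying a vowel from F1 and F2 alone, here is a minimal sketch of a nearest-neighbour lookup. The reference table is an assumption: it holds approximate textbook values for an adult male voice and only three vowels, so it should not be read as measured data. The function names are hypothetical.

```python
import math

# Approximate (F1, F2) in Hz for an adult male voice; illustrative values only
VOWELS = {
    "i": (270, 2290),
    "a": (730, 1090),
    "u": (300, 870),
}

def nearest_vowel(f1, f2):
    """Return the reference vowel closest to (f1, f2) on a log-frequency plane."""
    def distance(ref):
        r1, r2 = ref
        return math.hypot(math.log(f1 / r1), math.log(f2 / r2))
    return min(VOWELS, key=lambda v: distance(VOWELS[v]))

def formant_ratio(f1, f2):
    """The F2 / F1 ratio mentioned in the text."""
    return f2 / f1
```

Distances are taken on logarithmic frequencies so that a uniform scaling of both formants (a shorter or longer vocal tract) moves every vowel by the same amount; this is a design choice of the sketch, not something the notes prescribe.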
A virtual laboratory on some aspects of the human voice.
Thanks to Svante Granqvist, of the KTH in Stockholm, who has made
available some smart (simple, light and small) software tools, we can
easily carry out many interesting experiments. The page
of links to computer tools for signal processing also contains
the link to these smptools.
Let us now consider two of them: Madde, a singing
synthesizer, and RTSect, a real-time virtual oscilloscope and
spectrum analyzer.
Madde produces a synthesized singing voice (by additive synthesis),
and RTSect can be used to see in real time the signals produced by Madde.
How to use Madde and RTSect at the same time
Using Madde
Using RTSect
The voice. Individual, gender and age features.
What makes voices differ from one another? What, in particular,
makes the difference between a male voice, a female voice and a
child's voice?
The formants are resonances of the vocal tract, and they depend
greatly on the shapes assumed by the cavities (to the extent that we
are able to modify them in order to articulate vowels) and on their
dimensions. We obviously cannot change the latter, which therefore
mark the voice, through the formant profile, in a decisive way.
So we can expect higher formants for smaller dimensions: higher
formants for children, for instance. Women's vocal tracts are on average
about 15% shorter than men's, and their oral cavities are in general
smaller. As a result, their formants are higher than those of men (in
adults, roughly 20% higher). This is what makes the difference of
gender (and of age) in the timbre of the voice, not only the pitch,
which by itself is inadequate to explain gender and age timbre. A man
singing or speaking in falsetto will still show a masculine timbre
(sopranos and falsettists sound quite different). In Madde, this can be
tried out by modifying the "Factor" in the Formants pane. We can also
try setting the formants to frequencies taken from the data found in
some of the links quoted here.
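The dependence of the formants on vocal tract length can be sketched with the simplest textbook model, a uniform tube closed at the glottis and open at the lips, whose resonances are F_n = (2n - 1) c / (4 L). This is an assumption made for illustration (real vocal tracts are not uniform tubes); the 17.5 cm length, the 15% shortening and the speed of sound are likewise assumed round values. The `scale_formants` helper only mimics what a single scaling factor applied to all formants would do; it is not Madde's actual implementation.

```python
SPEED_OF_SOUND = 350.0  # m/s, approximate value for warm, humid air (assumption)

def tube_formants(length_m, count=3):
    """Resonances of a uniform tube closed at one end: F_n = (2n - 1) * c / (4 * L)."""
    return [(2 * n - 1) * SPEED_OF_SOUND / (4.0 * length_m)
            for n in range(1, count + 1)]

def scale_formants(formants, factor):
    """Multiply every formant by the same factor (a uniform change of tract size)."""
    return [f * factor for f in formants]

# Assumed lengths: a 17.5 cm tract and one that is 15% shorter
long_tract = tube_formants(0.175)
short_tract = tube_formants(0.175 * 0.85)

# In this model, shortening the tube by 15% is the same as scaling by 1 / 0.85
rescaled = scale_formants(long_tract, 1 / 0.85)
```

In this model the 17.5 cm tube resonates near 500, 1500 and 2500 Hz, and the shorter tube has every formant raised by the factor 1 / 0.85, that is by a little under 18%, which is in line with the "roughly 20% higher" figure given above.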
Link: This article is addressed to counsellors of transsexual people,
inviting them - in order to avoid future disappointments - to
make clear to their clients that even after sex reassignment surgery
the timbre of the voice will remain the original one. It cannot be
changed or simulated simply by modifying the pitch (i.e. by speaking
in falsetto): Acoustic
Correlates of Speaker Sex Identification - Coleman.
What is the "nasal" timbre? It is the appearance of nasal
formants at 200-300 Hz, 1 kHz and 2 kHz and of some "antiformants"
(antiresonances, that is, suppression of frequencies), a global loss of
power in the first formant and in the higher frequencies, together with
a decrease of the Q of all the formants (Eric Keller, University of
Lausanne: Tutorial Review: The Analysis of Voice Quality in Speech
Processing).
It is free software for speech analysis from SIL
International (Summer Institute of Linguistics), an
international organization devoted to advancing the knowledge of
little-known or unwritten languages, founded
in 1934 by William Cameron Townsend.
SIL makes available a wide
repository of software for linguistics, much of
which is free, for different platforms (Windows, MacOs, Linux,
PalmOS, Unix).
An alternative tool (more powerful and flexible) for
this kind of analysis is WaveSurfer, by the KTH in Stockholm, free
and multiplatform (Linux, Windows 95/98/NT/2K/XP, Macintosh, Sun
Solaris, HP-UX, FreeBSD, and SGI IRIX). It is based on SNACK, the sound
package for Tcl/Tk.
These tools can obviously also be used to analyze sounds of
musical interest, not only the singing voice.
A forensic point of view.
In forensic practice it is usual to submit recordings obtained by
wiretapping or interception to expert examination, in order to identify
and prove who the actual speaker was. As the frequency location of the
first formants depends greatly on the dimensions of the vocal tract, it
is quite clear that the formant values are hard to modify or
counterfeit. A speaker can thus in principle be identified by the
statistics of the positions of his/her formants. According to Manfred
Schroeder, these analyses are well suited only to excluding a
speaker, not to the contrary, because of the huge uncertainty in
the determination. As a result, he questions the use of the term
"voiceprints" (by analogy with "fingerprints")
as being misleading.
Link: LPC (Linear Predictive Coding) analysis divides
the whole signal into two components: an excitation and a formant
filter. It is therefore particularly suitable for the vocal signal.
Here is a tutorial
on LPC.
This device
instead uses this type of analysis in real time, with a corresponding
resynthesis, to modify the formants (besides the pitch) and,
consequently, the timbre of the voice. We can add, moreover, that it
can totally confuse any speaker identification system,
in one direction or the other. Namely, it can be used either to make
oneself unrecognizable or to have oneself recognized as a
different person.
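The LPC decomposition into an excitation and a formant filter can be sketched in a few lines. The following is only an illustration under strong assumptions, not the method of any tool cited here: the "voice" is a single resonance (a hypothetical formant at 700 Hz with a 100 Hz bandwidth) excited by one impulse, the analysis uses the autocorrelation method with the Levinson-Durbin recursion, and the order is 2 so that the resonance can be read directly from the two coefficients. Real speech needs a higher order, windowing and a root finder.

```python
import math

def resonator_impulse_response(freq, bandwidth, fs, n):
    """Impulse response of a two-pole resonator (one synthetic 'formant')."""
    r = math.exp(-math.pi * bandwidth / fs)
    theta = 2 * math.pi * freq / fs
    a1, a2 = -2 * r * math.cos(theta), r * r
    x = []
    for i in range(n):
        excitation = 1.0 if i == 0 else 0.0
        prev1 = x[i - 1] if i >= 1 else 0.0
        prev2 = x[i - 2] if i >= 2 else 0.0
        x.append(excitation - a1 * prev1 - a2 * prev2)
    return x

def autocorrelation(x, max_lag):
    return [sum(x[i] * x[i + lag] for i in range(len(x) - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the LPC normal equations. Returns (a, err) with a[0] = 1, where
    the prediction error is e[n] = sum_k a[k] * x[n - k]."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a, err

def pole_to_formant(a, fs):
    """Frequency and bandwidth (Hz) of the resonance of a 2nd-order filter."""
    a1, a2 = a[1], a[2]
    theta = math.acos(-a1 / (2 * math.sqrt(a2)))
    return theta * fs / (2 * math.pi), -math.log(a2) * fs / (2 * math.pi)

FS = 8000  # assumed sample rate
signal = resonator_impulse_response(700.0, 100.0, FS, 2000)
coeffs, residual_energy = levinson_durbin(autocorrelation(signal, 2), 2)
freq, bandwidth = pole_to_formant(coeffs, FS)
```

Here the filter coefficients carry the formant (its frequency and bandwidth), while the residual carries the excitation; modifying the coefficients before resynthesis is what changes the timbre.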
Link: This free program, from the Institute of Phonetic
Sciences of the University of Amsterdam, performs many kinds of
analysis and resynthesis of the voice, including the reconstruction
of the dimensions of the vocal tract
by means of formant analysis. It is Praat,
one of the best tools, if not the best, for this purpose.
The "singer's formant"
On this subject, nothing is better than quoting the abstract
of a paper by Sundberg:
The “singer’s formant” is a prominent spectrum
envelope peak near 3 kHz, typically found in voiced sounds produced by
classical operatic singers. According to previous research, it is
mainly a resonatory phenomenon produced by a clustering of formants 3,
4, and 5. Its level relative to the first formant peak varies
depending on vowel, vocal loudness, and other factors. Its dependence
on vowel formant frequencies is examined. Applying the acoustic theory
of voice production, the level difference between the first and third
formant is calculated for some standard vowels. The difference between
observed and calculated levels is determined for various voices. It is
found to vary considerably more between vowels sung by professional
singers than by untrained voices.
The center frequency of the singer’s formant as
determined from long-term spectrum analysis of commercial recordings
is found to increase slightly with the pitch range of the voice
classification.
Johan Sundberg, Voice Research Centre, Department of
Speech Music Hearing, KTH, Stockholm, Sweden, Level and Center
Frequency of the Singer’s Formant, Journal of Voice, Vol. 15, No. 2,
pp. 176–186
The reason for this prominent concentration of acoustic
power may lie in the relative weakness of the same spectral region
in the long-term spectra of operatic orchestras. The singer's formant
is thus a means of emerging over a large orchestra.
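The effect of clustering formants 3, 4 and 5, as described in the abstract quoted above, can be illustrated with a simple all-pole model in which each formant contributes one resonance curve and the spectrum envelope is their product. This is only a sketch under stated assumptions: the formant frequencies of both configurations and the common 100 Hz bandwidth are invented for illustration, and the source spectrum and lip radiation are ignored.

```python
import math

def resonance_gain(f, formant, bandwidth):
    """Magnitude response at f of one formant (a pole pair), with gain 1 at f = 0."""
    half_bw_sq = (bandwidth / 2.0) ** 2
    num = formant ** 2 + half_bw_sq
    den = math.sqrt(((f - formant) ** 2 + half_bw_sq)
                    * ((f + formant) ** 2 + half_bw_sq))
    return num / den

def envelope_db(f, formants, bandwidth=100.0):
    """Envelope level in dB at f: product of all the formant resonances."""
    gain = 1.0
    for formant in formants:
        gain *= resonance_gain(f, formant, bandwidth)
    return 20 * math.log10(gain)

def peak_level_db(formants, lo=2000, hi=4000, step=10):
    """Highest envelope level found between lo and hi (Hz)."""
    return max(envelope_db(f, formants) for f in range(lo, hi + 1, step))

# Hypothetical formant sets (Hz): evenly spread vs. F3-F5 clustered near 3 kHz
SPREAD = [500, 1500, 2500, 3500, 4500]
CLUSTERED = [500, 1500, 2700, 3000, 3300]

gain_from_clustering_db = peak_level_db(CLUSTERED) - peak_level_db(SPREAD)
```

With these assumed values the clustered configuration gives a clearly higher envelope peak in the 2-4 kHz region than the spread one, without any change in the excitation: the gain comes from the resonances reinforcing one another.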

(National Center for Voice and Speech, University of Iowa)
A research report on Sutherland's and Gruberova's voices, Geneva University.
A paper on the preference in choral singing for resonances close to the
singer's formant, International Journal of Research in Choral Singing.
Underlying complexity.
Please don't be deceived by this quick exposition: what we are
dealing with here is far from simple and far from settled. First, we have
restricted our attention to vowels, whereas the individuality of voices
is also based on further features (transients, as in the
consonants), much as happens in musical instruments, in which
the spectral shape is only one among the various components
of the timbre. Even restricting our attention to vowels, it must be
quite clear that these belong to a universe more populated than
any list that anyone can compile on the basis of his own
knowledge alone. Vowels are not an absolute datum, and they do not
depend only on the language: there are innumerable dialect and
sub-dialect variations. This
study deals with the differences between Pisan and
Florentine vowels, while this
one is a wide review of the work in progress on the study of
vowels (and of their formant representation) in relation to gender, age
and (very) specific dialect. The determination of the correspondence
between formants and vowels, beyond coarse subdivisions, is full of
uncertainties, due to the high variability and, probably, to the
intervention of different, not purely perceptual mechanisms (for
instance, semantic and cognitive ones) in the human process of vowel
identification.
As often happens, statistics are unable to describe and capture
phenomena in which human behavior, in a broad and manifold sense,
is critically involved. This complexity can explain the extreme slowness
with which, after an initial burst of progress, speech recognition
systems (STT - Speech To Text, or ASR - Automatic Speech
Recognition) are progressing today. They don't even come close to the
robustness of any human "recognizer" (namely an interlocutor).
The state of the art of the opposite systems (TTS - Text To Speech)
is instead quite different. They are able to read written texts aloud
automatically, at a good level of quality. In this field, the Italian
firm "Loquendo" (formerly CSELT, of the STET group, today owned
by Telecom Italia) deserves to be noted, together with its system
"Actor", which has marked notable progress in comparison with
the former "Eloquens". Here
you can try the capabilities of the system and compare it with the
previous one (which is "Mario, Robotic Voice" in the menu of
voices).
Keywords.
formants - "formants determination" - formants analysis
A laboratory for the analysis.