FUNDAMENTALS OF DIGITAL SOUND
(formerly “Computers in Musical Applications”)
MAURICE WRIGHT
In such simulations, after you had reduced the sound to digital form, you reconstructed
the sound from a stream of computer-processed digits—it occurred to some of us,
especially to Max Mathews and me: Why couldn’t you just put in a stream of digits and
get out any sound that you wished?
John R. Pierce, interviewed in 1979
***
Copyright © 2018 by Maurice Wright
All rights reserved
No portion of the text or illustrations may be reproduced or transmitted
without written consent.
Contents
I THINGS ELECTRIC
II MUSICAL ACOUSTICS
III A FEW WORDS ABOUT COMPUTERS
IV SAMPLING THEORY
V SOUND RECORDING AND EDITING
VI SYNTHESIS
The Analytical Engine has no pretensions whatever to originate anything.
It can do whatever we know how to order it to perform.
Lady Ada Lovelace
If a machine is expected to be infallible, it cannot also be intelligent.
Alan Turing
CHAPTER I
THINGS ELECTRIC
Basic Principles
Electricity is everywhere. Smart people have a healthy fear of
electrical shock, but have learned to use electricity safely (with a few
tragic exceptions). Because everything the technologist does will most
certainly involve electricity, it makes sense to know something about it.
The basics are easily understood, and can conveniently be expressed with
mathematical formulas.
Components of electrical circuits are partly analogous to those of
plumbing. As you open a faucet, water begins to flow, and flows with
greater volume as the faucet is opened further. More water can flow
through a large pipe than a small one. Unlike air, water cannot be
compressed. Why does water come out of your faucet? Because it is
under pressure. Municipal water systems pump water out of reservoirs or
wells and into large storage tanks high enough above the highest point of
the system so that gravity provides the water pressure.
A water company tries to provide between 50 and 100 pounds per
square inch (PSI) of water pressure. Each foot of height provides 0.43 PSI,
so water tanks must either be situated on a hill or on top of a tower.
When the water falls below a certain point in the tank, the pumps are
switched on and more water is pumped into the tank, maintaining an
almost constant pressure. Eventually the water evaporates and falls as
rain, refilling the reservoir or the underground aquifer. A typical water
tank is shown in figure 1.
Figure 1. A water storage tank sits atop a tall tower
to provide a source of water under pressure.
As water enters a house it flows through a water meter, which
measures the number of gallons of water taken from the water company
into the house. From this one point, the water supply branches out to
eventually get to your faucet. Assuming that the water supply is ample
(no open hydrants, for example), the water pressure will be the same
entering all the pipes in your house. However, more water can flow
through a large diameter pipe than a small one, as illustrated in figure 2.
Figure 2. More water flows through a large pipe than a small one.
Water under pressure can be made to do work, like turning a paddle
wheel. When this happens the water gives up some of its energy.
To complete the analogy, replace the elevated water tank with a
source of electricity such as a battery, and instead of pipes, think of
wires. Electricity runs from the battery through the wire, and can branch
to multiple wires as well. It can be made to do work, and can be controlled
with switches and other controllers.
Lose the water
By now the analogy has served its purpose. Unlike water, electricity
is not usually visible. Furthermore, whenever electricity flows, a magnetic
field is generated. A grouping of electrical components is called a circuit,
because the electricity always flows in a complete circuit, returning to the
battery’s opposite pole. A physical battery can take on a variety of forms,
but is always represented on a circuit diagram as one or more pairs of
lines of alternating thickness, shown in figure 3.
Figure 3. Common electrical components and their symbols
Similarly, other components have one or more symbols to represent
them in circuit diagrams. One of the earliest practical uses of electricity
was lighting, and the once ubiquitous incandescent light bulb makes an
excellent first circuit to discuss. Examine figure 4.
Figure 4. A simple lighting circuit
First, notice a different symbol for the light bulb than the one used in
figure 3. Also the “+” sign is omitted in figure 4. The idea is simple
enough: electricity flows through the wires to a switch, then on to the
bulb, then back to the battery. In some countries a switch like this is
called an “interrupter.” When the switch is closed, electricity flows
through the light bulb, and the tungsten filament heats up until it glows.
Electricity is thus converted to heat and light.
What if you want more than one light? There are two ways to add
an additional light to the circuit, and they are shown in figure 5.
Figure 5. Parallel and series connections
In the parallel connection both bulbs receive the same flow of electricity
and should glow as brightly as the single bulb in figure 4. In the series
connection, however, each bulb reduces the amount of electricity
available to the other, and the light is dimmer. Furthermore, removing one
of the bulbs will not affect the other in the parallel circuit, but will
interrupt the flow to the remaining bulb in the series circuit.
You could also arrange a circuit with two switches, as shown in
figure 6. With switches in parallel, either switch (or both) could close the
circuit. In the series connection, both switches have to be closed.
Figure 6. Parallel and series switch connections
This simple switch is called a single pole single throw (SPST) switch
because it interrupts only one path of electricity at a time,
and is either on or off. Many household light switches are SPST, although
you should never take the cover off the switch to inspect it because
household voltage can be very dangerous. Another interesting switch is
the single pole double throw (SPDT). It can be used to select one of two
lights, as shown in figure 7.
Figure 7. A double throw switch chooses between two lights
The most common use of the SPDT is in the “three way” switch, used
when one wants to control one light from two locations. Such switches
should not be labeled “on” and “off” because either switch can change
the function of the other.
Figure 8. Lighting circuit with a “three way” switch
Resistance Is Necessary
Imagine (but do not construct) the circuit shown in figure 9. When
the switch is closed, the electricity will flow unchecked from one end of
the battery to the other. This is a “short” circuit, and is never safe. The
wire can become very hot, perhaps even melt. At minimum the battery’s
energy is wasted; at worst it may overheat and explode, releasing the
volatile chemicals inside.
Figure 9. A short circuit always presents a hazard.
To protect against a short circuit, homes, appliances, and electrical
equipment use circuit breakers and fuses to disable the circuit if a short
circuit is detected.
Circuits contain elements that resist the flow of electricity. They
are called resistors and are represented in diagrams by a zig-zag line, seen
in figure 10.
Figure 10. Circuit symbols for resistors
Resistance is present in every circuit, whether in the form of
resistors or in the resistance inherent in other components, such as the
light bulb.
To summarize, electricity can be made to flow through a wire,
similar to the way water can be made to flow through a pipe. Electricity flows
from a source to a resistance then back to the receiving end of the
source. A battery is a simple source of electricity. Switches can interrupt
a circuit or route electricity among circuits. Multiple resistances can be
connected in series or in parallel. Short circuits must be avoided.
Quantities and Relations
Before actually building a circuit with a soldering iron, shown in
figure 11, it is useful to know the names of the electrical properties in
which we are interested, and how those properties interact with one
another.
Figure 11. Use eye protection when soldering
The unit of the force that pushes electricity through a circuit is the volt,
named for Alessandro Giuseppe Antonio Anastasio Volta, the inventor of the electrical battery.
The unit of resistance that inhibits the flow of electricity is the ohm,
named for Georg Ohm, who figured out the relation among voltage,
resistance and the flow of electrical current. The unit of current flow is
the ampere, named for André-Marie Ampère.
Georg Ohm proved that the Intensity of current in amperes, “I”, is
the ratio of the Electromotive force in volts, “E”, to the Resistance in
ohms “R”. His famous law thus reads I = E/R. If the resistance remains
stable but the voltage is increased, the current flow increases. If the
voltage remains the same but the resistance increases, the current flow
decreases. If you multiply both sides of the equation by R:
RI = RE/R, and cancel the R’s on the right, you get RI = E, or (flipped
around) E = IR. If you know the amount of current flowing, and the
resistance through which it flows, you can calculate the voltage that is
pushing. Divide both sides of this equation by I, and you get RI/I = E/I, or
R=E/I. If you know the voltage and the current flow you can calculate the
resistance.
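These manipulations are easy to try on a computer. Here is a small
sketch in Python (the language and the particular values are our own
choices for illustration):

    # Ohm's law in its three forms: I = E/R, E = I*R, R = E/I
    E = 9.0          # electromotive force in volts (say, a 9-volt battery)
    R = 450.0        # resistance in ohms
    I = E / R        # current in amperes
    print(I)         # 0.02 A, that is, 20 milliamperes
    print(I * R)     # recovers E: 9.0 volts
    print(E / I)     # recovers R: 450.0 ohms

Whichever two quantities are known, the third follows from the same
relation.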
This simple and versatile formula governs all circuits when the
direction of current flow is one way only. This is a direct current (DC)
circuit. Ohm’s law expresses the relations among volts, ohms and amperes
(amps). Because some current flows are tiny, and resistances huge,
prefixes such as milli, micro, kilo, and mega are used to scale the values
accordingly. Figure 12 names the first 8 standard prefixes (there are
more!).
Figure 12. Some common metric prefixes
The unit of electrical Power is the watt, which is defined as the
product of the electromotive force (in volts) and the intensity of current
(in amperes): P = IE.
Electricity is sold in units called kilowatt hours; one kilowatt hour is 1000
watts of power consumed for one hour.
Look at what happens to power if you double the voltage in a
circuit while keeping the resistance constant. To keep the math simple,
imagine 1 volt pushing against a resistance of 1 ohm. How much current flows? Using
I = E/R we calculate one amp of current. To find the wattage, multiply the
voltage (1) times the current (1), to get 1 watt. Now double the voltage,
and calculate the current using I = E/R. With the voltage pushing twice as
hard, the current flow doubles, to 2 amps. When you calculate wattage
and multiply 2 by 2, you get 4 watts. The power is thus equal to the
square of the voltage divided by the resistance. This relation will come
back to haunt us as we try to define the decibel.
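A few lines of Python make the pattern explicit (a sketch with made-up
values; any fixed resistance shows the same square law):

    # Doubling the voltage across a fixed resistance quadruples the power
    R = 1.0                       # ohms
    for E in (1.0, 2.0, 4.0):     # volts
        I = E / R                 # Ohm's law: current in amperes
        P = I * E                 # power in watts
        print(E, P, E**2 / R)     # P always equals E squared over R

Each doubling of E multiplies P by four, which is the square-of-the-voltage
relation just described.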
Go with the flow (or not)
So far we’ve talked about electrical circuits connected by wires.
Why not use strings or rubber bands? Some materials do not conduct
electricity. They are called insulators, of which glass is a good example.
Ceramics and plastics are insulators unless they are coated with a metallic
layer. Metals are conductors, and copper is the conductor of choice for
most electrical applications because it is highly conductive and flexible. It is
the conductivity standard by which other metals are measured. Figure 13
illustrates the “resistivity” of several metals.
Figure 13. Resistance of common metals
Most wiring used in audio equipment employs a copper conductor
with a plastic insulating covering. When high flexibility is required, as in
microphone cables, stranded cable is used. Inside equipment, solid wire
with plastic insulation is used because it is not likely to move much.
Multiple wires, each with its own insulation, can be grouped into a single
cable.
The diameter of the copper wire affects its resistance and its
weight. A look at figure 14 shows that as the diameter increases, the
resistance falls, but the weight increases.
Figure 14. Resistances for various diameters of copper wire.
“AWG” stands for American Wire Gauge, and is the standardized
measure for copper wire. A tiny 40 gauge wire has a resistance of one
ohm per foot, and would be used only where very tiny current flows will
occur. Current pushed through a resistance generates heat, so too much
current will make the wires too hot—another reason to avoid short
circuits. Even an overloaded circuit can cause an electrical fire.
Magnetism: An Invisible Force
Magnets create a magnetic field. Although you cannot see a
magnetic field, you can see its effects. Magnetism is a polarized force,
meaning that a magnet’s orientation affects the way its force is directed and
how it interacts with other magnets. Opposite poles attract, similar poles
repel. The two poles of a magnet are labeled “north” and “south.” The
earth itself has a magnetic field. If you place a bar magnet inside a little
boat and float it in a tub of water, the magnet’s north pole will point to
the earth’s magnetic north pole, hence the notion of a north-seeking
pole.
Ferrous metals such as iron can be made to hold a magnetic charge
and are called permanent magnets. Other metals can be briefly magnetized,
while still others are impervious to a magnetic field. Of great importance
to our study of electricity is the fact that when electricity flows through a
conductor, it creates a magnetic field. Conversely, when a conductor
passes through a magnetic field, a current is induced into the conductor.
There is a major difference between an electric current producing a
magnetic field and a magnetic field producing a current. The magnetic
field remains constant as long as the electricity flows, but a current is
produced only when the magnetic field is changing.
Alternating Current (AC)
AC refers to an electric current that changes direction. AC is widely
used in commercial power generation and household appliances, and is
very useful in constructing an analog of sound waves. Many AC circuits
are modeled using a sine wave, even though pure sine waves are rare.
Because the direction of current changes periodically, so does the
magnetic field created by the current. And, since the magnetic field
changes direction when the current does, it can induce a voltage into a
conductor within that field. This is the principle of the transformer,
pictured in figure 15 along with its electrical circuit symbol.
Figure 15. An electrical transformer and its circuit symbol.
Current flowing into the transformer (called the Primary coil)
creates a magnetic field that induces a current into the Secondary coil.
Some transformers have a ferrous metal core (the dashed lines on the
circuit symbol), while others have no metal core.
If the secondary coil has more turns of wire than the primary, the
transformer is described as a “step up” transformer because the voltage
at the secondary is higher than the voltage at the primary. Step-down
transformers are used to reduce the distribution line voltage (thousands
of volts) down to the 110/220 volts used in a typical American house. If
the primary voltage is 2200 volts, and the ratio of turns of wires of
primary to secondary is 10 to 1, then the secondary voltage will be 220
volts. However, the transformer is usually “center tapped,” meaning that
the secondary is divided into two parts by a “tap” in the center coil.
Figure 16 shows the circuit diagram of a center tapped transformer. If
this were a typical transformer providing power to a house, the voltage
across either half would be 110 volts (used for most appliances and
equipment) but the voltage across the entire secondary (ignoring the
tap) would be 220 volts (used for high consumption appliances such as
air conditioners, ovens, and clothes dryers).
Electronic circuits often require a power source of about 6 volts
DC. A transformer can step down the household voltage to 6 volts AC,
then other circuit components transform the AC to DC. Transformers are
sometimes used to isolate one part of a circuit from another, or to match
electrical characteristics of different components or circuits.
Figure 16. The secondary of a center tapped transformer is divided
in half to produce two voltages of opposite polarity.
Inductors and Capacitors
Now consider half a transformer, just a coil of wire that might or
might not have a ferrous core. When a voltage is first applied to it, the
growing current builds a magnetic field that opposes the flow, so at first
little current passes through the coil. Once the magnetic field is
established, the coil acts like a simple conductor and current can pass.
When the voltage source is turned off, the magnetic field dissipates,
inducing a current into the coil. Eventually the magnetic field falls to zero,
and the current stops flowing. This cycle is graphed in figure 17, taken
from the TDK corporation’s Knowledge Box.
Figure 17. The current response of an inductor in a DC circuit
The concept of storing energy in an inductor is opposite to that of
storing energy in a capacitor. If two sheets of metal are positioned very
close but not touching (imagine two pizza pans separated by a large piece
of waxed paper), and a DC voltage is applied to the two plates (the
general term for our pizza pans), current will flow for a brief period of
time! That’s because some energy is stored in an electrostatic field, with
positive charge collecting on one plate and negative charge on the other.
Once the capacitor is all charged up the current stops flowing. In figure
18, the red lines show the large flow of current into the capacitor as it
charges, after which no current flows. The blue line shows that the
capacitor has actually stored a voltage. When the capacitor is discharged,
the current flows quickly at first, but eventually falls to zero as the
voltage is reduced.
Figure 18. The current response of a capacitor in a DC circuit
A capacitor blocks direct current (after a brief time), but an
inductor passes direct current (also after a brief time). A capacitor
passes alternating current, but an inductor blocks alternating current.
The unit of inductance is the henry, and the unit of capacitance is the
farad.
A Simplified Power Supply
To convert the stepped down AC from a power transformer into 6
volts DC, we need one other component, the diode. A diode allows
current to flow in one direction but not the other, similar to the sign in
Figure 19.
Figure 19. A diode allows current to move in only one direction
The circuit symbols for the components used to fashion a simple
power supply are shown in figure 20. This is a hypothetical power
supply, dreamed up to illustrate the properties of components that are
important for AC circuits. Do not try to build anything that connects to
the 120 volt wall socket until you are an experienced circuit builder with
someone to check your work.
Figure 20. Circuit symbols used in a simple power supply
The circuit in figure 21 begins with a step-down transformer. The diode
lops off the negative half of the AC cycle, then the inductor helps to fill the
gap between cycles, and the capacitor strains out any residual ripple. The
result is direct current.
Figure 21. Voltage plotted against time at various stages in
a simplified power supply
Although resistors offer a constant value regardless of frequency,
inductors and capacitors “push back” differently as frequencies vary. In
general inductors pass low frequencies but block high frequencies,
whereas capacitors block low frequencies but pass high frequencies. This
frequency dependent characteristic of inductors and capacitors is called
reactance. When resistance and reactance are combined, the result is
called impedance, and is measured in ohms.
AC circuits are significantly more complicated than DC circuits, but
their ability to represent physical oscillation makes them essential to all
sound technology.
CHAPTER II
MUSICAL ACOUSTICS: A BRIEF REVIEW
On the nature of sound
Figure 1. A tree falls in the forest
What is sound? Sound is our subjective impression of the episodic
perturbation of air molecules. “Subjective impression” because we are
only interested in sound that we can perceive, at least for musical
purposes. Our hearing mechanism, from ear to auditory nerve to brain,
evolved to provide maximum sensitivity to the sound of human speech,
and some sensitivity to other sound as well. Music makes use of a wide
range of sound. “Episodic perturbation” means a transient disturbance of
some sort instead of the constant pressure of the atmosphere. “Air
molecules” are an important part of the definition because they usually are the
medium through which sound travels.
By this definition, the sound of a tree that falls in the forest, e.g.
that of Figure 1, is of interest only if we are present to hear it, or if it is
recorded for later listening.
A sound may be loud or soft depending on the displacement of the
air molecules. A sound may be high pitched or low pitched depending on
the rate at which the molecules change position. A sound may be
recognizable if we have heard similar sounds often before, strange if we
have not. Sounds have at least two dimensions: the acoustical, which
concerns the vibration of matter and the transmission of the news of that
vibration through a medium, and the perceptual, which accounts for our
subjective response to the acoustical information.
The earth is covered with a blanket of air. Air is a combination of gases
whose molecules, arranged in a random but uniform distribution, are in
constant motion, continually jostling against one another. The weight of
the blanket of air at the earth's surface is about 15 pounds per square
inch, 2160 pounds per square foot, or 30,108,672 tons per square mile!
Sound events correspond to very small changes in atmospheric pressure
that occur at a rate that interests our auditory mechanism. Changes in
pressure caused by the motions of weather systems occur too slowly to
be heard. Similarly, the random dance of an individual molecule is
canceled by the random action of the group, or else creates too small a
pressure change to be noticed.
Figure 2. An exploding guitar
If we pluck a guitar string or detonate an explosive charge–or
perhaps both at once, as shown in Figure 2–the mechanical energy of the
sound source is transferred to the air molecules that surround it, creating
contrasting areas of compression and rarefaction. The molecules first
affected transmit energy to the molecules adjoining them and those
molecules transmit energy to the next layer, and so on. During each
collision, some of the energy is lost in friction and converted to heat. The
sound wave also covers an ever increasing area, so its energy is divided
among an increasing number of molecules. Thus the sound wave
eventually dies out, in the sense that it becomes inaudible.
The molecular disturbance that is the sound wave propagates at a
rate of about 1128 feet per second, depending on temperature and
humidity, and, since the energy is continually converted to heat and the
remaining energy is distributed over an expanding surface area, the
intensity of the sound decreases rapidly as the distance from the source
increases.
Most musical sounds have more in common with the sound of a
guitar string than an explosive detonation. The plucked string exhibits
vibratory motion as the string crosses back and forth past its initial
resting position. Similar to the way a pendulum’s swing measures a regular
time interval, the motion of the plucked string measures a regular time
interval, usually much shorter than a pendulum’s, called the period
of vibration (often measured in seconds, or milliseconds). When repetitive
musical events occur at low rates, say 2 to 4 times per second, we hear
them as rhythmic events, but when the repetition occurs faster than 30
times per second we begin to sense a musical pitch. The rate of repetition
of a regular event is called frequency (usually measured in cycles per
second or hertz), and is inversely proportional to the period of vibration:
f = 1/p.
Many underlying principles of sound are most accurately and
efficiently expressed as mathematical formulas or equations, particularly
proportions such as distance traveled equals speed times time. The parts
of the equations are variables, constants and operators. To change
temperature from centigrade to Fahrenheit, for example, you multiply the
centigrade temperature (a variable) by 9/5 (a constant) and add 32
(another constant). The equation for temperature conversion is F =
9C/5 + 32, or F = 1.8C + 32, or F = 32 + 1.8C.
In an inverse proportion, one or more of the variables is found in the
denominator (lower term) of a fraction. So we see that in f = 1/p, where f
stands for frequency and p for period, as the period of vibration gets
shorter, the frequency increases.
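A one-line computation confirms the relation (a Python sketch; the
50-millisecond period is chosen to match the rhythm/pitch boundary
discussed below):

    # Frequency is the reciprocal of the period: f = 1/p
    p = 0.050        # period in seconds (50 milliseconds)
    f = 1.0 / p
    print(f)         # 20.0 Hz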
Figure 3. Summer morning by a Lake
Whether a repeating event is perceived as a rhythm or a pitch
depends on its period of repetition, and is an attribute of the human
auditory mechanism. The area wherein one kind of perception changes to
the other is around 20 hertz (Hz), or 50 milliseconds. Imagine a quiet
sylvan setting, say a summer morning by a lake. Overwhelmed by the
slowly shifting colors in the hazy sunlight you recall pleasant memories of
family and friends, perhaps, or an orchestral piece by Schoenberg. POW!
POW! Putt-putt-putt... An unseen neighbor has fired up the outboard
motor. The putt-putts average about 2 per second, a period of 500
milliseconds or a frequency of 2 Hz. This auditory onslaught has a tempo
of mm=120 (120 putts per minute). As the Noisy Neighbor, e.g. Figure 3,
opens the throttle, the engine runs faster and faster until the putts blend
together into a low, buzzy tone. The boat motor has only changed speed
but our perception of its sound has changed dramatically.
Measures Of Sound
Figure 4. Potential for a really loud sound
Loudness
The disturbance of air molecules caused by mechanical motion can
be measured to obtain a number that corresponds to the sound's
acoustical energy, but our ears provide a familiar measure of perception
that we think of as loudness. The ear distinguishes among a very large
range of sound pressures, and does so by responding to ratios of
acoustical energy.
The lower limit of perception is the threshold of audibility, below
which sounds are too soft to be heard. The upper limit is the threshold of
pain, beyond which prolonged exposure to extremely loud sound would
damage the auditory system, resulting in high frequency hearing loss,
tinnitus (a persistent buzzing noise) or deafness. The equipment shown in
Figure 4 could damage a person’s hearing, but so could headphones or
ear buds, if used indiscriminately.
Because of the extremely large range of sound intensities that the
ear and the rest of the auditory system can discern, and because of the
way the sensory system responds, it is useful to think in terms of ratios
of sound intensity rather than absolute measures.
One way to measure a sound’s intensity, which as mentioned earlier
corresponds to the motion of perturbed air molecules, is to use a
calibrated microphone connected to a special display. Such a device is
called a sound pressure level meter, and its display reflects the power of
sound found in a defined area of air. Because the class will concern itself
with synthesis and recording, it is more convenient to think of the
intensity of a sound as the pressure it exerts instead of the power of the
sound. As confusing as this might sound, the intensity of the sound will
be proportional to the voltage of an electric signal emanating from a
computer, but the power corresponds to the voltage multiplied by the
ensuing current flow.
Figure 5. Diagram of a loudpeaker taken from
http://hyperphysics.phy-astr.gsu.edu/hbase/hframe.html
To help clarify the distinction, consider a loudspeaker, such as the
one diagrammed in Figure 5. Although designs vary widely, some speakers
are as simple as a paper cone attached to an electromagnet that moves
freely in a magnetic field provided by a permanent magnet. Voltage is
electrical pressure, like water pressure in a pipe. When pressure is applied,
current can flow. The greater the voltage, the more current flows. A
negative voltage makes current flow in the opposite direction. When
current flows through the electromagnet, it creates a magnetic field that
either pulls the cone away from or else pushes it toward the permanent
magnet. Thus a varying voltage applied to a speaker makes the cone
move, provided the supply of current is adequate for the work involved in
moving the cone. Efficient speakers require less current than inefficient
ones, but for the discussion that follows, let’s ignore current by assuming
there is always plenty, and consider only the voltage that is applied to the
speaker.
Imagine that we hear three sounds of increasing loudness. If the first
sound is produced when 1 volt is applied to a speaker, the second sound
when 2 volts are applied, and the third sound when 4 volts are applied,
the third sound will seem louder than the second to the same degree as
the second is to the first. In other words, doubling the voltage produces
an equal change in the perceived loudness of the sound. If the voltages
involved were .25, .5 and 1, the succession of loudness changes would be
the same, even though the overall loudness is less. This is why it is useful
to think of ratios of intensity instead of absolute measures. A similar
phenomenon governs the relationship of musical pitch to frequency,
which will be discussed later.
Before marveling at the range of sound intensity that the human ear
can discern, we should review some simple math.
Repeated multiplication gives rise to the idea of an exponent,
written as a raised superscript after the base. So 2 times 2 times 2 can be
expressed as 2³, and 10×10×10×10 = 10⁴. The same expression works with fractions. The
reciprocal of a number is formed by dividing “1” by the number, so the
reciprocal of 2 is 1/2, the reciprocal of 10 is 1/10, and so on. Using
exponents, the reciprocal is formed by changing the sign of the exponent,
thus just as 2¹ = 2, so 2⁻¹ = 1/2. A non-zero number raised to the zero
power is 1, completing the simple list of possibilities: 10³ = 1000,
10² = 100, 10¹ = 10, 10⁰ = 1, 10⁻¹ = .1, 10⁻² = .01, 10⁻³ = .001 and so on.
Like a lot of mathematical things, the process of exponentiation can
be reversed. Just as you can square a number by multiplying it by itself
(“4 squared is 16”), you can take the square root of a number by asking
what number multiplied by itself equals the number in question: (“the
square root of 16 is 4.”) Raising a known base to an exponent is equally
useful in reverse. Questions such as “to what power must I raise the base
10 to yield 10000?” or “to what power must I raise the base 10 to yield
.01?” are the essence of the logarithm. Here are the answers:
log₁₀(10000) = 4, and log₁₀(.01) = -2. Students who need to review
exponents, logarithms and similar mathematical tools would be well
advised to dust off their math textbooks, or head to a website such as
http://hyperphysics.phy-astr.gsu.edu/Hbase/hmat.html before
proceeding to our discussion of the enormous range of the human ear, as
pictured in Figure 6.
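Readers with a computer at hand can verify these values directly; here is
a sketch in Python (the math module is part of the standard library):

    import math

    # Exponents, reciprocals, and their inverse operation, the logarithm
    print(10**4)                 # 10000
    print(math.log10(10000))     # 4.0
    print(math.log10(0.01))      # -2.0
    print(2**-1)                 # 0.5, the reciprocal of 2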
Figure 6. The human ear
from http://personal.cityu.edu.hk/~bsapplec/Fire/Image299.gif
The ratio of intensity at the threshold of pain to the intensity at the
threshold of audibility is about 1,000,000,000,000 to 1, rewritten more
simply as 10¹²:1. This huge ratio is testimony to the impressive ability of
the human ear, shown in Figure 6, to perceive sound. If we divide this
huge ratio into 12 parts, represented by increasing powers of 10, we
create the bel scale of loudness, with 12 bels representing the full range
from audibility to pain. Dividing each bel into 10 equally spaced ratios
yields the decibel, the measure of sound frequently used to describe
sound intensity. The full range of hearing is thus represented within a
range of about 120 decibels.
Sound Intensity
Threshold Of Hearing 0 dB
Grand Canyon North Rim 12 dB
Recording Studio 24 dB
Quiet Home 36 dB
Quiet Audience 42 dB
Office Ambiance 54 dB
Busy Street 68 dB
Factory 78 dB
Subway Train 94 dB
Fire Engine (onboard) 96 dB
Thunder (close up) 108 dB
Rocket Engines 152 dB
Figure 7. Decibel Measures of Various Sounds
The decibel is defined as 20·log₁₀(A₁/A₂), where A₁ and A₂
represent two amplitudes or voltages. If ratios of powers (or watts) are
measured, the formula is 10·log₁₀(P₁/P₂). By defining the threshold of
audibility to be 0 dB, one can describe sounds of varying loudness with
respect to that threshold, as shown in Figure 7.
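The definition is easy to exercise in Python (a sketch; the ratios are
chosen to echo the discussion above):

    import math

    def decibels(a1, a2):
        # 20 times the log (base 10) of an amplitude or voltage ratio
        return 20.0 * math.log10(a1 / a2)

    print(decibels(2.0, 1.0))       # about +6 dB: doubling the amplitude
    print(decibels(0.5, 1.0))       # about -6 dB: halving it
    print(decibels(10.0**6, 1.0))   # 120 dB

The last line uses an amplitude ratio of 10⁶, which corresponds to the
intensity (power) ratio of 10¹² between the thresholds of audibility and
pain.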
The ear exhibits heightened sensitivity to frequencies of about
3000 Hz and lessened sensitivity to frequencies above and below that
number, thus perceived loudness is also a function of frequency (to be
discussed.) This is why a shrill piezo-electric chip in a digital watch can be
heard above the voice of a university professor.
Pitch
Figure 8. A high note
The most important measure of musical sound is pitch, which
distinguishes most musical sound, even that of Figure 8, from
environmental sound. Most musical sounds possess a nearly regular rate
of vibration. When the vibration is harmonic, a musical pitch can be
perceived that corresponds to the frequency of the vibration. Just as the
ear discriminates among ratios of acoustical energy to perceive loudness,
the sensation of pitch arises from discrimination among ratios of
frequency.
Few musicians have the ability to identify a pitch when hearing it
alone, isolated from other known pitches. This ability, called "perfect
pitch," though quite valuable in certain circumstances and annoying in
others, is not necessary for musical development. More important for
most musicians is a good sense of "relative pitch," an ability to identify
the intervals among pitches quickly and reliably.
Interval Name Ratio
Octave 2:1
Perfect Fifth 3:2
Perfect Fourth 4:3
Major Third 5:4
Minor Third 6:5
Figure 9. Musical intervals and their frequency ratios
Musical intervals correspond to simple ratios among frequencies, as
illustrated in Figure 9. These intervals all occur in the harmonic series that
results from harmonic vibration.
Figure 10. First 6 partials of a cello string
Figure 10 shows the notated pitches that correspond to the first 6
harmonic partials that result when a cellist plays on the lowest open string
of the cello. “F” is the frequency of the first partial, also called the
fundamental. The second partial has twice the frequency of the first, and
is an octave higher. The third partial is three times the frequency and so
on. The terms “partial” and “harmonic” are sometimes used in confusing
ways. Some writers refer to the second partial as the first harmonic,
others use harmonic and partial interchangeably. The term “overtone” is
usually understood to apply to the partials other than the fundamental.
Musicians approximate these intervals in performance with results
that vary widely depending on the characteristics of the instrument and
the individual musician’s ability to anticipate the correct pitch and to
adjust the actual pitch in a musical fashion. However, fixed pitch
instruments such as the piano must be pre-tuned and cannot be
purposefully varied during performance.
The problem of tuning a scale using simple intervals is not a trivial
one, as is easily demonstrated. Imagine tuning a chromatic scale using the
intervals of perfect fifth and octave to create a circle of fifths folded into
one octave.
Begin with some frequency "f" that we will declare is the musical
note “C.” Multiply f by 1.5, the interval of the perfect fifth. If the
resulting frequency is greater than 2.0 (the octave above), divide the
frequency by 2, to transpose down one octave. Continue until the circle is
complete.
Note Frequency Frequency *1.5 Frequency/2
C f 1.5f
G 1.5f 2.25f 1.125f
D 1.125f 1.6875f
A 1.6875f 2.53125f 1.265625f
E 1.265625f 1.8984375f
B 1.8984375f 2.84765625f 1.423828125f
F# or Gb 1.423828125f 2.1357421875f 1.0678710938f
Db 1.0678710938f 1.6018066407f
Ab 1.6018066407f 2.402709961f 1.2013549805f
Eb 1.2013549805f 1.8020324708f
Bb 1.8020324708f 2.7030487062f 1.3515243531f
F 1.3515243531f 2.0272865297f 1.0136432649f
“C” 1.0136432649f
Figure 11. Pythagoras’s Comma Revealed in the Circle of Fifths
Instead of ending up at frequency f we find ourselves noticeably out of
tune at 1.0136432649f, as Figure 11 demonstrates. The system of
tuning by perfect fifths is called Pythagorean tuning and the error is called
“Pythagoras’s comma.”
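The whole circle can be walked in a few lines of Python (a sketch of the
procedure just described):

    # Tune by perfect fifths (ratio 3:2), folding back into one octave
    ratio = 1.0
    for step in range(12):
        ratio *= 1.5           # up a perfect fifth
        if ratio >= 2.0:
            ratio /= 2.0       # down an octave
    print(ratio)               # about 1.01364, not 1.0: the comma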
Note Name Frequency (by repeated multiplication) Frequency (as a power of r)
C f f*r⁰
C# f*r f*r¹
D f*r*r f*r²
D# f*r*r*r f*r³
E f*r*r*r*r f*r⁴
F f*r*r*r*r*r f*r⁵
F# f*r*r*r*r*r*r f*r⁶
G f*r*r*r*r*r*r*r f*r⁷
G# f*r*r*r*r*r*r*r*r f*r⁸
A f*r*r*r*r*r*r*r*r*r f*r⁹
A# f*r*r*r*r*r*r*r*r*r*r f*r¹⁰
B f*r*r*r*r*r*r*r*r*r*r*r f*r¹¹
C f*r*r*r*r*r*r*r*r*r*r*r*r f*r¹² = 2f
Figure 12. Computing the Equal Tempered Scale
All tuning systems contain a comma (error). The system of equal
temperament maintains the integrity of the octave and distributes the
comma among all the other intervals. In equal temperament the ratio of
the half-step is chosen so that 12 successive multiplications moves from
starting frequency f to exactly 2f, the octave above, as shown in Figure
12.
What is the ratio, "r," of the half-step in equal temperament? If "f"
is the frequency of middle C, then each semitone will have a frequency
that is r times the previous semitone. If r¹² = 2 (the octave), then the
ratio of the equal-tempered half-step is the 12th root of 2, or about
1.059.
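Here is the computation in Python (a sketch; 261.63 Hz is used as a
convenient approximate frequency for middle C):

    # The equal-tempered half-step ratio is the 12th root of 2
    r = 2.0 ** (1.0 / 12.0)
    print(r)                     # about 1.0594631

    f = 261.63                   # middle C, approximately, in hertz
    for n in range(13):
        print(n, f * r**n)       # the 12th step lands exactly on 2*f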
Timbre
Pitch and loudness aside, the tone quality of a musical sound is called
timbre and is not as easily represented in quantitative terms as pitch and
loudness. The timbre of a musical instrument is a recognizable quality that
is nevertheless subject to great variation over the pitch and loudness
range of an instrument. If a simple passage is played by a beginning
oboist and also by a consummate professional, the two performances will
have little in common except for the distinctive oboe timbre.
A musical sound has a temporal history. The beginning of the sound
contains distinctive characteristics that help to identify the instrument.
During the transition from silence to the main portion of sound, an
instrument may pass through several unstable states, introducing noise or
roughness. Although few musical sounds are ever absolutely stable, most
instrumental tones have a sustaining area in which the tone quality
changes only slightly. Early studies of instrumental timbre examined this
"steady-state" area by decomposing the sound into component parts.
Figure 13. Three cycles of a male singing “ah”
Examine the steady-state portion of some representative vocal
sounds. Here is a short excerpt from a recording of a male voice in Figure
13. The repetitive quality is what gives the sound a pitch. Each repetition
is called a cycle. Figure 13 shows 3 cycles of a male singing the vowel
"ah," as in "Bahamas." A cycle is often called a waveform. Notice that the
three cycles are very similar but not identical, although the peak
amplitude and period are almost constant from cycle to cycle.
Figure 14. Three cycles of a male singing “oo”
In Figure 14 you see a waveform display of the same voice singing
the vowel "oo," as in "cool." The period and amplitude are the same but
the shape of the waveform is smoother. The two sounds both share
characteristics of the male singing voice, but they have different vowel
qualities, and hence, different timbres.
A type of mathematical analysis called Fourier analysis can reduce
any periodic waveform, regardless of the mechanism that produced it, to
a sum of sine and cosine waves of varying amplitude, frequency and
phase. This representation of sound in "frequency space" (as opposed to
"time space", the waveform representation) is called the spectrum of the
sound, and plots amplitude against frequency instead of time.
A sine (or cosine) wave has no overtones. Thus the sine wave has
the simplest spectrum since it has only one component. The more angular
the waveform, the greater the number of sine waves required to
represent it, hence the spectrum is more dense.
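Readers with Python and the numpy library can watch Fourier analysis at
work (a sketch; the 440 Hz and 880 Hz components, their amplitudes, and
the one-second duration are arbitrary choices):

    import numpy as np

    sr = 8000                         # sample rate in hertz
    t = np.arange(sr) / sr            # one second of sample times
    wave = np.sin(2*np.pi*440*t) + 0.5*np.sin(2*np.pi*880*t)

    # Magnitude spectrum; with 1 Hz bins, index n corresponds to n Hz
    spectrum = np.abs(np.fft.rfft(wave)) / (sr / 2)
    print(spectrum[440], spectrum[880])   # about 1.0 and 0.5

The two sine components reappear as two spectral lines with the
amplitudes we put in.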
Some Waveforms and Their Spectra
Many aspects of timbre can be discussed in spectral terms. If the
spectrum is harmonic, it comprises a fundamental partial plus a number of
higher partials with frequencies that are integer multiples of the
fundamental frequency. Spectra with more complex relationships among
their partials are said to be inharmonic and are associated with
clangorous, metallic sounds. Spectra without clearly defined partial
frequencies, consisting instead of broad bands of energy, are noise-like,
and are associated with sounds that have an indefinite pitch. The lines in
non-noise spectra may be dense or sparse, and the energy in the
spectrum will usually diminish as the partial frequencies increase. This last
quality determines the relative "brightness" of the sound.
Figure 15. “Classic” electronic waveforms and their spectra
A few waveforms are indigenous to electronic music because they
are easily produced by analog electronics and provide a broad selection of
timbres. Some of these waveforms, shown in Figure 15, are the sine,
sawtooth, square, triangle and pulse: idiosyncratically simple waveforms
with well-defined spectra. The sine, as explained earlier, has only one
spectral line. By contrast, the sawtooth waveform has a spectrum
containing all the harmonic partials.
Partial # Relative Frequency Relative Amplitude
1 f 1
2 2f 1/2
3 3f 1/3
4 4f 1/4
…
n nf 1/n
Figure 16. Spectral Characteristics of the Sawtooth Wave
Figure 16 shows that the amplitude of each partial in the sawtooth
spectrum is inversely proportional to (i.e. the reciprocal of) the partial
number. Now, recall that the frequency of the partials doubles at the
octave. Due to the inverse relationship of amplitude to frequency, the
partial amplitudes decrease by half over the same octave interval. Using
the definition of the decibel, we can determine by calculation that a 50%
decrease in amplitude is equal to an attenuation of 6 decibels, because
20·log₁₀(1/2) ≈ -6. And because the relationship extends to all the partials
of the sawtooth wave's spectrum, the sawtooth spectrum has a
continually declining amplitude curve of 6 decibels per octave.
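Both the 6 dB figure and the sawtooth itself can be sketched in Python
(an additive-synthesis illustration; the cutoff of 20 partials is
arbitrary):

    import math

    print(20 * math.log10(0.5))   # about -6.02 dB per halving of amplitude

    # One sample of a band-limited sawtooth: partial n has amplitude 1/n
    def saw(t, f, npartials=20):
        return sum(math.sin(2 * math.pi * n * f * t) / n
                   for n in range(1, npartials + 1))

Summing more partials sharpens the waveform’s corners; summing fewer
rounds them off.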
Partial # Relative Frequency Relative Amplitude
1 f 1
2 2f 0
3 3f 1/3
4 4f 0
5 5f 1/5
6 6f 0
…
n nf 1/n if n is odd, else 0
Figure 17. Spectral Characteristics of a Square Wave
The square wave has the same inverse amplitude property (and
hence, the same 6 dB slope-per-octave) but only has odd numbered
partials, as listed in Figure 17. To preserve the relationship between
partial number and partial frequency, it is useful to retain the missing
partials in the numbering scheme and note their absence with a null
amplitude (a zero).
Partial # Relative Frequency Relative Amplitude
1 f 1
2 2f 0
3 3f -1/9
4 4f 0
5 5f 1/25
6 6f 0
7 7f -1/49
etc.
Figure 18. Spectral Characteristics of a Triangle Wave
Like the square wave, the triangle wave also has only odd numbered
partials, but the amplitude of the partials is inversely proportional to the
square of the partial number, indicated in the last column of Figure 18. As
the frequency increases by one octave, the amplitude falls to 25% of that
of the lower octave, so the amplitude slope-per-octave of the triangle
wave is twice as great as that of the square and sawtooth waves: 12 dB.
The value of 12 dB is calculated as follows: 20·log₁₀(1/4) ≈ -12. Another
interesting and distinctive property of the triangle wave is that every
second non-null partial (i.e. partials 3, 7, 11, etc.) has a negative
amplitude.
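The three amplitude recipes can be stated compactly in Python (a sketch
that reproduces the tables above):

    # Relative amplitude of partial n for the three classic waveforms
    def sawtooth_amp(n):
        return 1.0 / n

    def square_amp(n):
        return 1.0 / n if n % 2 else 0.0     # even partials are absent

    def triangle_amp(n):
        if n % 2 == 0:
            return 0.0
        sign = -1.0 if n % 4 == 3 else 1.0   # partials 3, 7, 11... negative
        return sign / (n * n)

    for n in range(1, 8):
        print(n, sawtooth_amp(n), square_amp(n), triangle_amp(n))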
Because pulse waves have only two values ("on" or "off"), the
percentage of time during each waveform that the value is on is called
the duty cycle. So it follows that the square wave is said to be a pulse
wave with a 50% duty cycle. As the duty cycle becomes shorter, the
slope-per-octave decreases, meaning that the higher-frequency partials
become more prominent and the sound becomes brighter. A spectrum
containing all partials at equal amplitude corresponds to a waveform with
an infinitely short duty-cycle.
One final curiosity: if the duty cycle can be expressed as the
reciprocal of an integer (1/2, 1/3, 1/4, etc.), there will be a node (a partial
with zero amplitude) at every multiple of the denominator. Just as the
square wave's duty cycle of 1/2 corresponds to nodes at each even
numbered partial, a pulse wave with a duty cycle of 1/3 has spectral
nodes at partials 3, 6, 9, etc. Nodes in the spectrum are audible, and
contribute to the timbre of a waveform.
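This node pattern follows from the standard Fourier series of a
rectangular pulse, in which the amplitude of partial n is proportional to
sin(πnd)/n for duty cycle d (a result stated here without derivation). A
Python sketch shows the nodes appearing:

    import math

    d = 1.0 / 3.0                  # a pulse wave with a 1/3 duty cycle
    for n in range(1, 10):
        amp = math.sin(math.pi * n * d) / n
        print(n, amp)              # partials 3, 6, 9 come out essentially zero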
If we combine all the possibilities: harmonic and inharmonic spectra,
the amplitude slope-per-octave, and spectral nodes, we begin to get a
sense of the high level of detail contained in a steady-state (unchanging)
musical sound. When a sound unfolds in time, with complex onset
transients at the beginning and subtle irregularities throughout, it
comprises a richly textured event, and a complete description of its
timbre will be lengthy and complex. It is tempting to over-simplify a
discussion of musical timbre, but to do so would be terribly misleading,
because timbre is a composite perception influenced by many different
acoustical events.
FOR FURTHER READING
Other important topics in musical acoustics include resonance,
formants, reverberation, musical instrument and concert hall design, and
mechanisms for recording and reproducing sound. The curious student
may want to refer to the following excellent texts:
BACKUS, JOHN. The Acoustical Foundations of Music. New York: W. W.
Norton & Co., 1977.
COOK, PERRY, Ed. Music, Cognition and Computerized Sound. Boston: MIT
Press, 1999.
DODGE, CHARLES, AND JERSE, THOMAS. Computer Music: Synthesis,
Composition, and Performance. New York: Schirmer Books, 1997.
PIERCE, JOHN R. The Science Of Musical Sound. New York: W.H. Freeman
and Company, 1992.
WINCKEL, FRITZ. Music, Sound and Sensation. New York: Dover
Publications, 1967.
CHAPTER III
A FEW WORDS ABOUT COMPUTERS
Computers and computing machinery can be used in almost every
musical activity, including performance, synthesis, publishing, recording,
editing and sound reproduction. The computer is prominent in some of
these applications but is hidden in others.
When a personal computer is equipped with hardware and software
for musical use, the role of the computer is usually obvious. Such
computers continuously run a program called an operating system that
coordinates the flow of information and provides support for an
application program, such as a music-printing program, which the user will
run for some limited period of time. These familiar computers are based
on a specialized chip called a microprocessor and contain other supporting
circuitry to make the computer versatile. Some music and sound gadgets
use dedicated microprocessors (customized to some degree for special
use) running a single special program called an embedded operating
system. These computers and programs are so integrated into special
hardware that one is generally not aware that computing hardware is
involved. Some examples of "hidden" computers used by musicians
include: MP3 players, compact disc players, digital sound recorders,
keyboards, tuners, metronomes and the ubiquitous cell phone. These
devices all use a computer as controller, signal processor and storage
manager.
At the time of this writing, a computer consists of several
microprocessors, each containing arithmetic-logic units, special high speed
memory used to store (cache) instructions and data, and connections to
the external bus of the computer. It will also contain RAM, ROM and
controllers for disks, data ports, Ethernet, audio/visual and other wired or
wireless input/output devices including a keyboard, mouse and a display
screen.
Special purpose applications may have highly streamlined forms: a
CD player will feature a high-speed mass storage device (the physical CD
"reader") that uses a laser beam and tracking system, a stored program
for the operation of all the mechanical and electronic systems in the
player, and a streamlined input keypad and display screen. Special
hardware converts the data read from the CD to electrical signals. Both
the general purpose personal computer and the CD player have the
characteristics of computers: they are programmably predictable and can
rapidly and reliably perform millions of operations per second. They both
have input/output devices, a microprocessor and a mass storage device,
and execute a pre-existing program.
Computers of all sorts employ a clock to schedule events and
manage processes, and the clock rate of a computer is one measure of its
speed. If two computers are identical except for clock speed, and if both
machines executed identical programs that were not constrained by the
operating speed of any external components, the computer with the
faster clock rate would complete the program proportionately sooner.
However, external constraints can affect the actual speed of operation.
Figure 18. A congested road slows traffic
Think of a familiar situation in which potential speed and actual speed can
differ, such as driving a car on a heavily traveled road, like the one
pictured in Figure 18. The distance from my house to Temple University is
about 10 miles. There are few hills or turns, so if I was the sole driver on
the route and knew that the stop signs were removed and the traffic
lights were all green, I could make the trip in 12 minutes. My neighbor, in
a more powerful car, might make the trip in 6 minutes. But once other
drivers are present and traffic signals enabled, my neighbor and I each
take about 30-50 minutes to travel the same ten miles, and my car may
be less expensive to operate than my neighbor’s car.
Similarly, the clock rate is only one measure of computing power.
The capacity of the data pathway, called the "bus," is another significant
factor. At the risk of abusing the commuter metaphor, imagine that 10
workers need to be transported from point A to a branch office located at
point B. Assuming that their employer owns only one vehicle, it will take
more than twice as long to move them in a vehicle seating 5-9 people
as in one seating 10 or more.
The interface between the computer and its peripheral devices
affects the overall performance of the system, too. In audio applications
the transfer of sound data from the mass storage system to the
computer and to the special hardware that converts the data into
electrical signals should proceed without interruption. Relatively short
delays in data transfer can cause noticeable errors in audio playback.
As recently as the 1990s, high performance audio computing applications
required special hardware and operating system design, and not all
personal computers could support high quality audio recording and
reproduction.
But manufacturers and programmers have paid increasing attention
to audio and video processing capacity with each new generation of
products, and now the composer or video artist has a wide range of
machines from which to choose.
Figure 19. Two images of a pear, taken from
http://livedocs.adobe.com/flex/2/docs/images/ascii_art.jpg
No matter how "user friendly" a computer may appear, it is a
complicated system. Each picture on the screen, including Figure 19, each
character in a word processor document like this one, and each sample in
a sound recording has been coded using binary numbers. One does not
have to understand binary numbers to use a computer, any more than one
has to understand compound interest to take out a loan, and ignorance is
bliss until there are problems to solve.
Reversing or recasting the question, "Do I have to know about binary
numbers?" yields a bigger, better question: "Will my understanding of the
increasingly digital world around me be advanced by knowing something
about binary numbers?"
This is a yes-or-no question, an example of a binary choice. A binary
number system needs only two symbols: "0" and "1." The symbols mean
the same thing in a binary (base 2) number that they mean in the more
familiar decimal (base 10) system where we might speak of the "one's"
place, the "ten's" place, etc., and use a larger set of symbols:
0123456789
which are scaled by the power of ten their place represents.
The decimal number “1995” is a number signifying the sum of 4
quantities: 1 thousand, 9 hundreds, 9 tens and 5 ones or:
1 × 10³ + 9 × 10² + 9 × 10¹ + 5 × 10⁰
Similarly, decimal fractions use combinations of powers of ten raised to
negative exponents:
3.142 = 3 × 10⁰ + 1 × 10⁻¹ + 4 × 10⁻² + 2 × 10⁻³
Binary numbers are especially interesting to study because they are,
at some basic level, the representation a computer uses. The basic idea of
binary is the same as decimal. In the binary number system, two symbols
(0 and 1) represent the presence or absence of various powers of 2, all
added together. Thus the binary number 1010 is understood to mean:
1 × 2³ + 0 × 2² + 1 × 2¹ + 0 × 2⁰
The binary number 1010 thus equals 10 in decimal notation.
Often binary digits are grouped by fours and leading zeros are added
to make them easier to read. Eight bits so grouped together are called a
byte, and 4 bits a nybble. Other number systems include the octal
system, which is based on powers of 8 and uses the symbols 0 1 2 3 4 5
6 7, and the hexadecimal system which is based on powers of 16.
Hexadecimal numbers have an exotic look to them because letters are
used for the digits corresponding to decimal 10, 11, 12, 13, 14, and 15,
so that hexadecimal numbers use the symbols:
0123456789ABCDEF
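Python will happily convert among these bases, which makes it a
convenient scratch pad (a short sketch):

    # The same quantity, ten, written in three other bases
    print(int("1010", 2))    # binary      -> 10
    print(int("12", 8))      # octal       -> 10
    print(int("A", 16))      # hexadecimal -> 10
    print(bin(10), oct(10), hex(10))    # '0b1010' '0o12' '0xa'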
Binary numbers existed before modern computing machinery.
Communications engineers used the "bit" (a contraction of "binary digit")
as a measure of information. A single bit represented the smallest unit of
information, just enough to resolve the uncertainty in a one-out-of-two
choice. A light switch sets the lighting circuit into one of two states, on
or off, unlike a dimmer control which offers a number of possible light
levels. Binary numbers can also represent logical conditions like "true" and
"false". Binary switches are reliable and easy to engineer, and computers
are full of them.
Numbers can be used to represent information other than quantity.
They can represent codes and labels, for example. Telephone numbers are
decimal numbers that serve as codes for circuit connections. Street
addresses are a good example of codes that are partly quantitative. If we
are looking for 204 Rock Road and have passed 212 and then 210, we
figure it is just ahead. Yet few travelers would try to get to 10251 Long
Lane by driving exactly 100 blocks past 251 Long Lane. Other numbers
represent coded messages. In radio-speak "10-4" means "understood" and
"86" means "delete." Other examples probably come to mind.
Text can be numerically encoded by using a number for each key on
the typewriter keyboard. One standard coding scheme is the American
Standard Code [for] Information Interchange, or ASCII. Figure 20 is an
example of an ASCII text file shown in hexadecimal notation (left) and
printable characters (right):
Figure 20. ASCII encoding of simple text data
In Figure 20, above, spaces are coded as the hexadecimal number
20, which corresponds to the decimal number 32, which means "SPACE."
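The same correspondence is easy to inspect in Python (a sketch; ord gives
a character's code and hex renders it in hexadecimal):

    for ch in "Ab ":
        print(repr(ch), ord(ch), hex(ord(ch)))
    # 'A' 65 0x41, 'b' 98 0x62, ' ' 32 0x20 -- the space is hex 20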
But images can be encoded as numbers as well. In black and white bit-
mapped images, a 0 indicates a white picture element (pixel) and a 1
indicates a black pixel. The amount of information required to encode a
picture of text is much greater than the information required to encode
the text using ASCII or some similar method. In Figure 21, you see the
hexadecimal listing of a pictorial bit-map of the top right-hand corner of
Figure 20.
Figure 21. Numerical codes for white space
Each "0000"' corresponds to a small dot-sized piece of white space
on the page. It takes a lot of dots to make up a picture. Consider
this: a FAX transmission of a typed letter contains a highly detailed
bit-mapped picture of the entire letter. Electronic mail usually
contains only ASCII text. Which method is more efficient? Clearly, text is
more efficient than pictures of text, even if extra data is added for
information about fonts and layout. ASCII typically uses one byte per
character. Unicode, a more recently developed standard, uses from one to
four bytes per character (depending on the encoding), and provides codes
for all manner of scripts besides those typically used for roman text.
Programming Basics
The computers commonly employed today are highly reliable,
deterministic machines. To be of use, a computer processes a series of
instructions pertinent to some data, and generates one or more results.
A series of instructions, along with the data necessary to use them,
comprises a computer program. The conventions for interpreting the
program make up a programming language. Programs used to carry out a
particular task are application programs (or applications, or apps), while
programs that create an environment for the user are called operating
systems.
Early operating systems were highly dependent on the particular
computer hardware on which they ran, written in low-level languages tied
to the electrical structure of a particular processor. However, in the early
1970s a language named C was created at Bell Labs that could be used
to write operating systems that could run with minimal adjustment on a
wide variety of hardware. Dennis Ritchie, in a paper about the evolution of
the Unix operating system, wrote in 1979:
Perhaps the most important watershed occurred during 1973,
when the operating system kernel was rewritten in C. It was at this
point that the system assumed its modern form; the most far-
reaching change was the introduction of multi-programming. There
were few externally-visible changes, but the internal structure of the
system became much more rational and general. The success of this
effort convinced us that C was useful as a nearly universal tool for
systems programming, instead of just a toy for simple applications.
Today, the only important Unix program still written in assembler
is the assembler itself; virtually all the utility programs are in C, and
so are most of the applications programs.
This development also meant that processor-intensive applications,
such as sound synthesis and recording, could be written in processor-
independent languages. One famous computer music language, also born at
Bell Labs, was "Music", by Max Mathews; its descendants eventually
became Csound, developed at MIT by composer Barry Vercoe.
(http://www.csounds.com/community/history/)
Synthesis in C
After Music, Bell created Music2, Music3, and Music4. Joan Miller, a
talented programmer and avocational violinist, translated Music4 from an
in-house language, BEFAP, into the widely used language named
FORTRAN. This version, known as Music4bf, was released to universities
for experimental use.
University professors and graduate students began to modify,
expand and improve Music4bf, usually to adapt it to a specific computing
machine, such as Music11, tailored for the PDP-11 minicomputer, or
Music360 for the IBM mainframes. Eventually, Barry Vercoe, a professor
at MIT, rewrote the program in the C programming language. His creation,
CSOUND, has been in constant use since 1985, and is actively maintained
by users around the world. You can find it at: https://csound.com
Csound, like the earlier Music programs, uses a model of an orchestra
and a score. The orchestra comprises a number of instruments that play
“notes” from the score. Each note has an identifying instrument number
used to route the rest of the note to the proper instrument. A note also
has a starting time and a duration. When the time is right, an instrument
gets a note, turns on, and plays until its duration is over.
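The model is easy to sketch in C. The fragment below is only a schematic
of the orchestra-and-score idea, not Csound's actual source code; the names
and fields are invented for illustration:

    /* A schematic of the orchestra/score model (not Csound's actual code). */
    typedef struct {
        int    instrument;   /* routes the note to the proper instrument */
        double start;        /* starting time, in seconds */
        double duration;     /* the note ends when this has elapsed */
        double amplitude;    /* remaining fields are passed to the instrument */
        double frequency;
    } Note;

    typedef void (*Instrument)(const Note *note);  /* one entry per instrument */

    /* Hand each note to its instrument while its time is "right". */
    void play(const Note *score, int count, Instrument *orchestra, double now)
    {
        for (int i = 0; i < count; i++)
            if (score[i].start <= now && now < score[i].start + score[i].duration)
                orchestra[score[i].instrument](&score[i]);
    }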
Pick A Number
Although the underlying representation is always binary, computers
support many kinds of more familiar numbers. Some common types are
integers, which are signed whole numbers, including zero; 2001, -1, and
10,000 are all examples of integers. Numbers with a decimal point
followed by a fraction are called floating point, or simply floats. 98.6,
3.14159, 8.00 are all floats. Numbers are also used to convey logical
values such as “true” and “false.” Typically 0 represents false, and non-
zero (such as 1) represents true. Figure 21 shows a truth table for
several logic gates, devices that perform simple logical operations.
Figure 21. Truth table for four logic gates.
Each gate has two inputs, A and B. The AND gate’s output is true
only if both inputs are true. Its evil twin, the NAND gate, outputs the
exact opposite. Similarly with OR and NOR.
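In C, the bitwise operators behave like these gates, and a few lines will
reproduce the truth table of figure 21 (a sketch for illustration):

    #include <stdio.h>

    int main(void)
    {
        printf("A B  AND NAND  OR NOR\n");
        for (int a = 0; a <= 1; a++)
            for (int b = 0; b <= 1; b++)
                printf("%d %d   %d    %d   %d   %d\n",
                       a, b, a & b, !(a & b), a | b, !(a | b));
        return 0;
    }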
The circuit shown in figure 22 uses two NAND gates to build a “flip-
flop”—a one bit unit of memory.
Figure 22. Arranging two NAND gates to make a flip flop.
If you send a 1 to the Set input, NAND gate 2 emits a 0, which appears at
output Q' (read Not Q), and that 0 is fed into one side of gate 1, which
emits a 1 that appears at Q. That 1 is fed back into the other input of gate
2, and the circuit holds its state with no further change. The 1 is stored
until the Reset signal is received.
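The feedback can be simulated in a few lines of C. In the most common
wiring of this latch the inputs are "active low," so the sketch below pulses
Set to 0 to store a 1; the principle of storage through feedback is the same
as in figure 22:

    #include <stdio.h>

    #define NAND(a, b) (!((a) && (b)))

    int main(void)
    {
        int set = 0, reset = 1;           /* pulse the Set input (active low) */
        int q = 0, qbar = 1;
        for (int i = 0; i < 4; i++) {     /* let the feedback settle          */
            q    = NAND(set, qbar);       /* gate 1 */
            qbar = NAND(reset, q);        /* gate 2 */
        }
        set = 1;                          /* release Set; the 1 is stored...  */
        for (int i = 0; i < 4; i++) {
            q    = NAND(set, qbar);
            qbar = NAND(reset, q);
        }
        printf("Q = %d, Q' = %d\n", q, qbar);  /* ...and Q = 1, Q' = 0 remains */
        return 0;
    }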
Thus the flip-flop is the basic unit of RAM. With gates and flip-flops
one can build counters and registers, and so on. Because of increasing
component density and the declining cost of building blocks like RAM,
increasingly sophisticated user interfaces make us forget about what’s
going on inside. For consumers, that can be a positive development. For
arts technologists it can be good news, too, but someone has to know
what’s going on below the surface in order to create new methods, new
creative approaches, and new arts and entertainment.
CHAPTER IV
SAMPLING THEORY
Sampling Theory tells us that continuous functions such as the variations
of amplitude in a waveform or the variations of atmospheric pressure in sound
can be represented by a series of discrete samples. Furthermore, the theory
describes the constraints of frequency and amplitude that exist in the sampled
system, and implies the hardware configuration needed for the digital
manipulation of sound.
Figure 23. Popular Singer Ada Jones experiments with the telegraph
Before the telephone, the telegraph provided a way to communicate using
electrical signals. A simple code of long and short duration patterns,
corresponding to letters of the alphabet, formed the basis for telegraph
messages that were sent over a telegraph wire or cable. The sender used a
telegraph “key” to tap out the patterns, and a “sounder” to hear the response
from the other end, as shown in figure 23, which also shows the patterns for
the letters on a card tacked to the wall.
Undersea cables connected the continents, but the
construction and maintenance of these cables were tremendously expensive.
Engineers tried transmitting the patterns at high speeds, and invented clever
schemes for sending multiple messages at once. The process seemed to be
limited by the cable’s tendency to leak high frequencies, especially over the
very long distances involved.
Figure 24. Harry Nyquist, developer of sampling theory
The theoretical basis for the digital representation of sound through
samples depends on the work of Harry Nyquist, pictured in figure 24, a
mathematician at the American Telephone and Telegraph Company, who
published papers in the 1920's about telegraph transmission speed. Nyquist
broke complex signals down into a sum of sine waves of various frequencies and
demonstrated that if one sends N different current values per second, all
frequencies above 1/2 N are redundant.
Figure 25. A recording thermometer
Consider an intuitive example of the limits of sampling. Imagine that you
want to record the temperature in your backyard on a summer afternoon. You
could purchase a $1000 thermometer, such as the one in figure 25, with a
chart recorder that would draw a line in ink on paper wrapped around a drum
rotated by a clockwork mechanism driven by an electric motor.
The stylus would remain on the paper chart at all times and every
variation in temperature would be plotted on the chart. The chart might
look like figure 26.
Figure 26. Continuous Temperature on a Pleasant Afternoon
The numbers 70, 72, 74, etc. represent the temperature in degrees and
the numbers on the x axis represent the time in minutes. There are several
noticeable artifacts (artificial representations not related to the actual data) in
your chart, including a high-temperature spike just after 220 minutes and a blob
of ink at 360 minutes. The blob resulted when you stopped the recorder without
first lifting the pen, and the spike probably happened when you swatted the fly
with your magazine and hit the thermometer too, almost knocking it off the
table.
Another way to record the temperature would be to buy a $5 thermometer
and use a pencil and paper to write down the temperature every 10 minutes.
Then your data might look like figure 27.
Minutes Degrees
20 74
30 74.5
40 75
50 75.5
60 76
70 76
80 76.5
90 76.5
100 76.5
110 76.5
120 76.5
130 76.5
140 77
150 77
160 (asleep) estimate: 77
170 77
180 77
[etc.]
Figure 27. Discrete Temperatures on a Pleasant Afternoon
Clearly, each method has captured the temperature history. The second
method used discrete numerical samples to represent the continuous data
whereas the first method created an analogous representation by drawing a
picture. There is also an artifact in the second method where a sample was
missed at 160 minutes. Rather than omit the sample entirely, an estimated
sample was created by taking an average of the preceding and following values
(or else by simply repeating the previous value.)
How often need samples be taken? In the backyard science experiment the
10 minute interval seems to have been a good choice because most of the
variations the thermometer can detect were captured by monitoring every 10
minutes. One could sample every minute, or even every 5 seconds and approach
the detail of the strip chart, but a lot more paper would be needed and there
would be no time to do anything else in between the samples. Conversely,
samples taken every 100 minutes would not capture the shape of the curve.
Intuition alone should tell us that the higher the sampling rate (or, to say the
same thing another way, the shorter the sampling period) the more detailed will
be the representation, and more data will need to be stored.
FREQUENCY LIMITS IN SAMPLING SYSTEMS
To answer the question about appropriate sampling rate more precisely we
go back to Harry Nyquist. He tells us that we must use a sampling rate of S to
capture frequencies as high as S/2. Stated another way, if samples are taken
every T seconds, then events with periods of 2T seconds or greater will be
correctly represented. The notion of the Nyquist limit is easy to understand
when studying digital representation of audio signals. If we know that the
frequencies in an audio signal fall between 20 Hz and 10000 Hz, then Nyquist's
limit says that we must sample at a rate of at least 20000 samples per second
to capture the highest frequencies. Conversely, a system sampling at 20000 Hz
can correctly represent frequencies up to 10000 Hz.
What about frequencies greater than the Nyquist limit? They are
represented incorrectly, creating a condition called aliasing or foldover.
Pictured in figure 28 is a sine wave with a frequency of 10000 Hz. The black
dots represent the values when the wave is sampled at 20000 Hz. Now study
figure 29, a picture of a 30000 Hz sine wave sampled at the same rate, 20000
Hz:
Figure 28. A 10 kHz sine wave, S = 20k
Figure 29. A 30 kHz sine wave, S = 20k
Note that the black dots are in the same place! The second figure shows that
the 30000 Hz signal will be represented by a 10000 Hz wave instead.
The formula for calculating an alias is:
F' = | NS - F |
where: F' is the represented frequency (perhaps an alias),
N is an integer equal to or greater than 0 chosen so that the
value of |NS - F| is less than Nyquist's limit,
S is the sampling rate,
F is the input frequency (before the sampling.)
(The vertical bars “| |” indicate absolute value.)
Using the formula to make a chart showing the effects of foldover for
some typical values, we see in figure 30 that once the Nyquist limit is reached, an
increase in input frequency creates a decrease in the represented (and aliased)
frequency:
F' N S F
1000 0 10000 1000
2000 0 10000 2000
2450 0 10000 2450
4000 0 10000 4000
5000 0 10000 5000
4000 1 10000 6000
2870 1 10000 7130
2000 1 10000 8000
1000 1 10000 9000
1000 1 10000 11000
2000 1 10000 12000
Figure 30. Examples of calculating F' = | NS - F |
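A chart like this is easily produced by program. The C function below (a
sketch) finds the represented frequency by trying N = 0, 1, 2, and so on until
|NS - F| falls at or below the Nyquist limit:

    #include <math.h>
    #include <stdio.h>

    /* Represented frequency F' = |N*S - F|, using the smallest N that
       brings the result at or below the Nyquist limit S/2. */
    double represented(double f, double s)
    {
        for (int n = 0; ; n++) {
            double fprime = fabs(n * s - f);
            if (fprime <= s / 2.0)
                return fprime;
        }
    }

    int main(void)
    {
        printf("%.0f\n", represented(7130.0, 10000.0));   /* prints 2870 */
        printf("%.0f\n", represented(12000.0, 10000.0));  /* prints 2000 */
        return 0;
    }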
Aliasing introduces frequencies that might coincide with frequencies
already present, causing reinforcement or cancellation, or that might be
unrelated to the frequencies already present, adding clangorous inharmonic
partials. Aliasing can be very annoying, and is regarded as one of the most
pernicious artifacts in digital sound systems. To eliminate aliasing, frequencies
above Nyquist must be filtered out before sampling an external signal, and
avoided when samples are calculated during editing, enhancing, processing or
synthesizing a digital signal.
AMPLITUDE LIMITS
The other aspect of sampling of great importance is the sample resolution.
Returning to the temperature analogy briefly, what would happen if you
monitored the body temperature of a fever victim and recorded the results to
the nearest 5 degrees? Rounding is also called quantizing because the rounded
data can vary only by a specific quantum determined by the degree of rounding.
What is your annual income rounded to the nearest $1000? The nearest
$100,000? Compare your net worth with that of someone worth
$1,000,000,000. How accurately do you need to calculate your net worth to
make a meaningful comparison?
In figure 31, you will see 4 representations of a sine wave. The resolution
decreases from left to right as the function is quantized to fewer and fewer
levels:
Figure 31. A quantized sine wave, 4 ways
The first form is a little grainy, but still looks like it represents a smoothly
varying function. The intermediate forms look more jagged, and the final form
(quantized to just two levels) is a square wave. The difference between the
source function and the quantized representation is a group of spurious partials
that are dependent on the relationship between the input frequency and the
sample rate. These extra partials were not in the original signal and thus are
characterized as noise, even though they are quite different from the "hiss" of
analog recording. Analog tape hiss is relatively broad band noise (distributed
evenly over a large frequency range instead of concentrated in partials like
overtones) that is present whether or not a signal is recorded on the tape. Since
the auditory system tends to ignore steady-state phenomena of long duration,
tape hiss is likely to be noticed only when the tape starts or stops, whereas
digital noise occurs only when a signal is present and may then contain
distinctive frequency components that are very obvious, making it more
disruptive than analog noise.
The higher the resolution of the sample, the less digital noise will be caused
by the rounding errors. Sample resolution is affected by the accuracy of the
hardware used to convert electrical signals into numeric form and back again, by
the accuracy of the calculations performed on the samples, and by the number
of digits in the sample word. The amplitude range of a sample system is often
described as the ratio of the maximum voltage level to the maximum voltage
error caused by rounding, a value called the "signal to error ratio" and often
measured in decibels. Since each additional bit in a binary number doubles the
number of voltage levels that can be specified and consequently reduces the
digital noise, each bit increases the signal:error ratio by about 6 dB.
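The 6 dB figure follows from the doubling itself, since 20 x log10(2) is
about 6.02. A few lines of C (a sketch) confirm the rule of thumb:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        for (int bits = 8; bits <= 24; bits += 8) {
            double levels = pow(2.0, bits);   /* each bit doubles the levels */
            printf("%2d bits: about %.1f dB\n", bits, 20.0 * log10(levels));
        }
        return 0;                             /* 8: 48.2, 16: 96.3, 24: 144.5 */
    }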
In summary, the frequency limit in a digital system is one half the sampling
rate and the amplitude range is roughly equivalent to 6 dB per bit in the sample
word. Aliasing results when the Nyquist limit is exceeded and digital noise is
excessive when the sample resolution is too coarse. Both aliasing and digital
noise are more objectionable than analog noise and distortion but can be
reduced to arbitrary levels by increasing the sample rate and resolution, at the
expense of more accurate hardware and increased data storage requirements.
Read more about these topics in:
PIERCE, JOHN R. An Introduction to Information Theory. New York: Dover
Publications, 1980.
POHLMANN, KEN C. Principles of Digital Audio. Carmel, Indiana: Howard Sams,
1992.
CHAPTER V
SOUND RECORDING AND EDITING
Armed with the concepts introduced in the preceding chapters, we should
find typical sound recording and editing applications straightforward to
understand. We will follow the process from acoustical event to digital
representation and back again, with several intermediate stops.
Figure 32. An early acoustic recording session
(http://www.charm.rhul.ac.uk/content/KCL_resources/beardsley_brief_history.html)
In early recordings, such as the RCA Victor session shown in figure 32,
musicians gathered around a metal funnel that focused the energy of the sound
pressure function (the motion of air molecules corresponding to the physical
vibration of an instrument or a voice). At the end of the funnel was a stylus
that cut a groove into a revolving wax cylinder. For playback, the stylus would
be set back to the beginning of the cylinder where it would track the undulating
groove cut during the recording process. The motion of the stylus would set the
playback funnel in motion and the shape of the funnel would help the small
motions of the stylus move enough air molecules for the sound to be audible.
This recording process was completely mechanical and sounds primitive today,
but much of today's recording jargon refers to the early cylinder recorder.
Impressive sounding people speak of "cutting" a record, laying down "tracks",
and being in a "groove." Like, wow.
The details of the process change when we employ electricity and
electronics. The first step in recording sound electrically is to convert the
sound pressure function into an electrical signal. Instruments that perform
this conversion are called transducers, and the obvious example is the
microphone. There are many varied microphone designs,
but the characteristics are similar enough for generalization.
Figure 33. A broadcast quality microphone
Most microphones respond to a broad range of frequencies and generate
an electrical signal analogous to the sound pressure function. The signal from
the microphone is a very low voltage. A very loud sound would cause a typical
microphone, such as the one pictured in Figure 33, to produce a signal with a
peak intensity less than 1/1000 of a volt (one millivolt.) Such a signal would
need 60 dB of amplification before further processing for recording. Small
amounts of electrical noise, including 60 Hz hum, are amplified as well, so
engineers devise strategies to limit this noise. The amplified signal is
transmitted electrically, and recorded in a variety of ways. During playback, the
recording mechanism retrieves the information in the proper time order and
reconstructs the electrical signal. After appropriate amplification, the signal is
sent to a loudspeaker, a transducer that changes electrical energy into
acoustical energy.
The evolution of recording technology in the 20th century began with
mechanical systems, moving through a combination of analog electrical and
magnetic systems and ending with digital electronics and laser beams. The
technology changed gradually, so that at any given time some hybrid
technology was in place. 78 rpm recordings might have been made with electric
microphones and amplifiers but would be played back on a crank-driven
mechanical phonograph. Digitally recorded and mastered recordings are still sold
today as audio cassettes, an analog format.
Each new generation of recording and playback systems improved on some
aspect of the past. Wider frequency range, lower distortion and increased
playing time accompanied the triumph of the 78 rpm disk over the Edison
cylinder, the long playing record (LP) over the 78, and the compact disc (CD)
over the LP. Music recording and editing caused profound changes in the way
musicians worked and the way people consumed music. Semi-literate musicians
with a convincing personal style could become "stars" to thousands of people in
a matter of months. Recorded music was more easily used as a tool in
advertising and propaganda as incessantly repeated and insidiously banal
"jingles" served to trigger conceptual association in the Pavlovian part of the
listener. Music suddenly was worth a lot of money (see Figure 34.)
Figure 34. How music becomes an “industry”
Although sound recording equipment was initially developed in the
United States, the commercial development of digital audio recording was
accomplished by Japan's state-coordinated industries through careful
planning and controlled competition. According to their plan, we will remain at a
hybrid analog-digital stage until digital microphones and loudspeakers are
introduced. To bridge the gap from the analog representation of sound to a
sampled representation, special circuits called Analog to Digital Converters are
used to convert the electrical signal into digital form, and Digital to Analog
Converters perform the reverse operation. Both devices are known by
acronyms: A/D and D/A converters, or sometimes simply ADC and DAC.
Let us imagine we will make a classical recording. First, we must rent a hall
and maybe a piano. We must hire performers, an engineer, and possibly a piano
tuner. Someone should coordinate the recording session and act as producer,
following the score and deciding when a passage has been recorded at least
once, but preferably twice, without extraneous noise, technical flaws ("what's
that buzzing sound?") or uneditable musical mistakes, and making a log of all
the segments recorded ("takes"), with notes about the quality of each. The
score is marked up as well, indicating the in and out points for groups of takes.
These producer's notes will later be used to make an editing "map," or detailed
list of take numbers and editing information.
The engineer will position microphones based on past experience, and
string cables to carry the electrical signal out of the performance area. The
electrical signals will be amplified, converted to digital form and recorded on a
semi-permanent medium such as a computer’s disk drive. The takes may be
played back through headphones or loudspeakers to test microphone placement
or musical effect. So far, the block diagram for our set-up looks like Figure 35.
Figure 35. Stages in Digital Recording and Playback
The two filters pass frequencies below Nyquist's limit and sharply attenuate
those above it. Filters that pass frequencies below some value but not above
are called low-pass filters. The filter before the ADC is called an anti-aliasing
filter because it removes frequencies from the input signal that could cause
aliasing. The filter after the DAC is called a smoothing filter because it
interpolates between the successive samples produced by the conversion
process, eliminating any "staircase" shapes from the output signal. The blocks in
the diagram do not always represent individual components; A/D and D/A
converters and their associated filters are often built into a digital recording
system as a single unit, although higher fidelity and lower noise are generally
achieved by using separate components.
Early recordings were not edited at all. The recording machine started to
run, and the performers recorded the complete selection with no stopping. The
resulting recording was either discarded, or reproduced and distributed or
broadcast. Editing became widely available with the introduction of tape
recording because the tape could be cut with a razor blade and attached to
other tape segments with splicing tape. The "outtake" was thus created. Tape
recording also opened the possibility of recording signals on different divisions
of the tape (multiple tracks) to be mixed together at a later time. Tracks could
be recorded and re-recorded in succession so that recordings could be built up
over time ("overdubbing"). Opera singers could overdub their high notes after
the primary session was complete and the orchestra paid and sent home.
Through repeated takes, popular artists could try over and over to find the
correct nuance for a phrase, and record production became an ongoing process
with musicians laying down tracks in New York, adding vocals in Los Angeles and
strings in Philadelphia, perhaps a children's choir in Brussels, etc.
The total dependence on editing and incremental production delayed the
initial acceptance of digital recording because the editing process was
cumbersome at first. Most digital tape recorders used a format that could not
be spliced with a razor blade, and early sound editing consoles cost as much as
large studio mixing consoles. However, as the cost of computer disk storage
dropped and the performance of personal computers increased, digital editing
became affordable, replacing tape splicing.
One reason for the popularity of digital editing is the extensive range of
features it affords. The length of overlap of the sections of sound to be joined
can be chosen precisely, as can the shape of the function used to control the
quick cross-fade that produces the virtual splice. Tracks can be slipped back and
forth in time and minute areas of sound can be repeated to elongate a note.
Pitch and rhythm errors can be corrected within certain limits, and a vast array
of noise-reducing and spectrum enhancing tools can be applied during the
editing process.
To edit the session tapes we imagined recording earlier in this chapter, the
editing engineer, using a visual display and sound playback, builds up a list of
edit points and associated data. On playback, the computer looks up sounds in
the edit list and joins them together according to the recorded function data.
When the editing process is complete, a new datafile is created that contains a
linear presentation of the recorded, edited and enhanced samples, and that file
can be used to fabricate compact discs for mass distribution, or formatted for
download.
SAMPLING RATES
Several different sample rates are used in commercial production, but most
are chosen to provide a frequency range of about 20000 Hz. Many of these
systems also use 16 bit samples to approach a dynamic range of about 96 dB,
but all of these standards are likely to evolve in the future. Personal computers
also support digital recording and playback of sound with various sample rates
and bits of resolution. Here, in Figure 36, are some examples.
Device Sample Rate (Hz) Sample Size (bytes)
Compact Disc 44,100 2
DAT recorder 44,100; 48,000; 96,000 2, 2.5
Synthesizer 30,000-50,000 1.5 - 2
Computer very wide range 1-4
Figure 36. Comparison of Sample Rates and Sizes
Each channel in a digital system is typically a separate sample stream with its
own storage requirements. To calculate the memory needed in bytes for an
audio segment multiply the length in seconds by the sample rate by the sample
size in bytes by the number of channels: M = LRSC
One minute of CD quality audio stored on a personal computer would
require 60 x 44100 x 2 x 2 = 10,584,000 bytes or roughly 10 megabytes
per minute. We see that the 70-80 minute audio CD is a high density storage
medium, indeed, and by extension the DVD and Blu-Ray disks are remarkable
technical achievements.
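The calculation is easy to wrap in a small C function (a sketch):

    #include <stdio.h>

    /* Storage in bytes: M = LRSC (length x rate x size x channels). */
    double storage_bytes(double seconds, double rate,
                         double bytes_per_sample, double channels)
    {
        return seconds * rate * bytes_per_sample * channels;
    }

    int main(void)
    {
        /* one minute of CD-quality stereo: prints 10584000 */
        printf("%.0f bytes\n", storage_bytes(60.0, 44100.0, 2.0, 2.0));
        return 0;
    }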
Figure 37. A popular way to listen to music
Ten or so years ago, a remarkable development in music recording
technology was the miniature music player, one popular example of which is
shown in Figure 37. Three developments made these players possible: high
density, low cost computer memory; miniature disk drives; and audio
compression technology.
Compression technology looks for clever ways to reduce the amount of
data needed to provide an acceptable rendition of the uncompressed
information. A popular format, “MP3” encoding, reduces a file to about 10% of
its original size. MP3 coding, as well as the compression schemes used in DVD
sound tracks, operate on sound in the frequency domain, instead of the time
domain, and use information about human perception to eliminate features
whose absence most listeners won’t notice. The transformation of sound from
time to frequency domain and back again forms the basis for many current
audio technologies, including the cell phone. This transformation is made
possible by the theories of mathematician and physicist Jean Baptiste Joseph
Fourier, a contemporary of Beethoven.
The plan for digital audio technology and the history of its
execution through the early 1980's is detailed in:
NAKAJIMA, H., OOI, T., FUKUDA, J. and IGA, A. Digital Audio Technology.
Blue Ridge Summit, PA: Tab Books, 1983.
To learn more about digital audio engineering, the DAT and the CD:
POHLMANN, KEN C. Principles of Digital Audio. Carmel, Indiana: Howard Sams,
1992.
CHAPTER VI
SYNTHESIS
If a sound recording can be represented as a stream of numbers, then a
sound recording can be created synthetically by computing a stream of numbers.
Any sound that can be represented by samples can be synthesized, if one simply
knows how.
Figure 38. A random process
We could toss a coin, like the one in Figure 38, and let the outcome set the
first bit in the first sample to a 1 or 0, then continue on to the next bit, etc.
Obviously, flipping a coin once per bit would take a long time, since there are 8
bits in one byte and about 10 million bytes in a minute of CD quality stereo sound.
Even if you flipped the coin once per second, had someone else write down the
result and worked 50 hours a week, it would take you 8 years and 6 months to
complete your coin-toss composition.
Figure 39. J. R. Pierce
What would it sound like? John Pierce, shown in Figure 39, the late Director
of Research in Communications Principles at Bell Telephone Laboratories, answers
in a chapter called "Information Theory and Art" in his book, An Introduction to
Information Theory, which appeared in its first edition in 1961. I justify the lengthy
quotation because I found it highly stimulating and thought provoking when I first
read it over 20 years ago:
Some years ago when a competent modern composer and professor of
music visited the Bell Laboratories, he was full of the news that musical
sounds and, in fact, whole musical compositions can be reduced to a series
of numbers. This was of course old stuff to us....
We had considered something that the composer didn't appreciate.
In order to represent fairly high-quality music, with a bandwidth of 15,000
cycles per second, one must use 30,000 samples per second, and each one
of these must be specified to an accuracy of perhaps one part in a thousand.
We can do this by using three decimal digits (or about 10 binary digits) to
designate the amplitude of each sample. A composer could exercise
complete freedom of choice among sounds by specifying a sequence of
30,000 three-digit decimal numbers a second. This would allow him to
choose from among a number of twenty-minute compositions which can be
written as 1 followed by 108 million 0's—an inconceivably large number.
Putting it another way, the choice he could exercise in composing would be
300,000 bits per second.
Here we sense what is wrong. We have noted that by the fastest
demonstrated means, that is, by reading lists of words as rapidly as possible,
a human being demonstrates an information rate of no more than 40 bits per
second. This is scarcely more than a ten-thousandth of the rate we have
allowed our composer. Further, it may be that a human being can make use
of, can appreciate, information only at some rate even less than 40 bits per
second. When we listen to an actor, we hear highly redundant English
uttered at a rather moderate speed. The flexibility and freedom that a
composer has in expressing a composition as a sequence of sample
amplitudes is largely wasted. They allow him to produce a host of
"compositions" which to any human auditor will sound indistinguishable and
uninteresting. Mathematically, white Gaussian noise, which contains all
frequencies equally, is the epitome of the various and unexpected. It is the
least predictable, the most original of sounds. To a human being, however, all
white Gaussian noise sounds alike. Its subtleties are hidden from him, and he
says that it is dull and monotonous.
If a human being finds monotonous that which is mathematically more
various and unpredictable, what does he find fresh and interesting? To call a
thing new, he must be able to distinguish it from that which is old. To be
distinguishable, sounds must be to a degree familiar.... Of course a composer
wants to be free and original, but he also wants to be known and
appreciated. If his audience can't tell one of his compositions from another,
they certainly won't buy recordings of many different compositions. If they
can't tell his compositions from those of a whole school of composers, they
may be satisfied to let one recording stand for the lot....
To use the analogy of language, the composer will write in a language
which the listener knows. He will produce a well-ordered sequence of musical
words in a musically grammatical order. The words may be recognizable
chords, scales, themes, or ornaments. They will succeed one another in the
equivalents of sentences or stanzas, usually with a good deal of repetition.
They will be uttered by the familiar voices of the orchestra. If he is a good
composer, he will in some way convey a distinct and personal impression to
the skilled listener. If he is at least a skillful composer, his compositions will
be intelligible and agreeable.
Pierce indirectly suggests that the samples must be clumped together in units
that have more meaning to composers and listeners, and that composing by
choosing sample values at random won't work. The two ideas are related because
any mechanism for choosing samples deterministically (non-randomly) requires a
higher level of sound organization. At the simplest level, most musical notes have
a characteristic amplitude history called an envelope, and each note has unique
properties of pitch and loudness. Each note also belongs to a family of timbres
specific to the instrument that plays the note. It is tempting to jump to the
conclusion that the basic unit of synthesis ought to be the note, and that notes
have the simple properties outlined above. Indeed, digital keyboards provide
exactly that sort of model and provide sounds that are certainly predictable and
familiar.
Figure 40. Edgard Varèse
Yet other composers have created compelling and distinctive compositions
that use organizational elements more broadly defined than the humble
"note." Edgard Varèse, shown with a tape recorder in Figure 40, worked with
"sound-objects," and other composers looked to electronic means to transform
the sounds of nature in an interesting way. These methods are more analogous to
sculpture or photography than Pierce's literary association of word and note. The
freedom of creating sound by computing numbers is an awesome thing and many
programmers begin by attempting to model or mimic a pre-existing instrument or
device including analog synthesizers, which contain several kinds of sound
modules. These modules include oscillators, envelope generators, amplifiers, filters
and performer interfaces such as a keyboard. Let's take a slightly closer look at
these "black boxes."
SYNTHESIS MODULES
The oscillator produces periodic waveforms with a programmable amplitude
and frequency. The first analog synthesizers provided a choice of several simple
waveforms, including the sine, sawtooth, triangle and pulse waves. These
waveforms are easily distinguished from one another and are easy to generate
with analog components. To create a digital oscillator a programmer can store one
or more waveforms in computer memory and read the samples cyclically to create
a periodic function. Frequency control in digital oscillators can be achieved either
by varying the sample rate or by skipping or repeating samples according to a
sample increment. The increment need not be a whole number as long as some
method keeps track of the fractional part of the sample increment. Amplitude
control is achieved by multiplying the sample found in the waveform "look-up
table" by a scaling value.
Envelope generators provide a non-periodic (transient) function of time
that is useful for controlling the non-periodic aspects of a note such as its
loudness contour or timbral evolution. Analog synthesizers offered several stages
of envelope control, typically time constants to control the attack time (elapsed
time from the beginning of the note to the first peak) the decay time (elapsed
time from the first peak to the beginning of the sustaining portion of the sound)
and the release time (elapsed time from the end of the sustaining portion to
silence.) The sustain level is usually a percentage of the first peak value. The
analog envelope generator (often called an ADSR, for Attack-Decay-Sustain-
Release) produced a voltage that could be used to control the gain of an amplifier,
the cutoff frequency of a filter, or the pitch of an oscillator. The envelope was
initiated by trigger and gate signals from an external source such as a keyboard. It
is simple to program a routine to provide digital control patterns, so programmers
created more complex envelope controllers.
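A digital envelope can be as simple as a function of the time elapsed within
the note. Below is a sketch of a linear attack-decay-sustain-release contour;
the segment times and sustain fraction are illustrative, not fixed by any
standard:

    /* Linear ADSR sketch: a gain between 0 and 1 at time t seconds into
       a note of the given duration.  Assumes attack + decay fit before
       the release portion begins. */
    double adsr(double t, double duration, double attack,
                double decay, double sustain, double release)
    {
        double release_start = duration - release;
        if (t < 0.0 || t >= duration)
            return 0.0;
        if (t < attack)                    /* rise to the first peak      */
            return t / attack;
        if (t < attack + decay)            /* fall to the sustain level   */
            return 1.0 - (1.0 - sustain) * (t - attack) / decay;
        if (t < release_start)             /* hold the sustain level      */
            return sustain;
        return sustain * (duration - t) / release;   /* release to silence */
    }

Multiplying each oscillator sample by this gain imposes the loudness contour,
exactly the role of the voltage-controlled amplifier described next.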
Analog voltage-controlled amplifiers used a control voltage produced by
an envelope generator to vary the amplitude of a signal produced by an oscillator.
Other amplifiers controlled the relative amplitude of several signals as they were
mixed together. In the digital domain, a signal can be scaled by an envelope
function by simple multiplication, and scaled signals mixed together by addition.
The analog voltage-controlled filter produced time-varying timbres. A
typical synthesizer arrangement routed an overtone-rich source such as a sawtooth
oscillator into a voltage-controlled low-pass filter and a voltage-controlled
amplifier. An analog filter of this sort would often feature a variable cutoff
frequency, and a feedback circuit that could create a sharp resonance peak at the
cutoff frequency. Digital filtering involves more sophisticated programming and
calculation than do oscillators and envelope generators, and so did not appear in
the first digital keyboards, although composers working in software
synthesis had an extensive library of filters with which to work.
The keyboard in an early analog synthesizer was a series of switches that
divided a control voltage into steps that could produce a chromatic scale when
routed to an oscillator. The keyboard also produced trigger and gate voltages to
control the starting and stopping of an envelope generator. The first keyboards
were monophonic, producing only one control voltage at a time, regardless of the
number of keys depressed. Keyboards in digital synthesizers produce numeric
codes instead of control voltages and are polyphonic, sensing many simultaneous
key depressions. Composers using software means need not be restricted by the
keyboard controller model and can provide data to the program at any time and in
any form.
SYNTHESIS TOPOLOGIES
Using the basic units described above, a number of different synthesis
schemes can be implemented using digital computers, D/A converters and
smoothing filters. In a real-time system the samples are computed faster than the
sample rate and sent directly to the D/A converter. In non-real-time systems the
samples must be stored in large files for later conversion to sound.
Some interesting synthesis topologies include fixed waveform synthesis,
additive synthesis, subtractive synthesis, modulation synthesis and synthesis by
analysis. This is far from a comprehensive list since new methods and
improvements to existing methods continue to be developed in corporate
laboratories, university studios and on people's desktops around the world.
Figure 41. Fixed Waveform Synthesis
In fixed waveform synthesis, illustrated schematically in Figure 41, the
amplitude, frequency and duration of each note are passed to the appropriate
envelope generators that create amplitude and frequency envelopes. The oscillator
uses a stored waveform that may have many partials. This topology was one of
the first used for digital synthesis and is both easy to understand and efficient to
compute. The principal shortcomings of fixed waveform synthesis are that tones
of medium to long duration sound artificially static, because the spectrum does not
change over the course of the note, and that the overtones must be strictly
harmonic to be stored together in a simple table. However, it is possible to cross-
fade among several waveforms during the course of a note, and to store many
waveforms so that different notes can have different timbres. Modern sampling
instruments might store several dozen cycles recorded from an actual instrument
instead of a single waveform, allowing for some variation in partial tuning.
Wavetables are usually specified as a list of frequencies and amplitudes rather than
a list of sample values in order to know the highest frequency component and thus
avoid aliasing.
Figure 42. Additive Synthesis
In additive synthesis, shown in Figure 42, a separate oscillator with an
independent amplitude and frequency envelope creates a sine wave for each
partial of the sound, permitting inharmonic partials and continual spectral
evolution, but requiring extra computation time. Also, a significant quantity of
input data has to be provided either directly by the composer or through
additional programming. The system pictured above will produce only five partials;
consequently, it will probably produce dull-sounding timbres at lower frequencies. It
is possible to isolate the components of the sound that require inharmonic partials
and store the other partials in a single wavetable that could be mixed with sine
partials, and also possible to simplify the envelope computation by sharing scaled
envelope values among related partials.
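In C, the heart of additive synthesis is nothing more than a sum of sines,
one per partial. The sketch below uses fixed frequencies and amplitudes; in
practice each would be shaped by its own envelope from sample to sample:

    #include <math.h>

    #define PI 3.14159265358979
    #define PARTIALS 5

    /* One output sample, summing PARTIALS sine waves at time t seconds. */
    double additive_sample(const double freq[], const double amp[], double t)
    {
        double sum = 0.0;
        for (int i = 0; i < PARTIALS; i++)
            sum += amp[i] * sin(2.0 * PI * freq[i] * t);
        return sum;
    }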
The author has taken another approach to additive synthesis that calls for
specification of spectral values at a level higher than the sample but lower than
the note in a series of data chunks called frames. In each frame a duration value is
followed by a list of frequencies and amplitudes. The program reads the data list
and interpolates between successive frames to create an output sample file. In
spite of the somewhat cumbersome data list required, the method offers an easy
interface to frequency space and a long list of utilities that can be used to
manipulate the data files.
Figure 43. Subtractive Synthesis
A source rich in frequencies is sent to one or more filters that shape the
spectrum by attenuating certain frequency bands, hence the name, Subtractive
Synthesis. As shown in Figure 43, the source could be noise or a special type of
pulse generator that produces all partials with frequencies up to but not exceeding
the Nyquist limit. Subtractive synthesis is an effective topology for producing
voice-like sounds. A digital filter is surprisingly simple to program: averaging every
two adjacent samples produces a simple low-pass filter, which lets low frequencies
pass unimpeded but attenuates frequencies above the cutoff point. Taking the
difference of every two adjacent samples produces a high-pass filter, which
attenuates low frequencies but passes frequencies above the cutoff point. More
sophisticated filters require more calculations per sample and the internal
representation of the sample values may need to be much more accurate than the
final sample resolution, so careful planning is needed with filters.
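Both one-sample filters fit on a line of C apiece (sketches; the 0.5 factor
simply keeps the output from exceeding the input range):

    /* y[n] from the current and previous input samples.  Averaging passes
       low frequencies (low-pass); differencing passes high (high-pass). */
    double lowpass(double x, double x_prev)  { return 0.5 * (x + x_prev); }
    double highpass(double x, double x_prev) { return 0.5 * (x - x_prev); }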
Figure 44. John Chowning
FM synthesis came about through the research and experimentation of
Stanford University composer John Chowning, pictured in Figure 44. In the 1970's,
he was exploring faster ways of producing computer sound with digital synthesis.
Until then, people using computers for synthesis would write programs, submit
these programs to a mainframe computer, receive a reel of tape containing the
computer outputs, and play back this tape on a real-time system that converted
the information into analog form. This equipment was very expensive, and too
much time elapsed between computation and listening. Chowning was interested
in computing sound samples in real time, so that a computer could compute the
samples fast enough to send them directly to a D/A converter. The
problem in using the fastest algorithms (stored waveform synthesis) was the
bland quality of the static waveforms. Musical sounds are not static, but creating
dynamic waveforms meant additional computation time.
Figure 45. Frequency Modulation Synthesis
Chowning experimented with frequency modulation similar to the process
used in FM radio transmission. Frequency modulation, like vibrato, is the periodic
fluctuation of a carrier frequency at a modulation rate. Like vibrato, the
modulation has two attributes: depth and frequency. As shown in Figure 45, an
oscillator has amplitude and frequency inputs enabling it to produce a sine wave.
Connecting a second sine oscillator to the first will enable modulation of the first
sine wave. The frequency of the modulating oscillator interacts with the frequency
of the first, while the amplitude of the modulating oscillator determines the depth
of modulation, or frequency deviation. Frequency modulation can create a rich
spectrum with just two sine oscillators, but the precise calculation of the spectrum
is complicated, and real-time synthesis equipment is useful for making fine
adjustments to the FM sound. The degree of modulation in FM synthesis is called
the Index of Modulation; it is equal to the frequency deviation divided by the
frequency of modulation, as expressed by the formula:
I = d/fm
The frequencies created by the modulation process are called “sidebands.” To
determine their frequencies, combine the sum and difference of the carrier
frequency and the integer multiples of the modulating frequency according to the
formula:
fc ± Kfm
(where fc stands for the carrier frequency, fm for the frequency of modulation,
and K = 0, 1, 2, etc.)
To see how these sidebands are distributed, consider Figure 46. Assume that
fc = 100 Hz and fm = 200 Hz and calculate the sidebands.
K fc + Kfm fc - Kfm
------------------------------------------
0 100 Hz 100 Hz
1 300 Hz -100 Hz
2 500 Hz -300 Hz
3 700 Hz -500 Hz
4 900 Hz -700 Hz
5 1100 Hz -900 Hz
etc.
Figure 46. Typical Distribution of Sidebands in FM Synthesis (fm/fc=2)
The calculation shows the first few frequencies possible in the output
spectrum. The spectrum has a fundamental frequency of 100 Hz with odd-
numbered partials. The amplitudes of the sideband frequencies have not yet been
calculated, but they may be positive, negative, or zero. Also, note the many negative
frequencies: a negative frequency is like a positive frequency with its amplitude
reversed. Those partials with the same frequency but opposite amplitude will
cancel one another.
When the carrier and modulating frequency are the same, the even partials
are also present. Now consider Figure 47, where fc = fm = 100 Hz:
K fc + Kfm fc - Kfm
-----------------------------------------
0 100 Hz 100 Hz
1 200 Hz 0 Hz
2 300 Hz -100 Hz
3 400 Hz -200 Hz
4 500 Hz -300 Hz
5 600 Hz -400 Hz
etc.
Figure 47. Typical Distribution of Sidebands in FM Synthesis (fm/fc = 1)
When fc=fm=100, the fundamental frequency is still 100 Hz, but both even
and odd partials are present. The ratio of the modulator to the carrier is a major
determinant of the resulting spectrum. Ratios other than simple integers produce
inharmonic, metallic spectra. The amplitudes of the sidebands are determined
using Bessel functions (found in standard function tables). The complex shape of
the various Bessel functions and the cancellation of sidebands by one another
create a complicated result that has been used to good effect in numerous
compositions. The efficiency of the FM algorithm and its rich timbral possibilities
contributed to the success of one of the first digital keyboards, Yamaha's DX-7,
shown in figure 48.
Figure 48. The Yamaha DX-7 digital synthesizer
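The whole technique fits in a few lines of C. The sketch below computes one
sample of a simple two-oscillator pair; as in Chowning's formulation, the
modulation is applied inside the sine (phase modulation), and the index sets
the peak deviation d = I x fm:

    #include <math.h>

    #define PI 3.14159265358979

    /* One sample of simple FM: carrier fc, modulator fm, index of
       modulation index, at time t seconds. */
    double fm_sample(double amp, double fc, double fm,
                     double index, double t)
    {
        return amp * sin(2.0 * PI * fc * t
                         + index * sin(2.0 * PI * fm * t));
    }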
SYNTHESIS BY ANALYSIS
These sophisticated methods begin by analyzing an existing soundfile to
generate a set of parameters used to reconstruct the soundfile, usually with
modification. Although a description of the methods lies outside the scope of this
already lengthy chapter, note that these synthesis methods offer the potential for
varying the speed of a passage without changing its pitch and vice-versa. There
are other interesting possibilities for transformation by operating on the
intermediate data, including cross-synthesis, where the characteristics of one sound
are impressed onto another, or where one sound is gradually transformed into
another. Most of these methods work in the frequency domain, including the
Phase Vocoder and Linear Predictive Coding, and are described in careful detail by
Dodge and Moore (see below.)
DODGE, CHARLES, AND JERSE, THOMAS. Computer Music: Synthesis, Composition,
and Performance. New York: Schirmer Books, 1985.
MATHEWS, M.V. The Technology of Computer Music. Cambridge, MA: MIT Press,
1969.
MOORE, F.R. Elements of Computer Music. Englewood Cliffs, NJ: Prentice-Hall,
1990.
PIERCE, JOHN R. An Introduction to Information Theory. New York: Dover
Publications, 1980.