
Sonographic Sound Processing
Introduction

The purpose of this article is to introduce the reader to the subject of Real-Time Sonographic Sound Processing. Since sonographic sound processing is based on windowed analysis, it belongs to the digital world. Hence we will rely on digital concepts of audio, image and video throughout this article.
 
Sonographic sound processing is all about converting sound into an image, applying graphical effects to the image, and then converting the image back into sound.
 
When thinking about sound-image conversion we stumble across two inherent problems:

- The first is that images are static while sounds cannot be static. Sound is all about pressure fluctuations in some medium, usually air, and as soon as the pressure becomes static, the sound ceases to exist. In the digital world, audio signals that may later become audible sounds are all a consequence of constantly changing numbers.
- The second is that an image, in terms of data, exists in 2D space, while sound exists in only one dimension. A grayscale image, for instance, can be understood as a function of pixel intensity or brightness over two-dimensional space, that is, the width and height of the image. An audio signal, on the other hand, is a function of amplitude over only one dimension, which is time (see fig. 1 and fig. 2).
 
 
 
 
Figure 1. A heavily downsampled image (10x10 pixels) of a black spot with a gradient on a white background. On the right-hand side is the numerical representation of the image: a two-dimensional array, or matrix.
 
 
Figure 2. A heavily downsampled piece of a sine wave (10 samples, time domain). On the right-hand side is the numerical representation of the waveform: a one-dimensional array.
 
 
At this point we see that we could easily transform one column or one row of an image into sound. If our image had a width of 44100 pixels and a height of 22050 pixels, and our sampling rate were set at 44.1 kHz, one row would give us exactly one second of sound while one column would give us exactly half a second of sound. Instead of columns and rows we could also follow diagonals or any other arbitrary trajectories. As long as we read the numbers (i.e. switch between pixels and read their values) at the speed of the sampling rate, we can precisely control the audio with the image. That is the basic concept of wave terrain synthesis, which can also be understood as an advanced version of wavetable synthesis, where the reading trajectory “defines” its momentary wavetable.
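To make the idea concrete, here is a minimal numpy sketch of reading an image as audio; the image, its size and the trajectories are hypothetical and only stand in for the pixel-reading described above.

```python
import numpy as np

sr = 44100                                    # audio sampling rate in Hz
# hypothetical grayscale image, pixel values already scaled to [-1.0, 1.0]
height, width = 512, 44100
image = np.random.uniform(-1.0, 1.0, (height, width)).astype(np.float32)

# one row read at the sampling rate: width / sr seconds of sound (1 s here)
row_sound = image[0, :]
# one column read at the sampling rate: height / sr seconds (~11.6 ms here)
col_sound = image[:, 0]

# an arbitrary trajectory, e.g. the main diagonal, read pixel by pixel
n = min(height, width)
diag_sound = image[np.arange(n), np.arange(n)]
```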
 
So the answer to the first problem is that we have to read an image gradually in order to hear it. If, for instance, we wanted to hear “the sound of my gran drinking a cup of coffee”, we would read each successive column from bottom to top and in that manner progress through the whole width of the picture (*as mentioned earlier, the reading trajectory is arbitrary as long as every pixel of the image is read exactly once). That would translate an image, which is frozen in time, into a sound with some limited duration. But since we would like to process some existing sound with graphical effects, we first need to generate the image out of sound.
 
What we could try for a start is to chop the “horizontal” audio into slices and place them vertically so they become columns. That would give us an image made out of sound, and as we read the columns successively from bottom to top we would actually progress through the picture from left to right (fig. 3), which would give us back our initial sound.
 
 
 
Figure 3. Chopping an audio signal into slices and placing them vertically as columns of a matrix, where each matrix cell represents a pixel and an audio sample at the same time. By reading successive columns vertically we progress through the image from left to right.
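As an illustration, the following numpy sketch (my own, not code from the article) chops a signal into slices, stacks them as columns and reads them back unchanged:

```python
import numpy as np

sr = 44100
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 440 * t).astype(np.float32)   # one second of a test tone

slice_len = 512                                # height of the image in pixels
n_slices = len(audio) // slice_len             # width of the image in pixels
# each slice becomes one column of the matrix (column-major order)
image = audio[:n_slices * slice_len].reshape(slice_len, n_slices, order='F')

# reading the columns one after another restores the original samples
reconstructed = image.reshape(-1, order='F')
assert np.array_equal(reconstructed, audio[:n_slices * slice_len])
```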
 
 
At that point we would have an image made out of audio data and we could easily play it back. But that kind of image would not give us any easily understandable information about the audio itself – it would look like some kind of graphical noise (fig. 4a). Hence we would not be able to control the audio output with graphical effects, since we would not understand what in the picture reflects which audio parameters. Of course we could still modify the audio data with graphical effects, but again, the outcome would exist only within the frame of noise variations. One of the reasons for that lies in the fact that both our X and Y axes would still represent only one dimension – time.
 
 
Spectrogram & FFT
 
 
In order to give some understandable meaning to the audio data we need to convert the audio signal from the time domain into the frequency domain. That can be achieved via the windowed Fast Fourier Transform (FFT) or, more accurately, the short-time Fourier transform (STFT). If we apply the STFT to our vertical audio slices, the result is a mirrored spectrogram (fig. 4b). That kind of graphical audio representation can be easily interpreted, but at the same time half of it is unwanted data – the mirrored part. In order to keep only the information we need, we can simply remove the unneeded part (fig. 6), which in a way happens by default in Max’s pfft~ object. In other words, just as we have to avoid using frequencies above Nyquist in order to prevent aliasing in the digital world, it is pointless to analyze audio all the way up to the sampling-rate frequency, since everything above the Nyquist frequency will always give us wrong results in the form of a mirrored spectrum.
 
 
 
 
Figure 4a (left). Vertically plotted slices of audio (a few words of speech) as time-domain waveforms seen from a “bird’s-eye view”.
Figure 4b (right). Spectrogram: the same vertically plotted audio slices transformed into the frequency domain via STFT.
 
The first step of the STFT, which is the most common and useful form of windowed FFT analysis, is to apply a volume envelope to a time slice. The volume envelope can be a triangle or any symmetrical curve such as a Gaussian, Hamming, etc. In the world of signal analysis this volume envelope is called a window function, and the result – the time slice multiplied by the window function – is called a window or a windowed segment. In the last stage of the STFT, the FFT is calculated for the windowed segment (fig. 5). At this point, the FFT should be considered only as a mathematical function that translates windowed fragments of time-domain signals into frequency-domain signals.
 
 
 
 
Figure 5. A time slice multiplied by a window function (Roads, 1996, p. 550).
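A minimal numpy sketch of these two steps (the window choice and sizes are arbitrary examples, not values prescribed by the article):

```python
import numpy as np

fft_size = 1024
window = np.hanning(fft_size)                  # any symmetrical window function
time_slice = np.random.randn(fft_size)         # stands in for one slice of audio

windowed = time_slice * window                 # the "windowed segment"
spectrum = np.fft.rfft(windowed)               # keeps only the non-mirrored half
# note: rfft returns fft_size//2 + 1 bins (from DC up to and including Nyquist)
amplitude = np.abs(spectrum)                   # one column of the amplitude spectrogram
phase = np.angle(spectrum)                     # one column of the phase spectrogram
```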
 
 
In order to perform the STFT, windows have to consist of exactly 2^N samples. That results in familiar numbers such as 128, 256, 512, 1024, 2048, 4096, etc. These numbers represent the FFT size or, in our case, twice the height of the spectrogram image in pixels (fig. 6). At this point we can also adopt the terminology of the FFT world and call each column of the spectrogram matrix an FFT frame and each pixel inside a column an FFT bin. Each useful half of an FFT frame therefore consists of 2^N / 2 FFT bins, where 2^N is the number of samples in the windowed segment. The number of FFT frames corresponds to the number of pixels in the horizontal direction of the image (spectrogram), while half the number of FFT bins corresponds to the number of pixels in the vertical direction. Hence: spectrogram image width = number of FFT frames; spectrogram image height = number of FFT bins divided by 2.
 
 
 
 
Figure 6. The height of the spectrogram (the useful half of each FFT frame) is half the length of the extracted time segment.
 
 
If our goal were only to create a spectrogram for display, we would convert the linear frequency scale of the spectrogram into a logarithmic one in order to get a better view of the most important part of the spectrum. We would also downsample or “squash” the height of the spectrogram image, since the image might otherwise be too tall (a 4096-sample window gives us 2048 vertical pixels on the spectrogram). Both actions would result in significant data loss that we cannot afford when preparing the ground for further spectral transformations. Hence, if we want to preserve all the data obtained during the FFT analysis, our linear spectral matrix should remain intact. Of course it would be sensible to duplicate and remap the spectral data to create a user-friendly interactive sonogram, through which one could interact with the original spectral matrix, but that would complicate things even further, so it will not be covered in this article. For now, the image of our spectrogram should consist of all the useful data obtained during the FFT analysis.
 
It is also important to emphasize that the window size defines the lowest frequency that the FFT can detect in a time-domain signal. That frequency is called the fundamental FFT frequency. As we know, frequency and period are inversely related, therefore very low frequencies have very long periods. If the period is longer than our chosen window, so that it cannot fit inside the window, it cannot be detected by the FFT algorithm. Hence we need large windows in order to analyze low frequencies. Large windows also give us a good frequency resolution of frequency-domain signals. The reason for that lies in the harmonic nature of the FFT.
 
The FFT can be considered as a harmonically rich entity that consists of many harmonically related sine waves. The first sine wave is the FFT fundamental. For instance, if we choose a window of 2^9 = 512 samples at a sampling rate of 44.1 kHz, our FFT fundamental is 86.13 Hz (44100 Hz / 512 = 86.13 Hz), and all the following harmonics have a frequency of N × (FFT fundamental), where N is an integer. All these virtual harmonics are basically FFT probes: the FFT compares the time-domain signal with each probe and checks whether that frequency is also present in the tested signal. Hence a smaller FFT fundamental, as a consequence of a larger window, means that we have many more FFT probes in the range from the fundamental all the way up to the Nyquist frequency. That is why large windows give us a good frequency resolution.
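The relationship is easy to check numerically; a small sketch with values chosen to match the example above:

```python
import numpy as np

sr = 44100
window_size = 512                        # 2**9 samples
fft_fundamental = sr / window_size       # the lowest detectable frequency
# the FFT "probes" are integer multiples of the fundamental, up to Nyquist
probe_freqs = fft_fundamental * np.arange(window_size // 2 + 1)

print(round(fft_fundamental, 2))         # 86.13
print(np.round(probe_freqs[1:4], 2))     # [ 86.13 172.27 258.4 ]
print(probe_freqs[-1])                   # 22050.0, the Nyquist frequency
```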

So if an FFT probe detects its test frequency in the time-domain signal, it tells us how strong that component is (amplitude) and what its phase is. Since the number of probes always equals the number of samples in the chosen time window, we get two pieces of information – amplitude and phase – for each sample of the time-domain audio signal. Hence the full spectrogram from figure 4b, without the unwanted mirrored part, actually looks like this (fig. 7):

Figure 7. Graphical representation of the amplitude (left) and phase (right) information of a frequency-domain signal.
 
 
We see that the phase part of the spectrogram does not give us any meaningful information (*when we simply look at it) and is therefore usually not displayed. But as we will see later, the phase information is very important when processing spectral data and transforming it back into the time domain via the inverse STFT and overlap-add (OA) resynthesis.
 
As we said, the size of the window defines the FFT fundamental and hence the frequency resolution of the FFT. Therefore we need big windows for good frequency resolution. But here comes the ultimate problem of the FFT: frequency and time resolution are in an inverse relationship. FFT frames (analyzed windows) are actually snapshots frozen in time. There is no temporal information about the analyzed audio inside a single FFT frame. Therefore an audio signal in the frequency domain can be imagined as successive frames of a film. Just as each frame of a film is a static picture, each FFT frame presents, in a way, a “static” sound. And if we want to hear a single FFT frame we need to loop through it constantly, just as a digital oscillator loops through its wavetable. The sound of an FFT frame is therefore “static” in the same way that the sound of an oscillator with a constant frequency is “static”. For good temporal resolution we therefore need to sacrifice frequency resolution, and vice versa.
 
In practice, the STFT is always performed with overlapping windows. One function of overlapping windows is to cancel out the unwanted artifacts of amplitude modulation that occur when applying a window to a time slice. As we can see from figure 8, an overlap factor of 2 is sufficient for that job: the sum of the overlapping window amplitudes is constantly 1, which cancels out the effect of amplitude modulation. Hence we can conclude that our spectrogram from figure 7 is not very accurate, since we have not used any overlap.
 
 
 
 
Figure 8. Windowing with triangular window functions at overlap factor 2. The sum of the overlapping window amplitudes is constantly 1, which cancels out the effect of amplitude modulation.
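This constant-sum property can be verified directly; a small numpy sketch (my own illustration) using a triangular window and a hop of half the window length:

```python
import numpy as np

fft_size = 8
hop = fft_size // 2                      # overlap factor 2
n = np.arange(fft_size)
# triangular window whose shifted copies sum to exactly 1
window = np.where(n < hop, n / hop, (fft_size - n) / hop)

# overlap-add the window itself over several hops and inspect the sum
total = np.zeros(6 * hop + fft_size)
for k in range(6):
    total[k * hop : k * hop + fft_size] += window
print(total[fft_size : 5 * hop])         # steady-state region: all ones
```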
 
 
 
 
Another role of overlapping is to increase the time resolution of frequency-domain signals. According to Curtis Roads, “an overlap factor of eight or more is recommended when the goal is transforming the input signal” (Roads, 1996, p. 555).
 
When using, for instance, an overlap of 8, the FFT produces 8 times more data than when using no overlap – instead of one FFT frame we get 8 FFT frames. Therefore, if we wanted to present the same amount of time-domain signal in a spectrogram as in figure 6, we would need an 8 times wider image (spectrogram). Also, the sampling rate of all the FFT operations has to be overlap-times higher than in the time-domain part of our patch or program (*in case we want to process all the data in real time).
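The bookkeeping is simple; a small sketch of how the frame count (and hence the spectrogram width) grows with the overlap factor (the concrete numbers are just an example):

```python
sr = 44100
fft_size = 1024
duration_samples = 2 * sr                 # two seconds of time-domain audio

for overlap in (1, 2, 8):
    hop = fft_size // overlap             # samples between successive analysis frames
    n_frames = 1 + (duration_samples - fft_size) // hop
    print(overlap, hop, n_frames)         # overlap 8 gives roughly 8x more frames
```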
 
Now that we have introduced the concept of the spectrogram, it is time to take a look at the central tool of sonographic sound processing: the phase vocoder.
 
 
Phase Vocoder
 
 
The phase vocoder (PV) can be considered an upgrade to the STFT and is consequently a very popular analysis tool. The added benefit is that it can measure a frequency deviation from a bin’s center frequency, as noted by Dodge and Jerse (1997, p. 251). For example, if an STFT with a fundamental frequency of 100 Hz is analyzing a 980 Hz sine tone, the FFT algorithm shows the greatest energy in the bin with index 10, in other words at the frequency 1000 Hz. The PV, on the other hand, is able to determine that the greatest energy is concentrated 20 Hz below the 1000 Hz bin, giving us the correct result of 980 Hz.

The calculation of this frequency deviation is based on the phase differences between successive FFT frames for a given FFT bin. In other words, phase differences are calculated between neighboring pixels in each row of the spectral matrix containing the phase information (fig. 7, right). In general, a phase vocoder does not store spectral data in the form of a spectrogram but in the form of a linear stereo buffer (one channel for phase and the other for amplitude).
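A minimal sketch of that standard phase-vocoder step – estimating each bin’s true frequency from the phase difference between two successive frames. The function and variable names are my own, not code from the article:

```python
import numpy as np

def true_bin_freqs(phase_prev, phase_curr, fft_size, hop, sr):
    """Estimate the true frequency of each bin from the phase difference
    between two successive FFT frames."""
    bins = np.arange(len(phase_curr))
    center_freqs = bins * sr / fft_size                # nominal bin (probe) frequencies
    expected = 2 * np.pi * bins * hop / fft_size       # phase advance expected at the center frequency
    delta = phase_curr - phase_prev - expected
    delta = np.mod(delta + np.pi, 2 * np.pi) - np.pi   # wrap the deviation into [-pi, pi)
    deviation_hz = delta * sr / (2 * np.pi * hop)      # convert phase deviation to Hz
    return center_freqs + deviation_hz
```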

Phase hence contains structural information, or information about the temporal development of a sound. “The phase relationships between the different bins will reconstruct time-limited events when the time domain representation is resynthesized” (Sprenger, 1999). A bin’s true frequency therefore enables the reconstruction of the time-domain signal on a different time basis (1999). In other words, the phase difference, and consequently the running phase, is what enables smooth time stretching or time compression in a phase vocoder. Time stretching or compression, in the case of our phase vocoder with a spectrogram as an interface, is its ability to read the FFT data (spectrogram) at various reading speeds while preserving the original pitch.

Since the inverse FFT demands phase rather than phase difference for signal reconstruction, the phase differences have to be summed back together; the result is called a running phase. If there is no time manipulation (*reading speed = the speed of recording), the running phase for each frequency bin equals the phase values obtained straight after the analysis. But for any other reading speed the running phase differs from the initial phases and is responsible for smooth signal reconstruction. Taking phase into consideration when resynthesizing time-stretched audio signals is the main difference between the phase vocoder and synchronous granular synthesis (SGS).
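Continuing the sketch above, the running phase can be accumulated from the bins’ true frequencies at whatever output hop the chosen reading speed implies (again an illustrative sketch, not the article’s implementation):

```python
import numpy as np

def running_phases(true_freqs_per_frame, hop_out, sr):
    """Accumulate a running phase for resynthesis, one phase array per frame.
    true_freqs_per_frame: 2D array (n_frames x n_bins) of true bin frequencies."""
    phase = np.zeros(true_freqs_per_frame.shape[1])
    out = []
    for frame_freqs in true_freqs_per_frame:
        # phase advance over one *output* hop at each bin's true frequency
        phase = phase + 2 * np.pi * frame_freqs * hop_out / sr
        out.append(np.mod(phase + np.pi, 2 * np.pi) - np.pi)
    return np.array(out)
```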

As soon as we have a phase vocoder with a spectrogram as an interface, and our spectrogram reflects the actual FFT data, so that the pixels represent all the useful FFT data (*we can ignore the mirrored spectrum), we are ready to perform sonographic sound processing. We can record our spectral data in the form of two grayscale spectral images (amplitude and phase), import them into Photoshop or any similar image-processing software, and play the modified images back as sounds. We just need to know which image sizes correspond to our chosen FFT parameters. But when we perform spectral sound processing in a real-time programming environment such as Max/Jitter, we can modify our spectrograms “on the fly”. Jitter offers us endless possibilities for real-time graphical manipulation, so we can tweak our spectrogram with graphical effects just as we would tweak synth parameters in real time. This kind of real-time sonographic interaction is very rare in the world of sonographic sound processing. In fact, I am not aware of any commercial sonographic processing product on the market that enables real-time user interaction.

Another important thing when modifying sound graphically is to preserve the bit resolution of the audio while processing the images. Audio signals have an accuracy of 32-bit floating point or more. A single channel of the ARGB color model, on the other hand, has a resolution of only 8 bits, and since we are using grayscale images, we use only one channel. Hence we need to process our spectrogram images only with objects or tools that are able to work with 32- or 64-bit floating-point numbers.
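The loss is easy to demonstrate; a small sketch comparing 32-bit float spectral amplitudes with the same values squeezed into one 8-bit channel (an illustration, not a Max/Jitter patch):

```python
import numpy as np

# spectral amplitudes as 32-bit floats (what the audio side works with)
amps = np.random.uniform(0.0, 1.0, 1024).astype(np.float32)

# storing them in a single 8-bit image channel leaves only 256 levels
amps_8bit = np.round(amps * 255).astype(np.uint8)
restored = amps_8bit.astype(np.float32) / 255.0

print(np.max(np.abs(amps - restored)))    # quantization error up to about 0.002
```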

When doing spectral processing it is also sensible to consider processing the spectral data on the GPU, which, with its parallel-processing abilities, is much more powerful and faster than the CPU. One might think that all image processing takes place on the GPU, but in many cases that is not correct: many graphical effects actually use the CPU to scale and reposition the image data. So the idea is to transfer the spectral data from the CPU to the GPU, perform as many operations as possible on the GPU by using various shaders, and then transfer the spectrogram back to the CPU, where it can be converted back into a time-domain signal.

The only problematic link in the chain, when transferring data from the CPU to the GPU and back, is the actual transfer to and from the graphics card, which slows the whole process down to a certain extent. Therefore we should have only one CPU-GPU-CPU transfer in the whole patch; once on the GPU, all the desired OpenGL actions should be executed. We also have to be careful not to lose our 32-bit resolution in the process, which happens in Max by default when going from the GPU back to the CPU, because Jitter assumes that we want something in ARGB format from the GPU.

At the end of this article I should also mention one very useful method, described by J. F. Charles (Charles, 2008): interpolation between successive FFT frames. As we said earlier, FFT frames are like frames of a movie, so when reading a spectrogram back we progress through it by jumping from one FFT frame to the next. And just as we notice the switching between successive still pictures when playing a video back very slowly, we notice the switching between successive FFT frames (“static” loops) when reading a spectrogram back very slowly. That is known as the frame effect of the phase vocoder. Hence, in order to achieve a high-quality read-back when doing extreme time stretching, we can constantly interpolate between two successive FFT frames and read only the interpolated one.
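In its simplest form this is a linear cross-fade of the amplitude data between the two frames that surround the current read position (a simplified sketch of the idea, with phase handled separately by the running phase; the names are my own):

```python
import numpy as np

def interpolated_frame(amp_a, amp_b, frac):
    """Linear interpolation between the amplitude spectra of two successive
    FFT frames; frac in [0, 1] is the read position between frame a and frame b."""
    return (1.0 - frac) * amp_a + frac * amp_b

# e.g. a read position of 12.25 frames uses frames 12 and 13 with frac = 0.25
```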

References:

Charles, J. F. 2008. A Tutorial on Spectral Sound Processing Using Max/MSP and Jitter. Computer Music Journal 32(3), pp. 87–102.

Dodge, C. and Jerse, T. A. 1997. Computer Music: Synthesis, Composition and Performance. 2nd edition, New York: Thomson Learning.

Roads, C. 1996. The Computer Music Tutorial. Cambridge, Massachusetts: The MIT Press.

Sprenger, S. M. 1999. Pitch Scaling Using The Fourier Transform. Audio DSP pages. [Online Book]. Available: http://docs.happycoders.org/unsorted/computer_science/digital_signal_processing/dspdimension.pdf [Accessed 5 August 2010].

Tadej Droljc, spring 2013
