Backdoor: Making Microphones Hear Inaudible Sounds: Nirupam Roy, Haitham Hassanieh, Romit Roy Choudhury
ABSTRACT
Consider sounds, say at 40kHz, that are completely outside the human's audible range (20kHz), as well as a microphone's recordable range (24kHz). We show that these high frequency sounds can be designed to become recordable by unmodified microphones, while remaining inaudible to humans. The core idea lies in exploiting non-linearities in microphone hardware. Briefly, we design the sound and play it on a speaker such that, after passing through the microphone's non-linear diaphragm and power-amplifier, the signal creates a “shadow” in the audible frequency range. The shadow can be regulated to carry data bits, thereby enabling an acoustic (but inaudible) communication channel to today's microphones. Other applications include jamming spy microphones in the environment, live watermarking of music in a concert, and even acoustic denial-of-service (DoS) attacks. This paper presents BackDoor, a system that develops the technical building blocks for harnessing this opportunity. Reported results achieve upwards of 4kbps for proximate data communication, as well as room-level privacy protection against electronic eavesdropping.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
MobiSys ’17, June 19–23, 2017, Niagara Falls, NY, USA.
© 2017 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ISBN 978-1-4503-4928-4/17/06...$15.00
DOI: http://dx.doi.org/10.1145/3081333.3081366

1. INTRODUCTION
This paper shows the possibility of creating sounds that humans cannot hear but microphones can record. This is not because the sound is too soft or just at the periphery of the human frequency range. The sounds we create are actually 40kHz and above, completely outside both the human's and the microphone's range of operation. However, given that microphones possess inherent non-linearities in their diaphragms and power amplifiers, it is possible to design sounds that exploit this property. To elaborate, we shape the frequency and phase of sound signals and play them through ultrasound speakers; when these sounds pass through the non-linear amplifier at the receiver, the high frequency sounds are expected to create a low-frequency “shadow”. The “shadow” is within the filtering range of the microphone and thereby gets recorded as normal sound. Figure 1 illustrates the effect. Importantly, the microphone does not require any modification, enabling billions of phones, laptops, and IoT devices to leverage the capability. This paper presents BackDoor, a system that develops the technical building blocks for harnessing this opportunity, leading to new applications in security and communications.

Figure 1: The main idea underlying BackDoor. (Axes: amplitude versus frequency, with ticks at 10K, 20K, 24K, 40K, and 50K; regions labeled audible sound, near ultrasound, and ultrasound; an inaudible tone pair creates a shadow, the signal inside the microphone, within the microphone filter.)

Security: Given that microphones record these inaudible sounds, it should be possible to silently jam spy microphones from recording. Military and government officials can secure private and confidential meetings from electronic eavesdropping; cinemas and concerts can prevent unauthorized recording of movies and live performances. We also realized the possibility of security threats. Denial-of-service (DoS) attacks on sound devices are typically considered difficult as the jammer can be easily detected. However, BackDoor shows that inaudible jammers can disable hearing aids and cellphones without getting detected. For example, during a robbery, the perpetrators can prevent people from making 911 calls by silently jamming all phones' microphones.

Communications: Ultrasound systems today aim to achieve inaudible data transmissions to the microphone [34]. However, they suffer from limited bandwidth, around 3kHz, since they must remain above the human hearing range (20kHz) and below the microphone's cutoff frequency (24kHz). Moreover, the FCC imposes strict power restrictions on these bands since they are partly audible to infants and pets [20]. BackDoor is free of these limitations. Using an ultrasound-based transmitter, it can utilize the entire microphone spectrum for communication. Thus, IoT devices could find an alternative channel for communication, reducing the growing load on Bluetooth (BLE). Museums and shopping malls could use acoustic beacons to broadcast information about nearby art pieces or products. Various ultrasound ranging schemes that compute the time of flight of signals could benefit from the substantially higher bandwidth in BackDoor.
This paper focuses on developing the technical primitives that enable these applications. In the simplest case, BackDoor plays two tones at, say, 40kHz and 50kHz. When these tones arrive together at the microphone's power amplifier, they are amplified as expected, but also multiplied due to fundamental non-linearities in the system. Multiplication of frequencies f1 and f2 results in frequency components at (f1 − f2) and (f1 + f2). Given that (f1 − f2) is 10kHz in this case, well within the microphone's range, the signal passes unaltered through the low pass filter (LPF). Human ears, on the other hand, do not exhibit such non-linearities and completely filter out the 40kHz and 50kHz sounds.

While the above is a trivial case of sending a tone, BackDoor intends to load data on transmitted carrier signals and demodulate the “shadow” after receiving it through the microphone. This entails challenges. First, the non-linearities we intend to exploit are not unique to the microphone; they are also present in the speakers that transmit the sounds. As a result, the speaker also produces a “shadow” within the audible range, making its output audible to humans. We address this by using multiple speakers and isolating the signals in frequency across the speakers. We show, both analytically and empirically, that none of these isolated sounds create a “shadow” as they pass through the speaker's diaphragm and amplifier. However, once these sounds arrive and combine non-linearly inside the microphone, the “shadow” emerges within the audible range.

Second, for communication applications, standard modulation and coding schemes cannot be used directly. Section 4.1 shows how appropriate frequency modulation, combined with inverse filtering, resonance alignment, and ringing mitigation, is needed to boost achievable data rates.

Finally, for security applications, jamming requires transmitting noisy signals that cover the entire audible frequency range. With audible jammers, this requires speakers to operate at very high volumes. Section 4.2 describes how BackDoor is designed to achieve equally effective jamming, but in complete silence. We leverage the adaptive gain control (AGC) in microphones, in conjunction with selective frequency distortion, to improve jamming at modest power levels.

The final BackDoor prototype is built on customized ultrasound speakers and evaluated for both communication and security applications across different types of mobile devices. Our results reveal the following:

 • 100 different sounds played to 7 individuals confirmed that BackDoor was completely inaudible.

 • BackDoor attained data rates of 4 kbps at a distance of 1 meter, and 2 kbps at 1.5 meters – this is 2× higher in throughput and 5× higher in distance than systems that use the near-ultrasound band.

 • BackDoor is able to jam and prevent the recording of any conversation within a radius of 3.5 meters (and po- [...] and a speech recognition software [2], less than 15% of the words were decoded correctly. Audible jammers, aiming at comparable performance, would need to play white noise at a loudness of 97 dBSPL, considered seriously harmful to human ears [19].

In sum, this paper makes the following contributions:

 • Exploits non-linearities in off-the-shelf microphones to enable a “backdoor” from high to low frequencies. This backdoor permits playback of high frequency sounds that are inaudible to humans and yet recordable through microphones.

 • Builds enabling primitives for applications in acoustic communication and privacy. The acoustic radio outperforms today's near-ultrasound systems, while jamming raises the bar against eavesdropping.

The subsequent sections expand on these contributions. We begin with an acoustic primer, followed by intuitions, system design, and evaluation.

2. ACOUSTIC SYSTEMS PRIMER

Common Microphone Systems
Any sound recording system requires two main modules – a transducer and an analog-to-digital converter (ADC). The transducer contains a “diaphragm” that vibrates due to sound pressure, producing a proportional change in voltage. The ADC measures this voltage variation (at a fixed sampling frequency) and stores the samples in memory. These samples represent the recorded sound in the digital domain.

A practical microphone needs two more components between the diaphragm and the ADC, namely a pre-amplifier and a low pass filter. Figure 2 shows the pipeline. The pre-amplifier's task is to amplify the output of the transducer by a gain of around 10× so that the ADC can measure the signal effectively using its predefined quantization levels. Without this amplification, the signal is too weak (around tens of millivolts).

Figure 2: The sound recording signal flow. (Sound → Mic → voltage signal → Pre-amp → amplified signal → Low-pass → band-limited signal → ADC → digital samples.)

As per Nyquist's law, if the ADC's sampling frequency is fs Hz, the sound must be band limited to fs/2 Hz to avoid aliasing and distortions. Since natural sound can spread over a wide band of frequencies, it needs to be low pass filtered (i.e., frequencies greater than fs/2 removed) before the A/D conversion. Since ADCs in today's microphones operate at 48kHz, the low pass filters (LPFs) are designed to cut off signals at 24kHz. Figure 3 shows the effect of the low pass (or anti-aliasing) filter on the recorded sound spectrum.

Figure 3: The digital spectrum with and without the (anti-aliasing) low-pass filter.
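The Nyquist constraint in the primer can be made concrete with a small calculation. The sketch below is illustrative only: the function names are hypothetical and not part of BackDoor. It computes the fs/2 cutoff for a 48kHz ADC and shows where an ultrasonic tone would fold if it reached the ADC without an anti-aliasing filter.

```python
def nyquist_cutoff_hz(fs_hz: float) -> float:
    """Highest frequency that can be represented at sampling rate fs_hz."""
    return fs_hz / 2.0


def alias_frequency_hz(f_hz: float, fs_hz: float) -> float:
    """Apparent frequency of a pure tone at f_hz when sampled at fs_hz
    with no low pass (anti-aliasing) filter in front of the ADC."""
    folded = f_hz % fs_hz
    # Frequencies above fs/2 fold back (mirror) around fs/2.
    return min(folded, fs_hz - folded)


if __name__ == "__main__":
    fs = 48_000.0  # sampling rate quoted in the text
    print(nyquist_cutoff_hz(fs))            # LPF cutoff implied by Nyquist
    print(alias_frequency_hz(40_000.0, fs)) # where an unfiltered 40kHz tone would land
    print(alias_frequency_hz(10_000.0, fs)) # an in-band tone is unchanged
```

This is why the low pass filter sits before the ADC: it removes out-of-band tones rather than letting them fold into the recorded band.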
Sound Playback through Speakers
Sound playback is simply the reverse of recording. Given a digital signal as input, the digital-to-analog converter (DAC) produces the corresponding analog signal and feeds it to the speaker. The speaker's diaphragm oscillates in response to the applied voltage, producing varying sound pressures in the medium, which are then audible to humans.

Linear and Non-linear Behavior
Modules inside a microphone are mostly linear systems, meaning that the output signals are linear combinations of the input. In the case of the pre-amplifier, if the input sound is S, then the output can be represented by

$$S_{out} = A_1 S$$

Here A1 is a complex gain that can change the phase and/or amplitude of the input frequencies, but does not generate spurious new frequencies. This behavior makes it possible to record an exact (but higher-power) replica of the input sound and play it back without distortion.

In practice, however, acoustic amplifiers maintain strong linearity only in the audible frequency range; outside this range, the response exhibits non-linearity. The diaphragm also exhibits similar behavior. Thus, for f > 25kHz, the net recorded sound Sout may be expressed in terms of the input sound S as follows:

$$S_{out}\Big|_{f>25} = \sum_{i=1}^{\infty} A_i S^i = A_1 S + A_2 S^2 + A_3 S^3 + \dots$$

While in theory the non-linear output is an infinite power series, the third and higher order terms are extremely weak and can be ignored. BackDoor finds opportunities to exploit the second order term, which can be manipulated by designing the input signal S.

3. CORE INTUITION AND VALIDATION
As mentioned earlier, our core idea is to operate the microphone at high (inaudible) frequencies, thereby invoking the non-linear behavior in the diaphragm and pre-amplifier. This is counter-intuitive because most researchers and engineers strive to avoid non-linearity. In our case, however, we intend to create an inlet into the audible frequency range

[...] terms, however, is a multiplication of signals, resulting in various frequency components, namely, 2ω1, 2ω2, (ω1 − ω2), and (ω1 + ω2). Mathematically,

$$A_2 (S_1 + S_2)^2 = A_2\Big[\,1 - \tfrac{1}{2}\cos(2\omega_1 t) - \tfrac{1}{2}\cos(2\omega_2 t) + \cos((\omega_1 - \omega_2)t) - \cos((\omega_1 + \omega_2)t)\Big]$$

With the microphone's cutoff at 24kHz, all of the above frequencies in Sout get filtered out by the LPF, except Cos((ω1 − ω2)t), which is essentially a 10kHz tone. The ADC is oblivious to how this 10kHz signal was generated and records it like any other sound signal. We call this the “shadow” signal. The net effect is that a completely inaudible frequency has been recorded by unmodified off-the-shelf microphones.

3.1 Measurements and Validation
For the above idea to work with unmodified off-the-shelf microphones, two assumptions need validation. (1) The diaphragm of the microphone should exhibit some sensitivity at the high-end frequencies (> 30kHz). If the diaphragm does not vibrate at such frequencies, there is no opportunity for non-linear mixing of signals. (2) The second order coefficient A2 needs to be adequately high to achieve a meaningful signal-to-noise ratio (SNR) for the shadow signal, while the third and fourth order coefficients (A3, A4) should be negligibly weak. We verify these next.

(1) Sensitivity to High Frequencies: Figure 4 reports the results when a 60kHz sound was played through an ultrasonic speaker and recorded with a programmable microphone circuit. To verify the presence of a response at this high frequency, we “hacked” the circuit using an FPGA kit, and tapped into the signal before it entered the low pass filter (LPF). Figure 4(a) shows the clear detection of the 60kHz tone, confirming that the diaphragm indeed vibrates to ultrasounds. We also measured the channel frequency response at the output of the pre-amplifier (before the LPF): Figure 4(b) illustrates the results. The takeaway message is that the analog components indeed operate at a much wider bandwidth; it is the digital domain that restricts the operating range.

Figure 4 (caption not recovered; axis labels: Power (dB/Hz) and Magnitude (dBV); marked point at (60KHz, -47dB)).
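The two-tone mixing described in Section 3 can be checked numerically. The following is a minimal sketch, not the authors' measurement code: it models the microphone as a memoryless second-order non-linearity A1·S + A2·S² (the values of A1 and A2, the simulation rate, and the function names are all assumptions chosen for illustration) and looks for the strongest component below the 24kHz cutoff.

```python
import numpy as np


def two_tone_spectrum(f1_hz=40_000.0, f2_hz=50_000.0, a1=1.0, a2=0.1,
                      fs_hz=400_000.0, n_samples=4000):
    """One-sided magnitude spectrum of A1*S + A2*S^2 for S = sin(w1 t) + sin(w2 t).

    fs_hz is a simulation rate standing in for the analog domain (assumed),
    not the 48kHz rate of the microphone's ADC.
    """
    t = np.arange(n_samples) / fs_hz
    s = np.sin(2 * np.pi * f1_hz * t) + np.sin(2 * np.pi * f2_hz * t)
    s_out = a1 * s + a2 * s ** 2            # linear term plus second-order term
    spectrum = np.abs(np.fft.rfft(s_out)) / n_samples
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs_hz)
    return freqs, spectrum


def strongest_tone_below(freqs, spectrum, cutoff_hz=24_000.0):
    """Frequency of the largest non-DC component at or below cutoff_hz."""
    mask = (freqs > 0) & (freqs <= cutoff_hz)
    return freqs[mask][np.argmax(spectrum[mask])]


if __name__ == "__main__":
    freqs, spectrum = two_tone_spectrum()
    print(strongest_tone_below(freqs, spectrum))
```

With the default parameters, the only non-DC component below 24kHz is the difference tone at f2 − f1 = 10kHz; setting a2 to zero removes it, which matches the claim that the shadow arises purely from the second order term.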
Figure 5: (a) The intermodulation distortion of the signal (2nd order). (b) Harmonic distortion (harmonics). (Axes: Magnitude (dB) versus Frequency (KHz).)

3.2 Hardware Generalizability
Before concluding this section, we report measurements to confirm that non-linearities are present in different kinds of hardware (not just a specific make or model). To this end, we played high frequency sounds and recorded them across a variety of devices, including smartphones (iPhone 5S, Samsung Galaxy S6), smartwatch (Samsung Gear2), video camera (Canon PowerShot ELPH 300HS), hearing aids (Kirkland Signature 5.0), laptop (MacBook Pro), etc. Figure 6 summarizes the SNR for the shadow signals for each of these devices. The shadow signal is conspicuous on every device, suggesting potential for widespread applicability.

Figure 6 (caption not recovered; axis label: BackDoor Signal (dB); device labels include iPhone and Laptop).

Now, when this signal arrives at the microphone and passes through the non-linearities, the squared components of the amplifier's output will be:

$$
\begin{aligned}
S^2_{out,AM} &= A_2\big(a\,\sin(\omega_m t)\,\sin(\omega_c t)\big)^2 \\
&= A_2\,\frac{a^2}{4}\big(\cos(\omega_c t - \omega_m t) - \cos(\omega_c t + \omega_m t)\big)^2 \\
&= -A_2\,\frac{a^2}{4}\cos(2\omega_m t) + (\text{terms with frequencies above } \omega_c \text{ and DC})
\end{aligned}
$$

The result is a signal that contains a Cos(2ωm t) component. So long as ωm, the frequency of the data signal, is less than 10kHz, the corresponding shadow at 2ωm is below 20kHz and hence within the LPF cutoff. Thus, the received sound data can be band pass filtered in software, and the data signal correctly demodulated.

Importantly, the above phenomenon is reminiscent of coherent demodulation in conventional radios, where the receiver would have multiplied the modulated signal (aSin(ωm t)Sin(ωc t)) with the frequency and phase-synchronized carrier signal Sin(ωc t). The result would be the m(t) signal in baseband, i.e., with the carrier frequency ωc eliminated. Our case is somewhat similar – the carrier also gets eliminated, and the message signal appears at 2ωm (instead of ωm). This is hardly a problem since the signal can
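The squared-AM derivation in this section can be verified with a short simulation. This is a sketch under stated assumptions rather than the paper's implementation: the message frequency, carrier frequency, coefficient A2, simulation rate, and the function name are illustrative choices. It squares an AM signal a·sin(ωm t)·sin(ωc t) and reports the strongest component below the 24kHz cutoff.

```python
import numpy as np


def am_shadow(fm_hz=5_000.0, fc_hz=40_000.0, a=1.0, a2=0.1,
              fs_hz=400_000.0, n_samples=4000, cutoff_hz=24_000.0):
    """Return (frequency, magnitude) of the strongest non-DC component
    at or below cutoff_hz in A2 * (a sin(wm t) sin(wc t))^2.

    All default values are assumptions for illustration.
    """
    t = np.arange(n_samples) / fs_hz
    s_am = a * np.sin(2 * np.pi * fm_hz * t) * np.sin(2 * np.pi * fc_hz * t)
    squared = a2 * s_am ** 2                      # second-order term only
    spectrum = np.abs(np.fft.rfft(squared)) / n_samples
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs_hz)
    mask = (freqs > 0) & (freqs <= cutoff_hz)     # what survives the LPF, minus DC
    peak = np.argmax(spectrum[mask])
    return freqs[mask][peak], spectrum[mask][peak]


if __name__ == "__main__":
    shadow_hz, magnitude = am_shadow()
    print(shadow_hz, magnitude)
```

With a 5kHz message tone, the surviving component sits at 2·fm = 10kHz rather than at fm, and its one-sided magnitude in this normalization is A2·a²/8, consistent with the A2·a²/4 coefficient on Cos(2ωm t) in the derivation.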
Note that the first term comes from the FM modulated ωc signal, and the second term from the ωs secondary carrier. Now, upon arriving at the receiver, the microphone's non-linearity

[...]

$$
\begin{aligned}
S_{out} &= \sin(\omega_c t + \beta\sin(\omega_m t)) * \big(k_0\,\delta(t) + k_1\,\delta(t-1)\big) \\
&= k_0 \sin(\omega_c t + \beta\sin(\omega_m t)) + k_1 \sin\big(\omega_c (t-1) + \beta\sin(\omega_m (t-1))\big)
\end{aligned}
$$

While Sout contains only high frequency components (since convolution is linear), the non-linear counterpart S²out mixes the frequencies in a way that has lower frequency components (or shadows):

$$S^2_{out} = k_0 k_1 \cos\Big(\omega_c + 2\beta\,\sin\big(\tfrac{\omega_m}{2}\big)\cos\big(\omega_m t - \tfrac{\omega_m}{2}\big)\Big) + (\text{terms with frequencies over } 2\omega_c \text{ and DC})$$

Figure 10 shows the spectrum of Sout and S²out, with and without the convolution. Observe the low frequency “shadow” that appears due to the second order term for the convolved signal – this shadow causes the ringing and is noticeable to humans.

Figure 10 (caption not recovered; panel titles: 1st Order and 2nd Order; row label: w/o convolution).

Figure 11: (a) Frequency response of the ultrasonic speaker. (b) Inverse filtering method almost eliminates the ringing effect compared to Figure 9. (Axes: (a) Power (dB/Hz) versus Frequency (KHz); (b) Ampl. (V) versus Time (ms), for input and output.)

Receiver Design
This completes the transmitter design, and the receiver is now an unmodified microphone (from off-the-shelf phones, cameras, laptops, etc.). Of course, to extract the data bits, we need to receive the output signal from the microphone and decode it in software. For example, in smartphones, we have used the native recording app, and operated on the stored signal output. The decoding steps are as follows.

We begin by band pass filtering the signal as per the mod-
                                                                                 Power (db/Hz)
                             Power (db/Hz)
                                             -20                                                   -20         audible	                 move the negative frequencies, we Hilbert Transform the
                                                                                  Power (db/Hz)
                          Power (db/Hz)
Power (dB/Hz)
                             (10KHz, -18dB)
                                                                                         (10KHz, -3dB)
                   -40                                                    -40                                      sation/tolerance. This is a key advantage of jamming with
                   -60                                                    -60                                      BackDoor. Nonetheless, we still attempt to lower the power
                   -80                                                    -80                                      requirement by injecting additional frequency distortions at
                  -100                                                   -100                                      the eavesdropper’s microphone.
                  -120                                                   -120
                         4    6       8        10   12                          4        6          8    10   12
                             Frequency (KHz)                                            Frequency (KHz)                                         4
                                                                                                                              Frequency (KHz)
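The low-frequency shadow created by a second-order term can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes a hypothetical memoryless microphone model (linear gain `a1` plus a second-order coefficient `a2`, both illustrative values), a 192 kHz simulation rate, and the <40, 45> kHz tone pair used later in the evaluation. All function names are made up for this example.

```python
import math

FS = 192_000   # simulation sample rate in Hz (assumed; above twice the highest mixing product)
N = 19_200     # 100 ms of samples, so every tone of interest spans whole cycles


def tone_amplitude(signal, freq_hz, fs=FS):
    """Amplitude of one frequency component, computed as a single-bin DFT."""
    re = sum(s * math.cos(2 * math.pi * freq_hz * n / fs) for n, s in enumerate(signal))
    im = sum(s * math.sin(2 * math.pi * freq_hz * n / fs) for n, s in enumerate(signal))
    return 2 * math.hypot(re, im) / len(signal)


def two_tones(f1_hz, f2_hz, n=N, fs=FS):
    """Sum of two unit-amplitude sine tones, as played by the two speakers."""
    return [
        math.sin(2 * math.pi * f1_hz * k / fs) + math.sin(2 * math.pi * f2_hz * k / fs)
        for k in range(n)
    ]


def microphone(signal, a1=1.0, a2=0.1):
    """Hypothetical non-linear response: a1*s + a2*s^2 (coefficients are illustrative)."""
    return [a1 * s + a2 * s * s for s in signal]


played = two_tones(40_000, 45_000)   # inaudible tone pair in the air
recorded = microphone(played)        # signal after the assumed non-linearity

shadow_in_air = tone_amplitude(played, 5_000)
shadow_in_mic = tone_amplitude(recorded, 5_000)
print(f"5 kHz component in air: {shadow_in_air:.6f}")
print(f"5 kHz component after the non-linearity: {shadow_in_mic:.6f}")
```

Under this model the squared term contributes a difference-frequency component at 45 − 40 = 5 kHz with amplitude `a2`, while the played signal itself contains no energy at 5 kHz. Real microphones have additional filtering and higher-order terms that this sketch ignores.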
played through the ultrasound speakers. Section 5 will report results on word legibility, as a function of the separation between the jammer and the spy microphone.

5.    EVALUATION

BackDoor was evaluated on 3 main metrics: (1) human audibility, (2) throughput, packet error rates (PER) and bit error rates (BER) for data communication, and (3) the efficacy of [...] BackDoor for various frequencies, modulations, and SNR levels. Except for amplitude modulation (AM), all the human volunteers reported complete silence.

• Figures 17 and 18 report the variation of throughput against increasing distance, different phone orientations, and impact of acoustic interference. The results show throughput of 4 kbps at 1 meter away which is 2× to 4× higher than today's mobile ultrasound communication systems.
• Figure 19 compares the jamming radius for BackDoor and audible white noise-based jammers. To achieve the same jamming effect (say, < 15% words legible by humans), we find that the audible jammer requires a loudness of 97 dBSPL which is similar to a jackhammer and can cause severe damage to humans [19]. BackDoor, on the other hand, remains completely silent. Conversely, when the white noise sound level is made tolerable, the legibility of the words was 76%.

We elaborate on these results below, starting with details on our implementation platform.

5.1    Implementation
(1) Transmitter Speakers: Figure 15(a) and (b) show two different transmitter prototypes we have developed, the first one for communication and the other for jamming. The communication transmitter consists of two ultrasonic piezoelectric speakers [33]; each transmits a separate frequency as described in Section 4. A programmable waveform generator (Keysight 33500b series) drives the speakers with frequency modulated signals. The signals are amplified using an NE5535AP op-amp based non-inverting amplifier, permitting signals up to 150kHz. The jamming transmitter in Figure 15(b) is composed of two speaker arrays, each array with 9 piezoelectric speakers connected in parallel to generate a 2 Watt jamming signal. The signals driving these arrays are first amplified using an LM380 op-amp based power amplifier separately powered from a constant DC-voltage source. Figure 16 shows the circuit diagram of the speaker array.

[Figure 16 diagram labels: modulated input signal Vin, LM380 with heat sink, 9-element ultrasonic speaker array, passive components 0.1µF, 10KΩ, 50KΩ, 100KΩ.]
Figure 16: The circuit diagram of the jamming transmitter.

(2) Receiver Microphones: We experiment with two types of receivers. The first is an off-the-shelf Samsung Galaxy S6 smartphone (released in Aug, 2015) running Android OS 5.1.1. Signals are recorded through a custom Android app using the standard APIs. The second receiver is shown in Figure 15(c), a more involved setup that was mainly used for micro-benchmarks reported earlier in Sections 3 and 4. This allowed us to tap into different components of the microphone pipeline, and analyze signals in isolation. The system runs on a high bandwidth data acquisition ZedBoard, a Xilinx Zynq-7000 SoC based FPGA platform [12], that offers a high-rate internal ADC (up to 1 Msample/sec). A MEMS microphone (ADMP 401) is externally connected to this ADC, offering undistorted insights into higher frequency bands of the spectrum.

5.2   Human Audibility Results
We played BackDoor signals to a group of 7 users (ages between 27 and 38) seated around a table 1 to 3 meters away from the speakers. Each user reported the perceived loudness of the sound on a scale of 0-10, with 0 being perceived silence. As a baseline, we also played audible sounds and
Reference Mic.   2kHz Tone           5kHz Tone           FM                  AM                  White Noise
SNR (dB)         BackDoor  Audible   BackDoor  Audible   BackDoor  Audible   BackDoor  Audible   BackDoor  Audible
25               0         0.75      0         3.33      0         1.2       0         0.46      0         0.1
30               0         1.5       0         4.08      0         2.3       0.1       1.36      0         0.26
35               0         2         0         4.91      0         3.5       0.1       1.85      0         0.5
40               0         2.67      0         5.42      0         4.2       0.16      2.4       0         0.8
45               0         3.17      0         6.17      0         4.8       0.68      3.06      0         1.24
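As a quick way to read the table, the sketch below encodes its values and averages each column over the five SNR rows. Averaging across SNR levels is an illustrative summary chosen here, not a statistic reported by the authors, and the names `TABLE_1` and `mean_loudness` are hypothetical.

```python
# Perceived loudness (scale 0-10) from the table above:
# SNR in dB -> {signal type: (BackDoor loudness, audible loudness)}
TABLE_1 = {
    25: {"2kHz Tone": (0, 0.75), "5kHz Tone": (0, 3.33), "FM": (0, 1.2),
         "AM": (0, 0.46), "White Noise": (0, 0.1)},
    30: {"2kHz Tone": (0, 1.5), "5kHz Tone": (0, 4.08), "FM": (0, 2.3),
         "AM": (0.1, 1.36), "White Noise": (0, 0.26)},
    35: {"2kHz Tone": (0, 2), "5kHz Tone": (0, 4.91), "FM": (0, 3.5),
         "AM": (0.1, 1.85), "White Noise": (0, 0.5)},
    40: {"2kHz Tone": (0, 2.67), "5kHz Tone": (0, 5.42), "FM": (0, 4.2),
         "AM": (0.16, 2.4), "White Noise": (0, 0.8)},
    45: {"2kHz Tone": (0, 3.17), "5kHz Tone": (0, 6.17), "FM": (0, 4.8),
         "AM": (0.68, 3.06), "White Noise": (0, 1.24)},
}


def mean_loudness(table):
    """Average BackDoor and audible loudness per signal type over all SNR rows."""
    signals = list(next(iter(table.values())).keys())
    result = {}
    for sig in signals:
        backdoor = [row[sig][0] for row in table.values()]
        audible = [row[sig][1] for row in table.values()]
        result[sig] = (sum(backdoor) / len(backdoor), sum(audible) / len(audible))
    return result


averages = mean_loudness(TABLE_1)
# Signal types for which every BackDoor entry in the table is zero.
silent_signals = [sig for sig, (backdoor, _) in averages.items() if backdoor == 0]

for sig, (backdoor, audible) in averages.items():
    print(f"{sig:12s} BackDoor: {backdoor:.3f}  Audible: {audible:.3f}")
print("Reported as silent at every SNR:", silent_signals)
```

Only the AM column has non-zero BackDoor entries, and even there the averaged BackDoor loudness stays well below the audible baseline.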
[Figure 17 panels: (a) Throughput (Kbps) vs. Distance (meter), coding rate 3/4; (b) Throughput (Kbps) for Chirp, Blurt, PriWhisper, Dhwani, and BackDoor; (c) packet error rate vs. Orientation (Y, -Y, X, -X, Z, -Z) for the primary and secondary mic; (d) phone axes X, Y, Z with the primary and secondary mic positions.]
Figure 17: BackDoor Communication Results: (a) Throughput vs. Distance, (b) Throughput comparison against related P2P communication schemes, (c) Packet error rate vs. Orientation, (d) Phone orientations.
asked the users to report the loudness levels. A reference microphone is placed at 1m from the speaker to record and compute the SNR (Signal to Noise Ratio) of all the tested sounds. We varied the SNR and equalized them at the microphone for fair comparison between audible and inaudible (BackDoor) sounds.

Four types of signals were played:
(1) Single Tone Unmodulated Signals: In the simplest form, we transmitted multiple pairs of ultrasonic tones (<40, 42> and <40, 45> kHz) that generate a single audible frequency tone in the microphone. As baseline, we separately played a 2kHz and 5kHz audible tone.
(2) Frequency Modulated Signals: We modulated the frequency of a 40kHz primary carrier with a 3kHz signal. We also transmitted a 45kHz secondary carrier on the second speaker, producing a 3kHz FM signal centered at 5kHz in the microphone. As baseline, we played the equivalent audible FM signal on the same speakers.
(3) Amplitude Modulated Signals: Similar to FM signals, we created these AM signals by modulating the amplitude of a 40kHz signal with a 3kHz tone.
(4) White Noise Signals: Finally, we generated white Gaussian noise with zero mean and variance proportional to the transmitted power, at a bandwidth of 8kHz, band-limited to [40, 48]kHz. We also transmitted a 40kHz tone on the second speaker to frequency shift the white noise to the audible range of the speaker. As baseline, we created audible white noise with the same properties band-limited to [0, 8]kHz and played it on the speakers.

Audibility Vs. SNR
Table 1 summarizes the average of perceived loudness that users reported for both BackDoor and audible signals as a function of the SNR measured at the reference microphone. For all types of signals except amplitude modulation (AM), BackDoor is completely inaudible to all the users. AM signals are audible due to speaker non-linearity, as described earlier. However, the perceived loudness of BackDoor is significantly lower than that of audible signals. Thus, so long as we avoid AM, BackDoor signals remain inaudible to humans but produce audible signals inside microphones with the same SNR as loud audible signals.

5.3   Communication Results
The BackDoor transmitter is the 2-speaker system while the receiver is the Samsung smartphone. The recorded acoustic signal is extracted and processed in MATLAB; we compute bit error rate (BER), packet error rate (PER) and throughput under varying parameters. Overall, 40 hours of acoustic transmission was performed to generate the results.

Throughput
Figure 17(a) reports BackDoor's net end-to-end throughput for increasing separation between the transmitter and the receiver. BackDoor can achieve a throughput of 4kbps at 1m, 2kbps at 1.5m and 1kbps at 2m. Figure 17(b) compares BackDoor's performance in terms of throughput and range with state-of-the-art mobile acoustic communication systems (in both commercial products [1, 13] and research [34, 22]). The figure shows that BackDoor achieves 2× to 80× higher throughput. This is because these systems are constrained to a very narrow communication band whereas BackDoor is able to utilize the entire audible bandwidth.

Impact of Phone Orientation
Figure 17(c) shows the packet error rate (PER) when data is decoded by the primary and secondary microphones in the phone, placed in 6 different orientations (shown in Figure 17(d)). The aim here is to understand how real-world use of the phone impacts data delivery. To this end, the phone was held at a distance of 1m away from the transmitter, and the orientation was changed after each transmission session. The plot shows that except Y and −Y, the other orientations are comparable. This is because the Y/−Y orientations align the two receivers and transmitters in almost a straight line, resulting in maximal SNR difference. Hand blockage of the further-away microphone makes the
SNR gap pronounced. It should be possible to compare the SNR at the microphones and select the better microphone for minimized PER (regardless of the orientation).

Impact of Interference
Figure 18(a) reports the bit error rate (BER) variation against 3 different audible interference sources. To elaborate, we played audible interference signals (a presidential speech, orchestral music, and white noise) from a nearby speaker, while the data transmission was in progress. The intensity of the interference at the microphone was at 70 dBSPL, equaling the level of volume one hears on average in face-to-face conversations. This is certainly much louder than average ambient noise, and hence, this serves as a strict test for BackDoor's resilience to interference. Also, the smartphone receiver was placed 1m away from the speaker, and transmissions were at 2kbps and 4kbps.
As is evident from the graph, voice and music have minimal impact on the communication error. On the other hand, white noise can severely degrade performance. Figure 18(b) plots the power spectral density of each interference; the decay beyond 4kHz for voice and music explains the performance plots. Put differently, since BackDoor operates around 10kHz frequency, voice and music signals do not affect the band as much as white noise, which remains flat over the entire spectrum.

[Figure 18 panels: (a) bit error rate at bit-rates 2K and 4K; (b) PSD (dB/Hz) of each interference source.]

[...] for Bob, and the words played are derived from Google's Trillion Word Corpus [10]; we pick the 2000 most frequent words, prescribed as a good benchmark [35]. As mentioned earlier, the volume of this playback is set to 70 dBSPL at 1m away. Now, the BackDoor prototype plays an inaudible jamming signal through its ultrasonic speakers to jam these speech signals.
Baseline: Our baseline comparison is essentially against audible white noise-based jammers in today's markets. Assuming BackDoor jams up to a radius of R, we compute the loudness needed by white noise to jam the same radius. All in all, 14 hours of sound was recorded and a total of 25,000 words were tested. The ASR software is the open-source Sphinx4 library (pre-alpha version) published by CMU [2, 21]. We present the results next.

Audible and Inaudible Jamming Radius
Figure 19(a) plots L_asr and L_human for increasing jamming radius. Even with a 1W power, a radius of 3.5m (around 11 feet) can be jammed around Bob. We compare against audible noise jammers presented in Figure 19(b). For jamming at the same radius of 3.5m, the loudness necessary for the audible white noise is 97 dBSPL which is the same as a jackhammer and can cause damage to the human ear [19]. Conversely, we find that when the audible white noise is made tolerable (comparable to a white noise smartphone app playing at full volume), the legibility becomes 76%. Thus, BackDoor is a clear improvement over audible jammers. More importantly, increasing the power of BackDoor [...]
[Figure 19 panels: legibility accuracy (%) vs. jamming distance (1.0m to 4.5m, and no jamming); loudness of sound sources (laptop at full volume, vacuum cleaner, audible jammer, jackhammer, jet engine); CDF of confidence score with no jamming and at distances 2m, 3m, 4m, 5m.]
                                                                                                                                                                                                            4.
                                                         5
                                                                                                                                                                                                            5
                                                        o
                                                                                                                                                                                                           o
Figure 19: Jamming results: (a) BackDoor jams a radius of 3.5m at 2W power. (b) White noise power needed
to match BackDoor is intolerable. (c) Jamming radius when BackDoor uses inaudible white noise, showing
importance of selectively jamming voice-centric harmonics. (d) Confidence of speech recognizer.
• Smarter Spy: We have assumed a fairly simple attacker planting a single microphone in the
vicinity. Multiple microphones, perhaps even with various beamforming capabilities, may be able to
extract the voice from the jamming signal. However, greater sophistication in jamming should be
feasible too, such as varying the jamming signal to prevent channel estimation, or even moving the
speakers. We leave this to future work.

• Interference with Phone Calls: Data communication with BackDoor can interfere with people
talking on the phone nearby. For this reason, data communication applications will inherently need
to operate at close range and at low power. One possibility is an acoustic NFC, but at greater
ranges of 1 or 2 feet. Alternatively, the communication could be made spread spectrum so that the
interference remains below the noise floor. Our ongoing work is investigating these unresolved
issues.

7. RELATED WORK
Literature in Acoustic Non-linearity: The literature in acoustic signal processing and
communication is extremely rich. The notion of exploiting non-linearity was originally studied in
1957 in Westervelt's seminal theory [43, 42], which later triggered a series of research efforts.
The core vision was that non-linearities of the air can naturally self-demodulate signals; when
combined with directional propagation of ultrasound signals, it may be possible to deliver audible
information over large distances using relatively low power [17, 14, 46]. Recently, there has been
a revival of these efforts with AudioSpotlight [5], SoundLazer [9, 8], and other projects
[47, 11, 36]. Our work, however, is the opposite of these efforts: we are attempting to retain the
inaudible nature of ultrasound while making it recordable inside electronic circuits.

Medical Devices: Human bones have also been shown to exhibit non-linearities that self-modulate
signals, resulting in applications in bone conduction ultrasound hearing aids for severely deaf
individuals [28, 15, 16, 37, 32]. Even bone conduction headphones are being considered that exploit
similar non-linearities [24].

Assorted Topics Related to BackDoor: A set of recent works bear some degree of relevance to
BackDoor. Dhwani [34] explores in-air sound signals as a short range, ad-hoc data transfer
modality. Chirp [1] and Zoosh [39, 13] have rolled out commercial products using sound as a secure
data exchange medium. GhostTalk [26] explores various attack scenarios on consumer electronics
using high power electromagnetic interference. Another thread of recent work has looked into
watermarking audio-visual media. Dolphin [41] enables speaker-microphone communication by embedding
data bits on the sound. It adapts the signal parameters in real-time to keep the embedded signal
imperceptible to human ears while achieving a 500 bps data rate. Kaleido [48] proposes a video
precoding based solution to prevent videotaping an on-screen show in a theater or on a website. It
precodes distortions in the video such that they are invisible to humans but severely distort
videotaping (due to specific limitations of the camera). Finally, sound maskers have also been used
for protecting private conversations; however, these techniques have been limited to audible
frequencies [18, 30, 6, 7]. BackDoor differs from the above in the sense that it exploits
discrepancies between humans and electronics, ultimately enabling a new capability to the best of
our knowledge.

8. CONCLUSION
Device non-linearity has been conventionally viewed as a peril. This paper breaks away from this
point of view and discovers various opportunities to harness non-linearity. By carefully designing
ultrasound signals, we demonstrate that such signals remain inaudible to humans but are recordable
by unmodified off-the-shelf microphones. This translates to new applications including inaudible
data communication, privacy, and acoustic watermarking. While our ongoing work is focused on deeper
understanding of these capabilities and applications, our longer term goal is focused on
generalization to other platforms, such as wireless radios and inertial sensors.

Acknowledgement
We sincerely thank the anonymous reviewers for their valuable feedback. We are grateful to the Joan
and Lalit Bahl Fellowship, Qualcomm, IBM, and NSF (award number: 1619313) for partially funding
this research.

9. REFERENCES
 [1] Chirp technology. http://www.chirp.io. Last accessed 28 November 2016.
 [2] CMU Sphinx. http://cmusphinx.sourceforge.net. Last accessed 6 December 2015.
 [3] High power bluetooth speaker: 12watt. https://www.cnet.com/products/jbl-pulse/specs/.
     Last accessed 28 November 2016.
 [4] High power bluetooth speaker: 38watt. http://www.fugoo.com/fugoo-tough-xl/.
     Last accessed 28 November 2016.
 [5] Holosonics webpage. https://holosonics.com. Last accessed 28 November 2016.
 [6] Sound masking device. http://www.oeler.com/sound-masking-systems/.
     Last accessed 28 November 2016.
 [7] Sound masking solutions. https://www.speechprivacysystems.com.
     Last accessed 28 November 2016.
 [8] Soundlazer kickstarter. https://www.kickstarter.com/projects/richardhaberkern/soundlazer.
     Last accessed 28 November 2016.
 [9] Soundlazer webpage. http://www.soundlazer.com. Last accessed 28 November 2016.
[10] Top 10000 words from Google's trillion word corpus.
     https://github.com/first20hours/google-10000-english. Last accessed 6 December 2015.
[11] Woody Norris TED talk. https://www.ted.com/speakers/woody_norris.
     Last accessed 28 November 2016.
[12] Zedboard. http://zedboard.org. Last accessed 28 November 2016.
[13] Zoosh technology. http://www.bdti.com/insidedsp/2011/07/28/naratte.
     Last accessed 28 November 2016.
[14] Bjørnø, L. Parametric acoustic arrays. In Aspects of Signal Processing. Springer, 1977,
     pp. 33–59.
[15] Deatherage, B. H., Jeffress, L. A., and Blodgett, H. C. A note on the audibility of
     intense ultrasonic sound. The Journal of the Acoustical Society of America 26, 4 (1954),
     582–582.
[16] Dobie, R. A., and Wiederhold, M. L. Ultrasonic hearing. Science 255, 5051 (1992),
     1584–1585.
[17] Fox, C., and Akervold, O. Parametric acoustic arrays. The Journal of the Acoustical
     Society of America 53, 1 (1973), 382–382.
[18] Goubran, R., and Botros, R. Adaptive sound masking system and method, June 5 2003.
     US Patent 20,030,103,632.
[19] Hamby, W. Ultimate sound pressure level decibel table, 2004.
[20] Heffner, H. E., and Heffner, R. S. Hearing ranges of laboratory animals. Journal of the
     American Association for Laboratory Animal Science 46, 1 (2007), 20–22.
[21] Huggins-Daines, D., Kumar, M., Chan, A., Black, A. W., Ravishankar, M., and
     Rudnicky, A. I. Pocketsphinx: A free, real-time continuous speech recognition system for
     hand-held devices. In Proceedings of ICASSP (2006).
[22] Iannucci, P. A., Netravali, R., Goyal, A. K., and Balakrishnan, H. Room-area networks.
     In Proceedings of the 14th ACM Workshop on Hot Topics in Networks (2015), ACM, p. 9.
[23] Jacobs, I. M., and Wozencraft, J. Principles of communication engineering.
[24] Kim, S., Hwang, J., Kang, T., Kang, S., and Sohn, S. Generation of audible sound with
     ultrasonic signals through the human body. In Consumer Electronics (ISCE), 2012 IEEE 16th
     International Symposium on (2012), IEEE, pp. 1–3.
[25] Kumar, S., and Furuhashi, H. Long-range measurement system using ultrasonic range sensor
     with high-power transmitter array in air. Ultrasonics 74 (2017), 186–195.
[26] Kune, D. F., Backes, J., Clark, S. S., Kramer, D., Reynolds, M., Fu, K., Kim, Y., and
     Xu, W. Ghost talk: Mitigating EMI signal injection attacks against analog sensors. In
     Security and Privacy (SP), 2013 IEEE Symposium on (2013), IEEE, pp. 145–159.
[27] Lee, E. A., and Messerschmitt, D. G. Digital communication. Springer Science & Business
     Media, 2012.
[28] Lenhardt, M. L., Skellett, R., Wang, P., and Clarke, A. M. Human ultrasonic speech
     perception. Science 253, 5015 (1991), 82–85.
[29] Lyons, R. G. Understanding Digital Signal Processing, 3/E. Pearson Education India, 2004.
[30] McCalmont, A. M. Voice privacy system with amplitude masking, Mar. 25 1980.
     US Patent 4,195,202.
[31] Mercy, D. A review of automatic gain control theory. Radio and Electronic Engineer 51,
     11.12 (1981), 579–590.
[32] Nakagawa, S., Okamoto, Y., and Fujisaka, Y.-i. Development of a bone-conducted ultrasonic
     hearing aid for the profoundly sensorineural deaf. Transactions of Japanese Society for
     Medical and Biological Engineering 44, 1 (2006), 184–189.
[33] Nakamura, T. Piezoelectric speaker, June 3 1986. US Patent 4,593,160.
[34] Nandakumar, R., Chintalapudi, K. K., Padmanabhan, V., and Venkatesan, R. Dhwani: secure
     peer-to-peer acoustic NFC. In ACM SIGCOMM Computer Communication Review (2013), vol. 43,
     ACM, pp. 63–74.
[35] Nation, P., and Waring, R. Vocabulary size, text coverage and word lists. Vocabulary:
     Description, acquisition and pedagogy 14 (1997), 6–19.
[36] Norris, E. Parametric transducer and related methods, May 6 2014. US Patent 8,718,297.
[37] Okamoto, Y., Nakagawa, S., Fujimoto, K., and Tonoike, M. Intelligibility of
     bone-conducted ultrasonic speech. Hearing research 208, 1 (2005), 107–113.
[38] Pérez, J. P. A., Pueyo, S. C., and López, B. C. AGC fundamentals. In Automatic Gain
     Control. Springer, 2011, pp. 13–28.
[39] Sherif, M. H. Protocols for secure electronic commerce. CRC press, 2016.
[40] Tretter, S. A. Communication System Design Using DSP Algorithms: With Laboratory
     Experiments for the TMS320C6713™ DSK. Springer Science & Business Media, 2008.
[41] Wang, Q., Ren, K., Zhou, M., Lei, T., Koutsonikolas, D., and Su, L. Messages behind the
     sound: real-time hidden acoustic signal capture with smartphones. In Proceedings of the
     22nd Annual International Conference on Mobile Computing and Networking (2016), ACM,
     pp. 29–41.
[42] Westervelt, P. J. The theory of steady forces caused by sound waves. The Journal of the
     Acoustical Society of America 23, 3 (1951), 312–315.
[43] Westervelt, P. J. Scattering of sound by sound. The Journal of the Acoustical Society of
     America 29, 2 (1957), 199–203.
[44] Whitlow, D. Design and operation of automatic gain control loops for receivers in modern
     communications systems. Microwave Journal 46, 5 (2003), 254–269.
[45] Xiong, F. Digital modulation techniques. Artech House, 2006.
[46] Yang, J., Tan, K.-S., Gan, W.-S., Er, M.-H., and Yan, Y.-H. Beamwidth control in
     parametric acoustic array. Japanese Journal of Applied Physics 44, 9R (2005), 6817.
[47] Yoneyama, M., Fujimoto, J.-i., Kawamo, Y., and Sasabe, S. The audio spotlight: An
     application of nonlinear interaction of sound waves to a new type of loudspeaker design.
     The Journal of the Acoustical Society of America 73, 5 (1983), 1532–1536.
[48] Zhang, L., Bo, C., Hou, J., Li, X.-Y., Wang, Y., Liu, K., and Liu, Y. Kaleido: You can
     watch it but cannot record it. In Proceedings of the 21st Annual International Conference
     on Mobile Computing and Networking (2015), ACM, pp. 372–385.