Unit 2: Sound/Audio System
Syllabus: Concept of sound system, music and speech, speech analysis, speech transformation
Prepared by: Er. Hemanta Bohora
Introduction
Sound is a type of energy that travels through matter in waves, usually through air, but
also through liquids and solids. At its core, sound is produced when an object vibrates,
causing the surrounding molecules to vibrate as well. This vibration travels in waves until
it reaches a listener's ear, where it’s perceived as sound.
• Definition: Sound is energy that moves in waves through a medium like air, water, or solids.
• Frequency: Determines pitch; a higher frequency means a higher pitch, and a lower frequency
  means a lower pitch.
• Creation: It’s generated by vibrations in an object, which cause surrounding molecules to vibrate
  and transmit the sound wave.
• Amplitude: Relates to volume; higher amplitude results in louder sounds, while lower amplitude
  results in quieter sounds.
• Human Perception: Sound waves reach the ear, causing the eardrum to vibrate, which the
  brain interprets as sound.
• Medium Requirement: Sound needs a medium (like air or water) to travel; it cannot move
  through a vacuum.
• Field of Study (Acoustics): Acoustics is the science of sound, exploring its production,
  transmission, and effects in various applications.
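The roles of frequency (pitch) and amplitude (loudness) can be made concrete with a short sketch. The snippet below is a minimal illustration, assuming NumPy and a 44.1 kHz sample rate; the `sine_tone` helper is a hypothetical name introduced only for this example.

```python
import numpy as np

SAMPLE_RATE = 44100  # samples per second (the CD-quality rate)

def sine_tone(frequency_hz, amplitude, duration_s=1.0):
    """Generate a sine wave: frequency sets the pitch, amplitude sets the loudness."""
    t = np.linspace(0.0, duration_s, int(SAMPLE_RATE * duration_s), endpoint=False)
    return amplitude * np.sin(2.0 * np.pi * frequency_hz * t)

low_quiet = sine_tone(220.0, 0.2)  # lower frequency (220 Hz), smaller amplitude: low, quiet tone
high_loud = sine_tone(880.0, 0.9)  # higher frequency (880 Hz), larger amplitude: high, loud tone

print(low_quiet.shape, high_loud.max())
```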
Speech Generation
• Speech Generation is the process of creating spoken language from
  text or other forms of input.
• It encompasses technologies like text-to-speech (TTS) systems,
  conversational AI, and voice cloning, and is critical in applications such
  as virtual assistants, accessibility tools, and robotics.
• The field has seen significant advancements with the introduction of
  neural network-based models like WaveNet and Tacotron, which
  produce highly natural, expressive speech.
• Challenges remain in making speech generation more natural, context-
  sensitive, and scalable, especially for multilingual or emotionally varied
  speech. Despite these challenges, speech generation continues to
  improve, enhancing human-computer interaction and accessibility
  across numerous industries.
Techniques Used in Speech Generation
1. Concatenative Synthesis:
   • A popular method for generating speech in which prerecorded human speech segments (such as
     syllables or phonemes) are stitched together to form sentences (see the sketch after this list).
   • Limitations: While this approach can produce highly natural-sounding speech, it requires a
     large database of recorded voice data and may sound robotic if the segments are joined poorly.
2. Parametric Synthesis:
   • Speech generation models that create audio waveforms from parameters such as pitch,
     duration, and voice quality. Formant synthesis and HMM-based synthesis are examples of
     parametric synthesis methods.
   • Limitations: Though efficient, the generated speech often sounds less natural than
     concatenative synthesis.
3. Neural Network-Based Models:
   • WaveNet (by DeepMind): A deep neural network that directly generates raw audio waveforms,
     achieving much higher naturalness in speech than traditional methods.
   • Tacotron and Tacotron 2: Text-to-speech systems that convert text into spectrograms, which
     are then turned into speech waveforms by another model (such as a vocoder).
   • Prosody Prediction: Neural networks are also employed to predict and generate
     natural-sounding prosody (intonation, pitch, rhythm) in generated speech.
4. End-to-End Models:
   • Recent advances involve end-to-end neural models that generate speech directly from text,
     without separate steps such as phonetic transcription, linguistic analysis, or waveform
     generation; the entire process is learned in one unified pipeline.
   • Example: FastSpeech and FastSpeech 2 are end-to-end systems that produce speech with high
     efficiency and naturalness.
5. Voice Cloning:
   • Voice cloning generates speech that mimics a specific person's voice from a limited amount
     of recorded data, typically using deep learning models that capture the unique
     characteristics of that person's voice and speech patterns.
   • Applications: Personalized virtual assistants, content creation, and accessibility tools.
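As referenced in technique 1, here is a minimal sketch of concatenative synthesis, assuming NumPy. The "prerecorded" units are stand-in sine tones, and `unit_db` and `concatenate_units` are hypothetical names introduced only for this illustration; a real system would draw its segments from a recorded-speech database.

```python
import numpy as np

SAMPLE_RATE = 16000  # a common sample rate for speech

def dummy_unit(freq_hz, duration_s=0.15):
    """Stand-in for a prerecorded speech segment (here just a sine tone)."""
    t = np.linspace(0.0, duration_s, int(SAMPLE_RATE * duration_s), endpoint=False)
    return 0.5 * np.sin(2.0 * np.pi * freq_hz * t)

# Stand-in "database" of prerecorded units, keyed by phoneme symbol.
unit_db = {"m": dummy_unit(150.0), "a": dummy_unit(220.0), "t": dummy_unit(400.0)}

def concatenate_units(phonemes, fade_samples=80):
    """Stitch units end to end, crossfading at each joint to hide the seams."""
    out = unit_db[phonemes[0]].copy()
    fade = np.linspace(0.0, 1.0, fade_samples)
    for p in phonemes[1:]:
        unit = unit_db[p]
        # Fade the tail of the output into the head of the next unit.
        out[-fade_samples:] = out[-fade_samples:] * fade[::-1] + unit[:fade_samples] * fade
        out = np.concatenate([out, unit[fade_samples:]])
    return out

speech = concatenate_units(["m", "a", "t"])  # crude waveform for "mat"
print(len(speech) / SAMPLE_RATE, "seconds")
```

Poorly matched joints are exactly what makes badly built concatenative systems sound robotic: the crossfade above hides amplitude discontinuities, but a real system must also match pitch and spectral shape across each joint.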
Basic Notations:
- The lowest periodic spectral component of the speech signal is called the
fundamental frequency. It is present in voiced sounds (a simple way of estimating it is
sketched after this list).
- A phone is the smallest speech unit, such as the m of mat and the b of bat in
English.
- Allophones mark the variants of a phone. For example, the aspirated p of pit and
the unaspirated p of spit are allophones of the English phoneme p.
- The morph marks the smallest speech unit that carries a meaning by itself.
- A voiced sound is generated through the vocal cords; m, v, and l are examples
of voiced sounds. The pronunciation of a voiced sound depends strongly on the
individual speaker.
- During the generation of an unvoiced sound, the vocal cords are open; f and s
are examples of unvoiced sounds.
- Vowels: speech sounds created by the relatively free passage of breath through the
larynx and oral cavity, e.g., a, e, i, o, and u.
- Consonants: speech sounds produced by a partial or complete obstruction of
the air stream by any of the various constrictions of the speech organs,
e.g., the m of mother and the ch of chew.
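As referenced above, the fundamental frequency of a voiced sound can be estimated in several ways; the snippet below is a minimal sketch using the autocorrelation peak, assuming NumPy, a reasonably clean voiced signal, and a search range of 50 to 400 Hz. The `estimate_f0` helper is a hypothetical name for this illustration.

```python
import numpy as np

def estimate_f0(signal, sample_rate, f_min=50.0, f_max=400.0):
    """Estimate the fundamental frequency from the autocorrelation peak."""
    signal = signal - signal.mean()
    # Keep only the non-negative lags of the autocorrelation.
    corr = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    lag_min = int(sample_rate / f_max)  # shortest period considered
    lag_max = int(sample_rate / f_min)  # longest period considered
    best_lag = lag_min + np.argmax(corr[lag_min:lag_max])
    return sample_rate / best_lag

# Quick check on a synthetic 120 Hz "voiced" tone.
sr = 16000
t = np.linspace(0.0, 0.5, int(sr * 0.5), endpoint=False)
voiced = np.sin(2.0 * np.pi * 120.0 * t)
print(round(estimate_f0(voiced, sr), 1))  # close to 120
```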
Speech Analysis:
Speech analysis/input deals with the following research areas:
(1) Who?
- Human speech has certain characteristics determined by the speaker. Speech
analysis can therefore serve to determine who is speaking, i.e., to recognize a
speaker for identification and verification.
(2) What?
- Another main task of speech analysis is to determine what has been said, i.e., to
recognize and understand the speech signal itself.
(3) How?
- A further area of speech analysis researches speech patterns with respect to how
a certain statement was said.
Figure: Speech recognition system: task division into system components, using the
basic principle of "data reduction through property extraction"
Speech Transmission:
- The area of speech transmission deals with efficient coding of the speech
signal to allow speech/sound transmission at low transmission rates over
networks.
- The goal is to provide the receiver with the same speech/sound quality as was
generated at the sender side.
Some Techniques for Speech Transmission:
(1) Pulse Code Modulation:
A straightforward technique for digitizing an analog signal is pulse code
modulation (PCM). It meets the quality demands of stereo audio signals at the data
rate used for CDs: with 44.1 kHz sampling, 16-bit samples, and two channels, the
rate is 44,100 × 2 × 2 = 176,400 bytes/s.
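The 176,400 bytes/s figure follows directly from the CD parameters; a quick check in Python:

```python
# CD-quality PCM: 44.1 kHz sampling, 16-bit samples, 2 channels (stereo).
sample_rate_hz = 44100
bits_per_sample = 16
channels = 2

bytes_per_second = sample_rate_hz * (bits_per_sample // 8) * channels
print(bytes_per_second)  # 176400 bytes/s, matching the figure in the text
```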
(2) Source Encoding:
Figure: Components of a speech transmission system using source encoding
In source encoding, the achievable transmission rate depends on the original signal:
it must have certain characteristics that can be exploited in compression.
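One classic technique of this kind (not named in the notes, added here only as an illustration) is mu-law companding, used in G.711 telephony: it exploits the fact that small amplitudes dominate speech by spending more of an 8-bit code range on them. A minimal sketch, assuming NumPy and input samples normalized to [-1, 1]:

```python
import numpy as np

MU = 255.0  # compression parameter of the mu-law curve

def mulaw_encode(x):
    """Compress amplitudes logarithmically, then quantize to 8 bits."""
    y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
    return np.round((y + 1.0) / 2.0 * 255.0).astype(np.uint8)

def mulaw_decode(codes):
    """Invert the 8-bit quantization and the companding curve."""
    y = codes.astype(np.float64) / 255.0 * 2.0 - 1.0
    return np.sign(y) * (np.power(1.0 + MU, np.abs(y)) - 1.0) / MU

x = np.array([-0.5, -0.01, 0.0, 0.01, 0.5])
print(mulaw_decode(mulaw_encode(x)))  # close to x, using only 8 bits/sample
```

At telephone quality (8 kHz sampling, 8 bits), this yields 8,000 bytes/s, far below the 176,400 bytes/s of CD-quality PCM.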
(3) Recognition-Synthesis Method:
Figure: Components of a recognition-synthesis system for speech transmission
This method performs speech analysis at the sender and speech synthesis during
reconstruction at the receiver: the recognized speech elements are characterized by
a few bits and transmitted over the multimedia system. The data rate defines the
quality.
Example:
Calculate the file size in bytes for a 60-second recording at 44.1 kHz, 8-bit
resolution, stereo sound.
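A worked solution: file size = sample rate × bytes per sample × channels × duration.

```python
# 60 s recording, 44.1 kHz sampling, 8-bit (1 byte) samples, 2 channels (stereo).
sample_rate_hz = 44100
bits_per_sample = 8
channels = 2
duration_s = 60

size_bytes = sample_rate_hz * (bits_per_sample // 8) * channels * duration_s
print(size_bytes)                  # 5292000 bytes
print(size_bytes / (1024 * 1024))  # about 5.05 MB
```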
Sound Types and Their Number of Channels