Di He

Kirkland, Washington, United States
656 followers · 500+ connections

About

Research Engineer at Novumind focusing on speech applications and hardware-friendly Deep…

Experience

  • Amazon

    Greater Seattle Area

  • Greater Seattle Area

  • Santa Clara

  • Santa Clara

  • Urbana-Champaign, Illinois Area

  • Urbana-Champaign, Illinois Area

  • Greater Chicago Area

  • Austin, Texas Area

  • Austin, Texas Area

Education

Publications

  • Turn-taking and backchannel prediction with acoustic and large language model fusion

    ICASSP

    We propose an approach for continuous prediction of turn-taking and backchanneling locations in spoken dialogue by fusing a neural acoustic model with a large language model (LLM). Experiments on the Switchboard human-human conversation dataset demonstrate that our approach consistently outperforms single-modality baseline models. We also develop a novel multi-task instruction fine-tuning strategy to further benefit from LLM-encoded knowledge for understanding the tasks and conversational contexts, leading to additional improvements. Our approach demonstrates the potential of combined LLMs and acoustic models for a more natural and conversational interaction between humans and speech-enabled AI agents.

    (A toy sketch of this acoustic + LLM fusion appears at the end of the publications list.)

  • Two-Pass Endpoint Detection for Speech Recognition

    ASRU

    Endpoint (EP) detection is a key component of far-field speech recognition systems that assist the user through voice commands. The endpoint detector has to trade off between accuracy and latency, since waiting longer reduces the cases of users being cut off early. We propose a novel two-pass solution for endpointing, where the utterance endpoint detected by a first-pass endpointer is verified by a second-pass model termed the EP Arbitrator. Our method improves the trade-off between early cut-offs and latency over a baseline endpointer, as tested on datasets including voice-assistant transactional queries, conversational speech, and the public SLURP corpus. We demonstrate that our method shows improvements regardless of the first-pass EP model used.

    (A toy sketch of this two-pass control flow appears at the end of the publications list.)

  • Personalized Predictive ASR for Latency Reduction in Voice Assistants

    Interspeech

    Streaming Automatic Speech Recognition (ASR) in voice assistants can utilize prefetching to partially hide the latency of response generation. Prefetching involves passing a preliminary ASR hypothesis to downstream systems in order to prefetch and cache a response. If the final ASR hypothesis after endpoint detection matches the preliminary one, the cached response can be delivered to the user, thus saving latency. In this paper, we extend this idea by introducing predictive automatic speech recognition, where we predict the full utterance from a partially observed utterance and prefetch the response based on the predicted utterance. We introduce two personalization approaches and investigate the trade-off between potential latency gains from successful predictions and the cost increase from failed predictions. We evaluate our methods on an internal voice assistant dataset as well as the public SLURP dataset.

    (A toy sketch of the prefetch-and-cache logic appears at the end of the publications list.)

  • VADOI: Voice-activity-detection overlapping inference for end-to-end long-form speech recognition

    ICASSP

    While end-to-end models have shown great success on the Automatic Speech Recognition task, performance degrades severely when target sentences are long-form. The previously proposed methods, overlapping inference and partial overlapping inference, have been shown to be effective on long-form decoding. For both methods, word error rate (WER) decreases monotonically as the overlapping percentage increases. Setting aside computational cost, the setup with 50% overlapping during inference can achieve the best performance. However, a lower overlapping percentage has the advantage of faster inference. In this paper, we first conduct comprehensive experiments comparing overlapping inference and partial overlapping inference with various configurations. We then propose Voice-Activity-Detection Overlapping Inference to provide a trade-off between WER and computation cost.

  • Wav2vec-C: A Self-supervised Model for Speech Representation Learning

    Interspeech

    Wav2vec-C introduces a novel representation learning technique combining elements from wav2vec 2.0 and VQ-VAE. Our model learns to reproduce quantized representations from partially masked speech encoding using a contrastive loss in a way similar to wav2vec 2.0. However, the quantization process is regularized by an additional consistency network that learns to reconstruct the input features to the wav2vec 2.0 network from the quantized representations in a way similar to a VQ-VAE model. The proposed self-supervised model is trained on 10k hours of unlabeled data and subsequently used as the speech encoder in an RNN-T ASR model and fine-tuned with 1k hours of labeled data. This work is one of only a few studies of self-supervised learning on speech tasks with a large volume of real far-field labeled data. The Wav2vec-C encoded representations achieve, on average, twice the error reduction over the baseline and a higher codebook utilization in comparison to wav2vec 2.0.

  • Acoustic landmarks contain more information about the phone string than other frames for automatic speech recognition with deep neural network acoustic model

    The Journal of the Acoustical Society of America

    Most mainstream automatic speech recognition (ASR) systems consider all feature frames equally important. However, acoustic landmark theory is based on the contradictory idea that some frames are more important than others. Acoustic landmark theory exploits quantal nonlinearities in the articulatory-acoustic and acoustic-perceptual relations to define landmark times at which the speech spectrum abruptly changes or reaches an extremum; frames overlapping landmarks have been demonstrated to be sufficient for speech perception. In this work, experiments are conducted on the TIMIT corpus, with both Gaussian mixture model (GMM) and deep neural network (DNN)-based ASR systems, and it is found that frames containing landmarks are more informative for ASR than others. It is discovered that altering the level of emphasis on landmarks by re-weighting acoustic likelihood tends to reduce the phone error rate (PER). Furthermore, by leveraging the landmark as a heuristic, one of the hybrid DNN frame dropping strategies maintained a PER within 0.44% of optimal when scoring less than half (45.8%, to be precise) of the frames. This hybrid strategy outperforms other non-heuristic-based methods and demonstrates the potential of landmarks for reducing computation.

    (A toy illustration of landmark-based frame weighting and dropping appears at the end of the publications list.)

  • Improved ASR for Under-Resourced Languages Through Multi-Task Learning with Acoustic Landmarks

    Interspeech

    Furui first demonstrated that the identity of both consonant and vowel can be perceived from the C-V transition; later, Stevens proposed that acoustic landmarks are the primary cues for speech perception, and that steady-state regions are secondary or supplemental. Acoustic landmarks are perceptually salient, even in a language one doesn't speak, and it has been demonstrated that non-speakers of the language can identify features such as the primary articulator of the landmark. These factors suggest a strategy for developing language-independent automatic speech recognition: landmarks can potentially be learned once from a suitably labeled corpus and rapidly applied to many other languages. This paper proposes enhancing the cross-lingual portability of a neural network by using landmarks as the secondary task in multi-task learning (MTL). The network is trained in a well-resourced source language with both phone and landmark labels (English), then adapted to an under-resourced target language with only word labels (Iban). Landmark-tasked MTL reduces source-language phone error rate by 2.9% relative, and reduces target-language word error rate by 1.9%-5.9% depending on the amount of target-language training data. These results suggest that landmark-tasked MTL causes the DNN to learn hidden-node features that are useful for cross-lingual adaptation.

  • Machine learning on FPGAs to face the IoT revolution

    ICCAD

    FPGAs have been rapidly adopted for acceleration of Deep Neural Networks (DNNs) with improved latency and energy efficiency compared to CPU and GPU-based implementations. High-level synthesis (HLS) is an effective design flow for DNNs due to improved productivity, debugging, and design space exploration ability. However, optimizing large neural networks under resource constraints for FPGAs is still a key challenge. In this paper, we present a series of effective design techniques for implementing DNNs on FPGAs with high performance and energy efficiency. These include the use of configurable DNN IPs, performance and resource modeling, resource allocation across DNN layers, and DNN reduction and re-training. We showcase several design solutions including Long-term Recurrent Convolution Network (LRCN) for video captioning, Inception module for FaceNet face recognition, as well as Long Short-Term Memory (LSTM) for sound recognition. These and other similar DNN solutions are ideal implementations to be deployed in vision or sound based IoT applications.

  • Using Approximated Auditory Roughness as a Pre-filtering Feature for Human Screaming and Affective Speech AED

    Interspeech

    Detecting human screaming, shouting, and other verbal manifestations of fear and anger is of great interest to security Audio Event Detection (AED) systems. The Internet of Things (IoT) approach allows wide-covering, powerful AED systems to be distributed across the Internet, but a good feature to pre-filter the audio is critical to these systems. This work evaluates the potential of detecting screaming and affective speech using Auditory Roughness and proposes a very lightweight approximation method. Our approximation uses a similar amount of Multiple Add Accumulate (MAA) operations compared to short-term energy (STE), and at least 10× less MAA than MFCC. We evaluated the performance of our approximated roughness on the Mandarin Affective Speech corpus and a subset of the YouTube AudioSet for screaming against other low-complexity features. We show that our approximated roughness returns higher accuracy.

  • Selecting frames for automatic speech recognition based on acoustic landmarks

    The Journal of the Acoustical Society of America

    Most mainstream Mel-frequency cepstral coefficient (MFCC) based Automatic Speech Recognition (ASR) systems consider all feature frames equally important. However, acoustic landmark theory disagrees with this idea. Acoustic landmark theory exploits the quantal non-linear articulatory-acoustic relationships from human speech perception experiments and provides a theoretical basis for extracting acoustic features in the vicinity of landmark regions, where an abrupt change occurs in the spectrum of speech signals. In this work, we conducted experiments on the TIMIT corpus with both GMM- and DNN-based ASR systems and found that frames containing landmarks are more informative than others during the recognition process. We proved that altering the level of emphasis on landmark and non-landmark frames, through re-weighting or removing frame acoustic likelihoods accordingly, can change the phone error rate (PER) of the ASR system in a way dramatically different from making similar changes to random frames. Furthermore, by leveraging the landmark as a heuristic, one of our hybrid DNN frame dropping strategies incurred a PER increase of only 0.44% while scoring less than half (41.2%, to be precise) of the frames. This hybrid strategy outperforms other non-heuristic-based methods and demonstrates the potential of landmarks for reducing computation in ASR.
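
The two JASA entries above both turn on the same mechanism: up-weight frames near acoustic landmarks and/or drop non-landmark frames before scoring. Below is a minimal, hypothetical numpy illustration of those two operations; the random landmark mask, the weights, and the subsampling rate are placeholders, not the papers' detectors or settings.

```python
import numpy as np

def reweight_loglikes(log_likes, landmark_mask, landmark_weight=1.2, other_weight=0.8):
    """Scale per-frame acoustic log-likelihoods so frames flagged as landmarks
    contribute more during decoding (toy illustration only)."""
    weights = np.where(landmark_mask, landmark_weight, other_weight)
    return log_likes * weights[:, None]

def drop_non_landmark_frames(log_likes, landmark_mask, keep_every=2):
    """Keep every landmark frame and subsample the remaining frames."""
    keep = landmark_mask.copy()
    keep[::keep_every] = True        # retain a regular subset of non-landmark frames too
    return log_likes[keep]

rng = np.random.default_rng(0)
log_likes = rng.normal(size=(200, 40))   # frames x acoustic-state log-likelihoods
landmarks = rng.random(200) < 0.2        # placeholder for a real landmark detector
reduced = drop_non_landmark_frames(reweight_loglikes(log_likes, landmarks), landmarks)
print(reduced.shape)                     # only a subset of the frames remains to be scored
```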

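The turn-taking and backchannel prediction entry above fuses a neural acoustic model with an LLM. The sketch below shows one simple way such a fusion can be wired up: log-linear interpolation of per-frame event posteriors. Both "models" here are random stand-ins and the fusion weight is an assumed free parameter; this is not the paper's architecture or training recipe.

```python
import numpy as np

# Toy late fusion of per-frame acoustic posteriors with LLM-derived posteriors
# over three dialogue events. Only the log-linear fusion step is the point here.
EVENTS = ["continue", "turn_take", "backchannel"]

def acoustic_posteriors(num_frames, rng):
    """Stand-in for a neural acoustic model: per-frame class posteriors."""
    logits = rng.normal(size=(num_frames, len(EVENTS)))
    return np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

def llm_posteriors(num_frames, rng):
    """Stand-in for LLM scores of the dialogue context, broadcast to every frame."""
    logits = rng.normal(size=len(EVENTS))
    probs = np.exp(logits) / np.exp(logits).sum()
    return np.tile(probs, (num_frames, 1))

def fuse(p_acoustic, p_llm, w=0.5):
    """Log-linear interpolation of the two posteriors, renormalized per frame."""
    log_p = w * np.log(p_acoustic + 1e-9) + (1.0 - w) * np.log(p_llm + 1e-9)
    fused = np.exp(log_p)
    return fused / fused.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
fused = fuse(acoustic_posteriors(100, rng), llm_posteriors(100, rng))
print([EVENTS[i] for i in fused.argmax(axis=1)][:10])   # fused per-frame decisions
```

In a real system the interpolation weight would be tuned on held-out dialogue data, or replaced by a learned fusion layer.
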
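The two-pass endpointing entry above verifies candidate endpoints from a fast first pass with a second-pass EP Arbitrator. The sketch below captures only that control flow; the silence-counting first pass, the energy-based arbitrator, and all thresholds are toy stand-ins for the paper's learned models.

```python
from dataclasses import dataclass

@dataclass
class Frame:
    energy: float      # toy acoustic evidence
    is_speech: bool    # toy VAD decision

def first_pass_endpoint(frames, min_trailing_silence=20):
    """Toy first-pass endpointer: propose an endpoint after N consecutive
    non-speech frames."""
    silence = 0
    for i, frame in enumerate(frames):
        silence = 0 if frame.is_speech else silence + 1
        if silence >= min_trailing_silence:
            yield i                      # candidate endpoint index

def ep_arbitrator(frames, candidate, context=50, threshold=0.02):
    """Toy second-pass check standing in for the learned EP Arbitrator:
    accept the candidate only if recent energy suggests the user is done."""
    recent = frames[max(0, candidate - context): candidate + 1]
    return sum(f.energy for f in recent) / len(recent) < threshold

def detect_endpoint(frames):
    for candidate in first_pass_endpoint(frames):
        if ep_arbitrator(frames, candidate):
            return candidate             # verified endpoint: close the utterance
    return None                          # otherwise keep listening

audio = [Frame(0.5, True)] * 30 + [Frame(0.0, False)] * 60
print(detect_endpoint(audio))            # index of the first verified endpoint
```
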
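The personalized predictive ASR entry above builds on prefetching: predict the full utterance from a partial hypothesis, cache a response for the prediction, and serve it only if the final hypothesis matches. A minimal sketch of that caching logic, with hypothetical callables standing in for the personalized predictor and the (slow) downstream response generator:

```python
def serve_with_prefetch(partial_hypothesis, final_hypothesis,
                        predict_full, generate_response):
    """Toy prefetch-and-cache flow (illustrative only).

    predict_full:      maps a partial transcript to a predicted full utterance
                       (e.g. a personalized language model).
    generate_response: the slow downstream response generator.
    """
    predicted = predict_full(partial_hypothesis)
    cache = {predicted: generate_response(predicted)}   # prefetched while audio continues

    if final_hypothesis in cache:
        return cache[final_hypothesis], "cache_hit"      # latency hidden
    return generate_response(final_hypothesis), "cache_miss"

# Trivial stand-ins for the predictor and the response generator:
reply, status = serve_with_prefetch(
    partial_hypothesis="play the",
    final_hypothesis="play the radio",
    predict_full=lambda partial: partial + " radio",
    generate_response=lambda query: f"Playing: {query}",
)
print(status, "->", reply)    # the prediction matched, so the cached reply is served
```

The trade-off studied in the paper shows up directly here: a cache hit hides the response-generation latency, while a miss wastes the prefetched work.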

Courses

  • Advanced Digital Signal Processing

    ECE551

  • Applied Parallel Programming

    ECE408

  • Data Mining Principles

    CS 512

  • Distributed Systems

    CS425

  • Intro to Data Mining

    CS412

  • Multimedia Signal Processing

    ECE417

  • Non-linear & Adaptive Control

    ECE517

  • Pattern Recognition

    ECE544

  • Random Processes

    ECE534

Projects

  • Augmenting Input Method Language Model with User Location Type Information

    This repository open-sources the work I did for the CS 512 (Data Mining Principles) final project at the University of Illinois at Urbana-Champaign in Spring 2017.

    The project focuses on improving an input method's ability to predict the user's next word by leveraging geo-temporal information. The statistical model used for word prediction is a hybrid Deep Neural Network (DNN) with projection layers (fully connected layers) and bi-directional LSTM layers. The model also leverages word embeddings, using pre-trained embeddings from GloVe (https://nlp.stanford.edu/projects/glove/).

    The model is evaluated on Twitter data collected over two weeks from the Twitter API using Tweepy (http://www.tweepy.org/).

    The geographic information for these tweets was collected using the Google Places API (https://developers.google.com/places/). More details on the project can be found in the augmenting-input-method.pdf included with the repository.

    (A toy sketch of the model architecture follows below.)

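A minimal PyTorch sketch of the kind of model described above: word embeddings plus a location-type embedding feeding a fully connected projection layer and a bi-directional LSTM that predicts the next word. All sizes, the class name, and the location-type vocabulary are assumptions for illustration, and the GloVe initialization used in the project is skipped here.

```python
import torch
import torch.nn as nn

class GeoAwareNextWordModel(nn.Module):
    """Toy next-word predictor: word embeddings plus a location-type embedding
    feed a projection layer and a bi-directional LSTM (illustrative sizes only)."""

    def __init__(self, vocab_size=10000, emb_dim=100, num_location_types=8,
                 loc_dim=16, hidden=128):
        super().__init__()
        # The project initialized word embeddings from GloVe; random init here.
        self.word_emb = nn.Embedding(vocab_size, emb_dim)
        self.loc_emb = nn.Embedding(num_location_types, loc_dim)
        self.project = nn.Sequential(nn.Linear(emb_dim + loc_dim, hidden), nn.ReLU())
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, vocab_size)

    def forward(self, word_ids, location_type):
        # word_ids: (batch, seq_len) token ids; location_type: (batch,) place-type ids
        words = self.word_emb(word_ids)
        place = self.loc_emb(location_type).unsqueeze(1).expand(-1, word_ids.size(1), -1)
        hidden_states, _ = self.lstm(self.project(torch.cat([words, place], dim=-1)))
        return self.out(hidden_states)   # per-position next-word logits

model = GeoAwareNextWordModel()
logits = model(torch.randint(0, 10000, (2, 12)), torch.tensor([3, 5]))
print(logits.shape)                      # (batch, seq_len, vocab_size)
```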

Honors & Awards

  • 2017 Design Automation Conference (DAC) Hardware Design Contest 1st Place

    Design Automation Conference

    Low Power, Low Cost Audio Based Security IoT System

Languages

  • English

    Full professional proficiency

  • Chinese

    Native or bilingual proficiency
