About
Research Engineer at Novumind focusing on speech applications and hardware-friendly Deep…
Activity
-
Almost 18 years. I left Amazon on 2/25/2025. What stands out the most during that long tenure are the people - my colleagues, leaders and friends. I…
Liked by Di He
-
My Amazon journey started with Alexa, and today I could not be more excited for Alexa+—our next generation assistant powered by generative AI. Built…
Liked by Di He
-
I will leave FAIR / Meta in two weeks. When I joined FAIR ten years ago, it was a small team of a few dozen brilliant people and it was a privilege…
Liked by Di He
Experience
Education
Publications
-
Turn-taking and backchannel prediction with acoustic and large language model fusion
ICASSP
We propose an approach for continuous prediction of turn-taking and backchanneling locations in spoken dialogue by fusing a neural acoustic model with a large language model (LLM). Experiments on the Switchboard human-human conversation dataset demonstrate that our approach consistently outperforms single-modality baseline models. We also develop a novel multi-task instruction fine-tuning strategy to further benefit from LLM-encoded knowledge for understanding the tasks and conversational contexts, leading to additional improvements. Our approach demonstrates the potential of combining LLMs and acoustic models for more natural, conversational interaction between humans and speech-enabled AI agents.
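As an illustration only (not the paper's implementation), the sketch below shows one way to late-fuse frame-level acoustic embeddings with an LLM-derived context embedding for continuous turn-taking/backchannel prediction; all module names, dimensions, and the three-class output are assumptions.

```python
# Hypothetical sketch of acoustic + LLM late fusion for frame-level
# turn-taking / backchannel prediction (dimensions and classes assumed).
import torch
import torch.nn as nn

class FusionTurnTaking(nn.Module):
    def __init__(self, acoustic_dim=256, llm_dim=4096, hidden=256, n_classes=3):
        super().__init__()
        self.acoustic_proj = nn.Linear(acoustic_dim, hidden)
        self.llm_proj = nn.Linear(llm_dim, hidden)
        self.classifier = nn.Sequential(
            nn.ReLU(),
            nn.Linear(2 * hidden, n_classes),  # e.g. continue / take turn / backchannel
        )

    def forward(self, acoustic_frames, llm_context):
        # acoustic_frames: (B, T, acoustic_dim) streaming acoustic-encoder outputs
        # llm_context:     (B, llm_dim) pooled LLM embedding of the dialogue so far
        a = self.acoustic_proj(acoustic_frames)                   # (B, T, H)
        t = self.llm_proj(llm_context).unsqueeze(1).expand_as(a)  # broadcast over time
        return self.classifier(torch.cat([a, t], dim=-1))         # (B, T, n_classes)

logits = FusionTurnTaking()(torch.randn(2, 100, 256), torch.randn(2, 4096))
print(logits.shape)  # torch.Size([2, 100, 3])
```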
-
Two-Pass Endpoint Detection for Speech Recognition
ASRU
Endpoint (EP) detection is a key component of far-field speech recognition systems that assist the user through voice commands. The endpoint detector has to trade off accuracy against latency, since waiting longer reduces the cases of users being cut off early. We propose a novel two-pass solution for endpointing, where the utterance endpoint detected by a first-pass endpointer is verified by a second-pass model termed the EP Arbitrator. Our method improves the trade-off between early cut-offs and latency over a baseline endpointer, as tested on datasets including voice-assistant transactional queries, conversational speech, and the public SLURP corpus. We demonstrate that our method shows improvements regardless of the first-pass EP model used.
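As a rough sketch of the two-pass idea only (the thresholds, scorers, and names are assumptions, not the paper's models), the candidate endpoint from the first pass is either accepted or deferred by the arbitrator:

```python
# Hypothetical two-pass endpointing flow: a second-pass arbitrator verifies
# the candidate endpoint proposed by the first-pass endpointer.
from typing import Callable, Sequence

def two_pass_endpoint(frames: Sequence,
                      first_pass_prob: Callable[[Sequence], float],
                      arbitrator_prob: Callable[[Sequence], float],
                      ep_threshold: float = 0.8,
                      arb_threshold: float = 0.5) -> bool:
    """Return True if the utterance should be finalized at the current frame."""
    if first_pass_prob(frames) < ep_threshold:
        return False  # first pass has not proposed an endpoint yet
    # A candidate endpoint exists: the arbitrator re-checks it with wider context,
    # which can reduce early cut-offs without adding latency when no candidate exists.
    return arbitrator_prob(frames) >= arb_threshold
```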
-
Personalized Predictive ASR for Latency Reduction in Voice Assistants
Interspeech
Streaming Automatic Speech Recognition (ASR) in voice assistants can utilize prefetching to partially hide the latency of response generation. Prefetching involves passing a preliminary ASR hypothesis to downstream systems in order to prefetch and cache a response. If the final ASR hypothesis after endpoint detection matches the preliminary one, the cached response can be delivered to the user, thus saving latency. In this paper, we extend this idea by introducing predictive automatic speech recognition, where we predict the full utterance from a partially observed utterance, and prefetch the response based on the predicted utterance. We introduce two personalization approaches and investigate the tradeoff between potential latency gains from successful predictions and the cost increase from failed predictions. We evaluate our methods on an internal voice assistant dataset as well as the public SLURP dataset.
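A minimal sketch of the prefetch-and-verify pattern described above (the predictor and response generator are hypothetical callables, not the production system):

```python
# Hypothetical prefetching with predictive ASR: speculatively generate a response
# for the predicted full utterance, and serve it only if the final hypothesis matches.
class PrefetchCache:
    def __init__(self, predict_full_utterance, generate_response):
        self.predict = predict_full_utterance  # e.g. a personalized completion model
        self.generate = generate_response      # expensive downstream response generation
        self.cache = {}

    def on_partial_hypothesis(self, partial_text: str) -> None:
        predicted = self.predict(partial_text)
        if predicted not in self.cache:
            self.cache[predicted] = self.generate(predicted)  # speculative (possibly wasted) work

    def on_final_hypothesis(self, final_text: str):
        if final_text in self.cache:
            return self.cache.pop(final_text)  # hit: generation latency was hidden
        return self.generate(final_text)       # miss: the cost of a failed prediction
```

The trade-off investigated in the paper shows up directly here: every wrong prediction is a wasted generation call, while every correct one removes the generation step from the user-perceived latency.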
-
VADOI: Voice-Activity-Detection Overlapping Inference for End-to-End Long-Form Speech Recognition
ICASSP
While end-to-end models have shown great success on the Automatic Speech Recognition task, performance degrades severely when target sentences are long-form. The previously proposed methods, overlapping inference and partial overlapping inference, have been shown to be effective for long-form decoding. For both methods, word error rate (WER) decreases monotonically as the overlapping percentage increases; setting aside computational cost, the setup with 50% overlapping during inference achieves the best performance. However, a lower overlapping percentage has the advantage of faster inference. In this paper, we first conduct comprehensive experiments comparing overlapping inference and partial overlapping inference with various configurations. We then propose Voice-Activity-Detection Overlapping Inference to provide a trade-off between WER and computational cost.
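As a conceptual sketch only (the VAD, threshold, and lengths are assumptions, not the paper's algorithm), VAD-guided segmentation prefers to cut long audio at pauses so that little or no overlap is needed:

```python
# Hypothetical VAD-guided chunking for long-form decoding: cut at low
# speech-probability frames when possible, otherwise fall back to overlap.
import numpy as np

def segment_long_audio(speech_prob: np.ndarray, max_len: int, overlap: int):
    """speech_prob: per-frame VAD speech probability; returns (start, end) frame spans."""
    assert 0 < overlap < max_len
    spans, start, n = [], 0, len(speech_prob)
    while start < n:
        end = min(start + max_len, n)
        if end < n:
            quiet = np.where(speech_prob[start:end] < 0.3)[0]  # candidate pause frames
            if len(quiet) > 0:
                end = start + int(quiet[-1]) + 1  # cut at the last pause in the window
                next_start = end                  # no overlap needed at a pause
            else:
                next_start = end - overlap        # no pause found: overlap the chunks
        else:
            next_start = n
        spans.append((start, end))
        start = next_start
    return spans
```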
-
Wav2vec-C: A Self-supervised Model for Speech Representation Learning
Interspeech
Wav2vec-C introduces a novel representation learning technique combining elements from wav2vec 2.0 and VQ-VAE. Our model learns to reproduce quantized representations from partially masked speech encodings using a contrastive loss, in a way similar to wav2vec 2.0. However, the quantization process is regularized by an additional consistency network that learns to reconstruct the input features of the wav2vec 2.0 network from the quantized representations, in a way similar to a VQ-VAE model. The proposed self-supervised model is trained on 10k hours of unlabeled data and subsequently used as the speech encoder in an RNN-T ASR model and fine-tuned with 1k hours of labeled data. This work is one of only a few studies of self-supervised learning on speech tasks with a large volume of real far-field labeled data. The Wav2vec-C encoded representations achieve, on average, twice the error reduction of the baseline and a higher codebook utilization than wav2vec 2.0.
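A conceptual sketch of the combined objective as described above (the encoders, quantizer, and weighting are omitted or assumed; this is not the released model):

```python
# Hypothetical Wav2vec-C-style loss: a wav2vec 2.0-style contrastive term plus a
# VQ-VAE-style consistency term that reconstructs input features from the codes.
import torch
import torch.nn.functional as F

def wav2vec_c_loss(context_at_masked, quantized_targets, negatives,
                   reconstructed_features, input_features,
                   temperature=0.1, lambda_consistency=1.0):
    # Contrastive term: the context vector at each masked position should be closer
    # to its own quantized target than to the K distractor (negative) codes.
    pos = F.cosine_similarity(context_at_masked, quantized_targets, dim=-1)       # (N,)
    neg = F.cosine_similarity(context_at_masked.unsqueeze(1), negatives, dim=-1)  # (N, K)
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1) / temperature              # (N, 1+K)
    contrastive = F.cross_entropy(logits, torch.zeros(len(logits), dtype=torch.long))
    # Consistency term: the quantized codes must retain enough information to
    # reconstruct the original input features, regularizing the codebook.
    consistency = F.mse_loss(reconstructed_features, input_features)
    return contrastive + lambda_consistency * consistency
```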
-
Acoustic landmarks contain more information about the phone string than other frames for automatic speech recognition with deep neural network acoustic model
The Journal of the Acoustical Society of America
Most mainstream automatic speech recognition (ASR) systems consider all feature frames equally important. However, acoustic landmark theory is based on the contrary idea that some frames are more important than others. Acoustic landmark theory exploits quantal nonlinearities in the articulatory-acoustic and acoustic-perceptual relations to define landmark times at which the speech spectrum abruptly changes or reaches an extremum; frames overlapping landmarks have been demonstrated to be sufficient for speech perception. In this work, experiments are conducted on the TIMIT corpus, with both Gaussian mixture model (GMM) and deep neural network (DNN)-based ASR systems, and it is found that frames containing landmarks are more informative for ASR than others. It is discovered that altering the level of emphasis on landmarks by re-weighting acoustic likelihoods tends to reduce the phone error rate (PER). Furthermore, by leveraging the landmarks as a heuristic, one of the hybrid DNN frame-dropping strategies maintained a PER within 0.44% of optimal when scoring less than half (45.8%, to be precise) of the frames. This hybrid strategy outperforms other non-heuristic-based methods and demonstrates the potential of landmarks for reducing computation.
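A toy sketch of the re-weighting and frame-dropping ideas (the weight values, landmark mask, and subsampling rate are assumptions, not the paper's settings):

```python
# Hypothetical landmark-based re-weighting / frame dropping of acoustic scores.
import numpy as np

def reweight_loglikes(frame_loglikes: np.ndarray, landmark_mask: np.ndarray,
                      landmark_weight: float = 1.5, other_weight: float = 1.0):
    """frame_loglikes: (T, n_states) log-likelihoods; landmark_mask: (T,) bool."""
    weights = np.where(landmark_mask, landmark_weight, other_weight)  # per-frame weight
    return frame_loglikes * weights[:, None]  # scaled scores passed on to the decoder

def hybrid_frame_selection(frame_loglikes, landmark_mask, keep_every=2):
    """Hybrid dropping sketch: always score landmark frames plus a regular subsample of the rest."""
    keep = landmark_mask.copy()
    keep[::keep_every] = True
    return frame_loglikes[keep], keep
```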
-
Improved ASR for Under-Resourced Languages Through Multi-Task Learning with Acoustic Landmarks
Interspeech
Furui first demonstrated that the identity of both consonant and vowel can be perceived from the C-V transition; later, Stevens proposed that acoustic landmarks are the primary cues for speech perception, and that steady-state regions are secondary or supplemental. Acoustic landmarks are perceptually salient, even in a language one doesn't speak, and it has been demonstrated that non-speakers of the language can identify features such as the primary articulator of the landmark. These factors suggest a strategy for developing language-independent automatic speech recognition: landmarks can potentially be learned once from a suitably labeled corpus and rapidly applied to many other languages. This paper proposes enhancing the cross-lingual portability of a neural network by using landmarks as the secondary task in multi-task learning (MTL). The network is trained in a well-resourced source language with both phone and landmark labels (English), then adapted to an under-resourced target language with only word labels (Iban). Landmark-tasked MTL reduces source-language phone error rate by 2.9% relative, and reduces target-language word error rate by 1.9%-5.9% depending on the amount of target-language training data. These results suggest that landmark-tasked MTL causes the DNN to learn hidden-node features that are useful for cross-lingual adaptation.
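A minimal sketch of landmark-tasked multi-task learning (the architecture, label inventories, and task weight are assumptions, not the paper's network):

```python
# Hypothetical shared encoder with a primary phone head and a secondary landmark head.
import torch.nn as nn
import torch.nn.functional as F

class LandmarkMTL(nn.Module):
    def __init__(self, feat_dim=40, hidden=512, n_phones=48, n_landmarks=7, alpha=0.3):
        super().__init__()
        self.alpha = alpha
        self.shared = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, hidden), nn.ReLU())
        self.phone_head = nn.Linear(hidden, n_phones)        # primary task
        self.landmark_head = nn.Linear(hidden, n_landmarks)  # secondary (auxiliary) task

    def loss(self, feats, phone_labels, landmark_labels):
        h = self.shared(feats)
        phone_loss = F.cross_entropy(self.phone_head(h), phone_labels)
        landmark_loss = F.cross_entropy(self.landmark_head(h), landmark_labels)
        # The landmark task shapes the shared features; it is these features that
        # transfer when the phone/word head is re-trained on the target language.
        return phone_loss + self.alpha * landmark_loss
```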
-
Machine learning on FPGAs to face the IoT revolution
ICCAD
FPGAs have been rapidly adopted for acceleration of Deep Neural Networks (DNNs) with improved latency and energy efficiency compared to CPU and GPU-based implementations. High-level synthesis (HLS) is an effective design flow for DNNs due to improved productivity, debugging, and design space exploration ability. However, optimizing large neural networks under resource constraints for FPGAs is still a key challenge. In this paper, we present a series of effective design techniques for implementing DNNs on FPGAs with high performance and energy efficiency. These include the use of configurable DNN IPs, performance and resource modeling, resource allocation across DNN layers, and DNN reduction and re-training. We showcase several design solutions including Long-term Recurrent Convolution Network (LRCN) for video captioning, Inception module for FaceNet face recognition, as well as Long Short-Term Memory (LSTM) for sound recognition. These and other similar DNN solutions are ideal implementations to be deployed in vision or sound based IoT applications.
-
Using Approximated Auditory Roughness as a Pre-filtering Feature for Human Screaming and Affective Speech AED
Interspeech
Detecting human screaming, shouting, and other verbal manifestations of fear and anger is of great interest to security Audio Event Detection (AED) systems. The Internet of Things (IoT) approach allows wide-covering, powerful AED systems to be distributed across the Internet, but a good feature to pre-filter the audio is critical to these systems. This work evaluates the potential of detecting screaming and affective speech using Auditory Roughness and proposes a very light-weight approximation method. Our approximation uses a similar amount of Multiple Add Accumulate (MAA) operations compared to short-term energy (STE), and at least 10× fewer MAA than MFCC. We evaluated the performance of our approximated roughness on the Mandarin Affective Speech corpus and a subset of the YouTube AudioSet for screaming against other low-complexity features. We show that our approximated roughness returns higher accuracy.
-
Selecting frames for automatic speech recognition based on acoustic landmarks
The Journal of the Acoustical Society of America
Most mainstream Mel-frequency cepstral coefficient (MFCC) based Automatic Speech Recognition (ASR) systems consider all feature frames equally important. However, acoustic landmark theory disagrees with this idea. Acoustic landmark theory exploits the quantal non-linear articulatory-acoustic relationships from human speech perception experiments and provides a theoretical basis for extracting acoustic features in the vicinity of landmark regions, where an abrupt change occurs in the spectrum of speech signals. In this work, we conducted experiments on the TIMIT corpus, with both GMM- and DNN-based ASR systems, and found that frames containing landmarks are more informative than others during the recognition process. We proved that altering the level of emphasis on landmark and non-landmark frames, through re-weighting or removing frame acoustic likelihoods accordingly, can change the phone error rate (PER) of the ASR system in a way dramatically different from making similar changes to random frames. Furthermore, by leveraging landmarks as a heuristic, one of our hybrid DNN frame-dropping strategies achieved a PER increase of only 0.44% while scoring less than half (41.2%, to be precise) of the frames. This hybrid strategy outperforms other non-heuristic-based methods and demonstrates the potential of landmarks for reducing ASR computation.
Courses
-
Advanced Digital Signal Processing
ECE551
-
Applied Parallel Programming
ECE408
-
Data Mining Principles
CS 512
-
Distributed Systems
CS425
-
Intro to Data Mining
CS412
-
Multimedia Signal Processing
ECE417
-
Non-linear & Adaptive Control
ECE517
-
Pattern Recognition
ECE544
-
Random Processes
ECE534
Projects
-
Augmenting Input Method Language Model with User Location Type Information
-
This folder open-sources the work I did for the CS 512 (Data Mining Principles) final project at the University of Illinois at Urbana-Champaign in Spring 2017.
The project focuses on improving input methods' ability to predict the user's next word by leveraging geo-temporal information. The statistical model used for word prediction is a hybrid Deep Neural Network (DNN) with projection layers (fully connected layers) and bi-directional LSTM layers. The model also leverages word embeddings, using embedding results from GloVe (https://nlp.stanford.edu/projects/glove/).
The model is evaluated on Twitter data collected over two weeks from the Twitter API using Tweepy (http://www.tweepy.org/). The geographic information for these tweets was collected using the Google Places API (https://developers.google.com/places/). More details on the project can be found in the augmenting-input-method.pdf included with the repository.
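A rough sketch of the idea (vocabulary size, dimensions, and location categories are assumptions, and a single uni-directional LSTM stands in for the project's projection + bi-directional LSTM stack):

```python
# Hypothetical location-aware next-word prediction: a learned location-type
# embedding is concatenated to the word embedding at every time step.
import torch
import torch.nn as nn

class GeoAwareLM(nn.Module):
    def __init__(self, vocab_size=20000, n_location_types=10,
                 word_dim=100, loc_dim=16, hidden=256):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)      # could be initialized from GloVe
        self.loc_emb = nn.Embedding(n_location_types, loc_dim)  # e.g. cafe / stadium / airport
        self.rnn = nn.LSTM(word_dim + loc_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, word_ids, location_type_id):
        # word_ids: (B, T) tokens typed so far; location_type_id: (B,) place category
        w = self.word_emb(word_ids)                                        # (B, T, word_dim)
        l = self.loc_emb(location_type_id).unsqueeze(1).expand(-1, w.size(1), -1)
        h, _ = self.rnn(torch.cat([w, l], dim=-1))
        return self.out(h)                                                 # (B, T, vocab) next-word logits
```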
Honors & Awards
-
2017 Design Automation Conference (DAC) Hardware Design Contest 1st Place
Design Automation Conference
Low Power, Low Cost Audio Based Security IoT System
Languages
-
English
Full professional proficiency
-
Chinese
Native or bilingual proficiency
More activity by Di
-
Through collaboration with researchers of economics, we evaluated ChatGPT, DeepSeek and other LLMs for their strategic reasoning capabilities. When…
Liked by Di He
-
Hi all, My team in Bellevue WA is continuing to look for Applied Scientists to join our team. We have ONE senior position and SEVERAL intern…
Liked by Di He
-
After 2 years of hard work, I am thrilled to share our NeurIPS 2024 paper “Condition-Aware Self-Supervised Learning Representation for General Speech…
Liked by Di He
-
We presented two papers at NeurIPS’24. One is titled “SnapKV: LLM Knows What You are Looking for Before Generation” and another “Decision-Making…
Liked by Di He
-
🔶 Many leading AI solutions are based on large language models (LLMs) that are expensive to run—in part because of inefficient use of computational…
Liked by Di He