About
Research Engineer at Novumind focusing on speech applications and hardware-friendly Deep…
Activity
-
Almost 18 years. I left Amazon on 2/25/2025. What stands out the most during that long tenure are the people - my colleagues, leaders and friends. I…
Liked by Di He
-
My Amazon journey started with Alexa, and today I could not be more excited for Alexa+—our next generation assistant powered by generative AI. Built…
Liked by Di He
-
I will leave FAIR / Meta in two weeks. When I joined FAIR ten years ago, it was a small team of a few dozen brilliant people and it was a privilege…
Liked by Di He
Experience
Education
Publications
-
Turn-taking and backchannel prediction with acoustic and large language model fusion
ICASSP
We propose an approach for continuous prediction of turn-taking and backchanneling locations in spoken dialogue by fusing a neural acoustic model with a large language model (LLM). Experiments on the Switchboard human-human conversation dataset demonstrate that our approach consistently outperforms single-modality baseline models. We also develop a novel multi-task instruction fine-tuning strategy to further benefit from LLM-encoded knowledge for understanding the tasks and conversational contexts, leading to additional improvements. Our approach demonstrates the potential of combining LLMs and acoustic models for more natural, conversational interaction between humans and speech-enabled AI agents.
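As an illustration only (not the paper's implementation), the sketch below shows one way to late-fuse frame-level acoustic embeddings with an LLM-derived context embedding for continuous turn-taking/backchannel prediction; all module names, dimensions, and the three-class output are assumptions.

```python
# Hypothetical sketch of acoustic + LLM late fusion for frame-level
# turn-taking / backchannel prediction (dimensions and classes assumed).
import torch
import torch.nn as nn

class FusionTurnTaking(nn.Module):
    def __init__(self, acoustic_dim=256, llm_dim=4096, hidden=256, n_classes=3):
        super().__init__()
        self.acoustic_proj = nn.Linear(acoustic_dim, hidden)
        self.llm_proj = nn.Linear(llm_dim, hidden)
        self.classifier = nn.Sequential(
            nn.ReLU(),
            nn.Linear(2 * hidden, n_classes),  # e.g. continue / take turn / backchannel
        )

    def forward(self, acoustic_frames, llm_context):
        # acoustic_frames: (B, T, acoustic_dim) streaming acoustic-encoder outputs
        # llm_context:     (B, llm_dim) pooled LLM embedding of the dialogue so far
        a = self.acoustic_proj(acoustic_frames)                   # (B, T, H)
        t = self.llm_proj(llm_context).unsqueeze(1).expand_as(a)  # broadcast over time
        return self.classifier(torch.cat([a, t], dim=-1))         # (B, T, n_classes)

logits = FusionTurnTaking()(torch.randn(2, 100, 256), torch.randn(2, 4096))
print(logits.shape)  # torch.Size([2, 100, 3])
```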
-
Two-Pass Endpoint Detection for Speech Recognition
ASRU
Endpoint (EP) detection is a key component of far-field speech recognition systems that assist the user through voice commands. The endpoint detector has to trade off accuracy against latency, since waiting longer reduces the cases of users being cut off early. We propose a novel two-pass solution for endpointing, where the utterance endpoint detected by a first-pass endpointer is verified by a second-pass model termed the EP Arbitrator. Our method improves the trade-off between early cut-offs and latency over a baseline endpointer, as tested on datasets including voice-assistant transactional queries, conversational speech, and the public SLURP corpus. We demonstrate that our method shows improvements regardless of the first-pass EP model used.
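As a rough sketch of the two-pass idea only (the thresholds, scorers, and names are assumptions, not the paper's models), the candidate endpoint from the first pass is either accepted or deferred by the arbitrator:

```python
# Hypothetical two-pass endpointing flow: a second-pass arbitrator verifies
# the candidate endpoint proposed by the first-pass endpointer.
from typing import Callable, Sequence

def two_pass_endpoint(frames: Sequence,
                      first_pass_prob: Callable[[Sequence], float],
                      arbitrator_prob: Callable[[Sequence], float],
                      ep_threshold: float = 0.8,
                      arb_threshold: float = 0.5) -> bool:
    """Return True if the utterance should be finalized at the current frame."""
    if first_pass_prob(frames) < ep_threshold:
        return False  # first pass has not proposed an endpoint yet
    # A candidate endpoint exists: the arbitrator re-checks it with wider context,
    # which can reduce early cut-offs without adding latency when no candidate exists.
    return arbitrator_prob(frames) >= arb_threshold
```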
-
Personalized Predictive ASR for Latency Reduction in Voice Assistants
Interspeech
Streaming Automatic Speech Recognition (ASR) in voice assistants can utilize prefetching to partially hide the latency of response generation. Prefetching involves passing a preliminary ASR hypothesis to downstream systems in order to prefetch and cache a response. If the final ASR hypothesis after endpoint detection matches the preliminary one, the cached response can be delivered to the user, thus saving latency. In this paper, we extend this idea by introducing predictive automatic speech recognition, where we predict the full utterance from a partially observed utterance, and prefetch the response based on the predicted utterance. We introduce two personalization approaches and investigate the tradeoff between potential latency gains from successful predictions and the cost increase from failed predictions. We evaluate our methods on an internal voice assistant dataset as well as the public SLURP dataset.
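A minimal sketch of the prefetch-and-verify pattern described above (the predictor and response generator are hypothetical callables, not the production system):

```python
# Hypothetical prefetching with predictive ASR: speculatively generate a response
# for the predicted full utterance, and serve it only if the final hypothesis matches.
class PrefetchCache:
    def __init__(self, predict_full_utterance, generate_response):
        self.predict = predict_full_utterance  # e.g. a personalized completion model
        self.generate = generate_response      # expensive downstream response generation
        self.cache = {}

    def on_partial_hypothesis(self, partial_text: str) -> None:
        predicted = self.predict(partial_text)
        if predicted not in self.cache:
            self.cache[predicted] = self.generate(predicted)  # speculative (possibly wasted) work

    def on_final_hypothesis(self, final_text: str):
        if final_text in self.cache:
            return self.cache.pop(final_text)  # hit: generation latency was hidden
        return self.generate(final_text)       # miss: the cost of a failed prediction
```

The trade-off investigated in the paper shows up directly here: every wrong prediction is a wasted generation call, while every correct one removes the generation step from the user-perceived latency.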
-
VADOI: Voice-Activity-Detection Overlapping Inference for End-to-End Long-Form Speech Recognition
ICASSP
While end-to-end models have shown great success on the Automatic Speech Recognition task, performance degrades severely when target sentences are long-form. The previously proposed methods, overlapping inference and partial overlapping inference, have been shown to be effective for long-form decoding. For both methods, word error rate (WER) decreases monotonically as the overlapping percentage increases; setting aside computational cost, the setup with 50% overlapping during inference achieves the best performance. However, a lower overlapping percentage has the advantage of faster inference. In this paper, we first conduct comprehensive experiments comparing overlapping inference and partial overlapping inference with various configurations. We then propose Voice-Activity-Detection Overlapping Inference to provide a trade-off between WER and computational cost.
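As a conceptual sketch only (the VAD, threshold, and lengths are assumptions, not the paper's algorithm), VAD-guided segmentation prefers to cut long audio at pauses so that little or no overlap is needed:

```python
# Hypothetical VAD-guided chunking for long-form decoding: cut at low
# speech-probability frames when possible, otherwise fall back to overlap.
import numpy as np

def segment_long_audio(speech_prob: np.ndarray, max_len: int, overlap: int):
    """speech_prob: per-frame VAD speech probability; returns (start, end) frame spans."""
    assert 0 < overlap < max_len
    spans, start, n = [], 0, len(speech_prob)
    while start < n:
        end = min(start + max_len, n)
        if end < n:
            quiet = np.where(speech_prob[start:end] < 0.3)[0]  # candidate pause frames
            if len(quiet) > 0:
                end = start + int(quiet[-1]) + 1  # cut at the last pause in the window
                next_start = end                  # no overlap needed at a pause
            else:
                next_start = end - overlap        # no pause found: overlap the chunks
        else:
            next_start = n
        spans.append((start, end))
        start = next_start
    return spans
```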
-
Wav2vec-C: A Self-supervised Model for Speech Representation Learning
Interspeech
Wav2vec-C introduces a novel representation learning technique combining elements from wav2vec 2.0 and VQ-VAE. Our model learns to reproduce quantized representations from partially masked speech encodings using a contrastive loss, in a way similar to wav2vec 2.0. However, the quantization process is regularized by an additional consistency network that learns to reconstruct the input features of the wav2vec 2.0 network from the quantized representations, in a way similar to a VQ-VAE model. The proposed self-supervised model is trained on 10k hours of unlabeled data and subsequently used as the speech encoder in an RNN-T ASR model and fine-tuned with 1k hours of labeled data. This work is one of only a few studies of self-supervised learning on speech tasks with a large volume of real far-field labeled data. The Wav2vec-C encoded representations achieve, on average, twice the error reduction of the baseline and a higher codebook utilization than wav2vec 2.0.
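A conceptual sketch of the combined objective as described above (the encoders, quantizer, and weighting are omitted or assumed; this is not the released model):

```python
# Hypothetical Wav2vec-C-style loss: a wav2vec 2.0-style contrastive term plus a
# VQ-VAE-style consistency term that reconstructs input features from the codes.
import torch
import torch.nn.functional as F

def wav2vec_c_loss(context_at_masked, quantized_targets, negatives,
                   reconstructed_features, input_features,
                   temperature=0.1, lambda_consistency=1.0):
    # Contrastive term: the context vector at each masked position should be closer
    # to its own quantized target than to the K distractor (negative) codes.
    pos = F.cosine_similarity(context_at_masked, quantized_targets, dim=-1)       # (N,)
    neg = F.cosine_similarity(context_at_masked.unsqueeze(1), negatives, dim=-1)  # (N, K)
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1) / temperature              # (N, 1+K)
    contrastive = F.cross_entropy(logits, torch.zeros(len(logits), dtype=torch.long))
    # Consistency term: the quantized codes must retain enough information to
    # reconstruct the original input features, regularizing the codebook.
    consistency = F.mse_loss(reconstructed_features, input_features)
    return contrastive + lambda_consistency * consistency
```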
-
Acoustic landmarks contain more information about the phone string than other frames for automatic speech recognition with deep neural network acoustic model
The Journal of the Acoustical Society of America
Most mainstream automatic speech recognition (ASR) systems consider all feature frames equally important. However, acoustic landmark theory is based on the contrary idea that some frames are more important than others. Acoustic landmark theory exploits quantal nonlinearities in the articulatory-acoustic and acoustic-perceptual relations to define landmark times at which the speech spectrum abruptly changes or reaches an extremum; frames overlapping landmarks have been demonstrated to be sufficient for speech perception. In this work, experiments are conducted on the TIMIT corpus, with both Gaussian mixture model (GMM) and deep neural network (DNN)-based ASR systems, and it is found that frames containing landmarks are more informative for ASR than others. It is discovered that altering the level of emphasis on landmarks by re-weighting acoustic likelihoods tends to reduce the phone error rate (PER). Furthermore, by leveraging the landmarks as a heuristic, one of the hybrid DNN frame-dropping strategies maintained a PER within 0.44% of optimal when scoring less than half (45.8%, to be precise) of the frames. This hybrid strategy outperforms other non-heuristic-based methods and demonstrates the potential of landmarks for reducing computation.
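A toy sketch of the re-weighting and frame-dropping ideas (the weight values, landmark mask, and subsampling rate are assumptions, not the paper's settings):

```python
# Hypothetical landmark-based re-weighting / frame dropping of acoustic scores.
import numpy as np

def reweight_loglikes(frame_loglikes: np.ndarray, landmark_mask: np.ndarray,
                      landmark_weight: float = 1.5, other_weight: float = 1.0):
    """frame_loglikes: (T, n_states) log-likelihoods; landmark_mask: (T,) bool."""
    weights = np.where(landmark_mask, landmark_weight, other_weight)  # per-frame weight
    return frame_loglikes * weights[:, None]  # scaled scores passed on to the decoder

def hybrid_frame_selection(frame_loglikes, landmark_mask, keep_every=2):
    """Hybrid dropping sketch: always score landmark frames plus a regular subsample of the rest."""
    keep = landmark_mask.copy()
    keep[::keep_every] = True
    return frame_loglikes[keep], keep
```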
-
Improved ASR for Under-Resourced Languages Through Multi-Task Learning with Acoustic Landmarks
Interspeech
Furui first demonstrated that the identity of both consonant and vowel can be perceived from the C-V transition; later, Stevens proposed that acoustic landmarks are the primary cues for speech perception, and that steady-state regions are secondary or supplemental. Acoustic landmarks are perceptually salient, even in a language one doesn't speak, and it has been demonstrated that non-speakers of the language can identify features such as the primary articulator of the landmark. These factors suggest a strategy for developing language-independent automatic speech recognition: landmarks can potentially be learned once from a suitably labeled corpus and rapidly applied to many other languages. This paper proposes enhancing the cross-lingual portability of a neural network by using landmarks as the secondary task in multi-task learning (MTL). The network is trained in a well-resourced source language with both phone and landmark labels (English), then adapted to an under-resourced target language with only word labels (Iban). Landmark-tasked MTL reduces source-language phone error rate by 2.9% relative, and reduces target-language word error rate by 1.9%-5.9% depending on the amount of target-language training data. These results suggest that landmark-tasked MTL causes the DNN to learn hidden-node features that are useful for cross-lingual adaptation.
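A minimal sketch of landmark-tasked multi-task learning (the architecture, label inventories, and task weight are assumptions, not the paper's network):

```python
# Hypothetical shared encoder with a primary phone head and a secondary landmark head.
import torch.nn as nn
import torch.nn.functional as F

class LandmarkMTL(nn.Module):
    def __init__(self, feat_dim=40, hidden=512, n_phones=48, n_landmarks=7, alpha=0.3):
        super().__init__()
        self.alpha = alpha
        self.shared = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, hidden), nn.ReLU())
        self.phone_head = nn.Linear(hidden, n_phones)        # primary task
        self.landmark_head = nn.Linear(hidden, n_landmarks)  # secondary (auxiliary) task

    def loss(self, feats, phone_labels, landmark_labels):
        h = self.shared(feats)
        phone_loss = F.cross_entropy(self.phone_head(h), phone_labels)
        landmark_loss = F.cross_entropy(self.landmark_head(h), landmark_labels)
        # The landmark task shapes the shared features; it is these features that
        # transfer when the phone/word head is re-trained on the target language.
        return phone_loss + self.alpha * landmark_loss
```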
-
Machine learning on FPGAs to face the IoT revolution
ICCAD
FPGAs have been rapidly adopted for acceleration of Deep Neural Networks (DNNs) with improved latency and energy efficiency compared to CPU and GPU-based implementations. High-level synthesis (HLS) is an effective design flow for DNNs due to improved productivity, debugging, and design space exploration ability. However, optimizing large neural networks under resource constraints for FPGAs is still a key challenge. In this paper, we present a series of effective design techniques for implementing DNNs on FPGAs with high performance and energy efficiency. These include the use of configurable DNN IPs, performance and resource modeling, resource allocation across DNN layers, and DNN reduction and re-training. We showcase several design solutions including Long-term Recurrent Convolution Network (LRCN) for video captioning, Inception module for FaceNet face recognition, as well as Long Short-Term Memory (LSTM) for sound recognition. These and other similar DNN solutions are ideal implementations to be deployed in vision or sound based IoT applications.
-
Using Approximated Auditory Roughness as a Pre-filtering Feature for Human Screaming and Affective Speech AED
Interspeech
Detecting human screaming, shouting, and other verbal manifestations of fear and anger is of great interest to security Audio Event Detection (AED) systems. The Internet of Things (IoT) approach allows wide-covering, powerful AED systems to be distributed across the Internet, but a good feature to pre-filter the audio is critical to these systems. This work evaluates the potential of detecting screaming and affective speech using Auditory Roughness and proposes a very light-weight approximation method. Our approximation uses a similar amount of Multiple Add Accumulate (MAA) operations compared to short-term energy (STE), and at least 10× fewer MAA than MFCC. We evaluated the performance of our approximated roughness on the Mandarin Affective Speech corpus and a subset of the YouTube AudioSet for screaming against other low-complexity features. We show that our approximated roughness returns higher accuracy.
-
Selecting frames for automatic speech recognition based on acoustic landmarks
The Journal of the Acoustical Society of America
Most mainstream Mel-frequency cepstral coefficient (MFCC) based Automatic Speech Recognition (ASR) systems consider all feature frames equally important. However, acoustic landmark theory disagrees with this idea. Acoustic landmark theory exploits the quantal non-linear articulatory-acoustic relationships from human speech perception experiments and provides a theoretical basis for extracting acoustic features in the vicinity of landmark regions, where an abrupt change occurs in the spectrum of speech signals. In this work, we conducted experiments on the TIMIT corpus, with both GMM- and DNN-based ASR systems, and found that frames containing landmarks are more informative than others during the recognition process. We proved that altering the level of emphasis on landmark and non-landmark frames, through re-weighting or removing frame acoustic likelihoods accordingly, can change the phone error rate (PER) of the ASR system in a way dramatically different from making similar changes to random frames. Furthermore, by leveraging landmarks as a heuristic, one of our hybrid DNN frame-dropping strategies achieved a PER increase of only 0.44% while scoring less than half (41.2%, to be precise) of the frames. This hybrid strategy outperforms other non-heuristic-based methods and demonstrates the potential of landmarks for reducing ASR computation.
Courses
-
Advanced Digital Signal Processing
ECE551
-
Applied Parallel Programming
ECE408
-
Data Mining Principles
CS 512
-
Distributed Systems
CS425
-
Intro to Data Mining
CS412
-
Multimedia Signal Processing
ECE417
-
Non-linear & Adaptive Control
ECE517
-
Pattern Recognition
ECE544
-
Random Processes
ECE534
Projects
-
Augmenting Input Method Language Model with User Location Type Information
-
This folder open-sources the work I did for the CS 512 (Data Mining Principles) final project at the University of Illinois at Urbana-Champaign in Spring 2017.
The project focuses on improving input methods' ability to predict the user's next word by leveraging geo-temporal information. The statistical model used for word prediction is a hybrid Deep Neural Network (DNN) with projection layers (fully connected layers) and bi-directional LSTM layers. The model also leverages word embeddings, using embedding results from GloVe (https://nlp.stanford.edu/projects/glove/).
The model is evaluated on Twitter data collected over two weeks from the Twitter API using Tweepy (http://www.tweepy.org/). The geographic information for these tweets was collected using the Google Places API (https://developers.google.com/places/). More details on the project can be found in the augmenting-input-method.pdf included with the repository.
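A rough sketch of the idea (vocabulary size, dimensions, and location categories are assumptions, and a single uni-directional LSTM stands in for the project's projection + bi-directional LSTM stack):

```python
# Hypothetical location-aware next-word prediction: a learned location-type
# embedding is concatenated to the word embedding at every time step.
import torch
import torch.nn as nn

class GeoAwareLM(nn.Module):
    def __init__(self, vocab_size=20000, n_location_types=10,
                 word_dim=100, loc_dim=16, hidden=256):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)      # could be initialized from GloVe
        self.loc_emb = nn.Embedding(n_location_types, loc_dim)  # e.g. cafe / stadium / airport
        self.rnn = nn.LSTM(word_dim + loc_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, word_ids, location_type_id):
        # word_ids: (B, T) tokens typed so far; location_type_id: (B,) place category
        w = self.word_emb(word_ids)                                        # (B, T, word_dim)
        l = self.loc_emb(location_type_id).unsqueeze(1).expand(-1, w.size(1), -1)
        h, _ = self.rnn(torch.cat([w, l], dim=-1))
        return self.out(h)                                                 # (B, T, vocab) next-word logits
```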
Honors & Awards
-
2017 Design Automation Conference (DAC) Hardware Design Contest 1st Place
Design Automation Conference
Low Power, Low Cost Audio Based Security IoT System
Languages
-
English
Full professional proficiency
-
Chinese
Native or bilingual proficiency
More activity by Di
-
Through collaboration with researchers of economics, we evaluated ChatGPT, DeepSeek and other LLMs for their strategic reasoning capabilities. When…
Liked by Di He
-
Hi all, My team in Bellevue WA is continuing to look for Applied Scientists to join our team. We have ONE senior position and SEVERAL intern…
Liked by Di He
-
After 2 years of hard work, I am thrilled to share our NeurIPS 2024 paper “Condition-Aware Self-Supervised Learning Representation for General Speech…
Liked by Di He
-
We presented two papers at NeurIPS’24. One is titled “SnapKV: LLM Knows What You are Looking for Before Generation” and another “Decision-Making…
Liked by Di He
-
🔶 Many leading AI solutions are based on large language models (LLMs) that are expensive to run—in part because of inefficient use of computational…
Liked by Di He