Publications

This is a listing of most of my publications, arranged in reverse chronological order. Where available, publications include links to implementation code and demonstration pages. Additional publications can be found on my Google Scholar profile.

Downloaded papers are for personal use only, which includes educational and research purposes; they may not be used for commercial redistribution or promotion.

Large Language Model Guided Decoding for Self-Supervised Speech Recognition

Eyal Cohen, Bhiksha Raj, Joseph Keshet

Preprint, 2026

This paper introduces a novel zero-shot ASR decoding method that tightly integrates LLMs with SSL acoustic models while keeping them separable and independently trainable. The approach iteratively guides the ASR decoder using LLMs to sample candidate tokens and align them with acoustic signals.

Spectral Analysis of Diffusion Models with Application to Schedule Design

Roi Benita, Michael Elad, Joseph Keshet

The Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS), 2025.

We propose a spectral analysis of the diffusion model’s inference, treating it as a transfer function that maps initial noise to the generated signal. This analysis leads to a mechanism for designing noise schedules.

Beyond Transcription: Mechanistic Interpretability in ASR

Neta Glazer, Yael Segal-Feldman, Hilit Segev, Aviv Shamsian, Asaf Buchnick, Gill Hetz, Ethan Fetaya, Joseph Keshet, Aviv Navon

The 40th Annual AAAI Conference on Artificial Intelligence (AAAI-26), 2026

This paper applies interpretability methods, such as logit lens, linear probing, and activation patching, to examine how acoustic and semantic information evolves across layers in ASR systems.

DRAX: Speech Recognition with Discrete Flow Matching

Aviv Navon, Aviv Shamsian, Neta Glazer, Yael Segal-Feldman, Gill Hetz, Joseph Keshet, Ethan Fetaya

Preprint, 2025.

CarelessWhisper: Turning Whisper into a Causal Streaming Model

Tomer Krichli, Bhiksha Raj, Joseph Keshet

Preprint, 2025.

This paper proposes a method for converting a transformer encoder-decoder architecture into a low-latency streaming model that does not rely on future context. It also analyzes why such a conversion is not straightforward. The proposed changes to the pre-trained model achieve superior performance compared to existing streaming approaches while operating with lower complexity.

How Does a Deep Neural Network Look at Lexical Stress?

Itai Allouche, Itay Asael, Rotem Rousso, Vered Dassa, Ann Bradlow, Seung-Eun Kim, Matthew Goldrick, Joseph Keshet

Preprint, 2025.

This paper investigates which acoustic features deep neural networks use to identify stressed syllables, employing interpretability methods to reveal the cues influencing the model’s decisions. We introduce a new lexical stress dataset, automatically collected without human annotation. We present a state-of-the-art lexical stress classifier that generalizes to new words, including those from unseen languages. Finally, we apply Layer-wise Relevance Propagation (LRP) to uncover the spectral features driving the classifier’s high accuracy.

Keyword Spotting with Hyper-Matched Filters for Small Footprint Devices

Yael Segal-Feldman, Ann R. Bradlow, Matthew Goldrick, Joseph Keshet

Preprint, 2025.

This paper presents an open-vocabulary keyword spotting model that achieves state-of-the-art detection accuracy on small-footprint devices. The architecture consists of three main components: a speech encoder, a target keyword encoder, and a detection network. The speech encoder is implemented using either a Tiny Whisper or a Tiny Conformer model. The target keyword encoder operates as a hyper-network, taking the desired keyword as a character string and generating a unique set of weights for a convolutional layer—effectively acting as a keyword-specific matched filter. The detection network applies these matched-filter weights to perform a keyword-specific convolution, which in turn guides the cross-attention mechanism of a Perceiver module to determine whether the target term appears in the audio recording.

PatchDSU: Uncertainty Modeling for Out of Distribution Generalization in Keyword Spotting

Bronya Roni Chernyak, Yael Segal, Yosi Shrem, Joseph Keshet

Preprint, 2025.

WhisperNER: Unified Open Named Entity and Speech Recognition

Gil Ayache, Menachem Pirchi, Aviv Navon, Aviv Shamsian, Gill Hetz, Joseph Keshet

IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2025

WhisperNER is a unified model for automatic speech recognition (ASR) and named entity recognition (NER), with zero-shot capabilities. The WhisperNER model is designed as a strong base model for the downstream task of ASR with NER, and can be fine-tuned on specific datasets for improved performance.

UmbraTTS: Adapting Text-to-Speech to Environmental Contexts with Flow Matching

Neta Glazer, Aviv Navon, Yael Segal-Feldman, Aviv Shamsian, Hilit Segev, Asaf Buchnick, Menachem Pirchi, Gil Hetz, Joseph Keshet

Workshop on Machine Learning for Audio, ICML 2025

UmbraTTS is a flow-matching based TTS model that jointly generates both speech and environmental audio, conditioned on text and acoustic context.

Self-Supervised Speech Representations in a Pre-train Speech Model Represent Key Rapid Automatized Naming Variability in Autism

Sarah Ethridge, Joe Lau, Bronya R Chernyak, Robert Voigt, Matt Goldrick, Joseph Keshet, Molly Losh

Society for Computation in Linguistics, 8(1), 2025.

Self-supervised pre-trained speech models, such as HuBERT, capture meaningful variability in the cognitive-linguistic patterns of autism without the need for pre-defined acoustic features or speech-to-text alignment.

A Front-End Adaptation Network for Improving Speech Recognition Performance in Packet Loss and Noisy Environments

Yehoshua Dissen, Shiry Yonash, Israel Cohen, Joseph Keshet

IEEE Transactions on Audio, Speech and Language Processing, Volume 33, pp. 2175-2188, 2025

FlowTSE: Target Speaker Extraction with Flow Matching

Aviv Navon, Aviv Shamsian, Yael Segal-Feldman, Neta Glazer, Gil Hetz, Joseph Keshet

The 26th Annual Conference of the International Speech Communication Association (Interspeech), 2025.

Whisper in Medusa’s Ear: Multi-head Efficient Decoding for Transformer-based ASR

Yael Segal-Feldman, Aviv Shamsian, Aviv Navon, Gill Hetz, Joseph Keshet

IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025

Whisper is a powerful encoder-decoder model for speech transcription and translation. To accelerate its inference, we propose two architectures that extend Whisper by enabling multi-token prediction per iteration.

Enhancing analysis of diadochokinetic speech using deep neural networks

Yael Segal-Feldman, Kasia Hitczenko, Matthew Goldrick, Adam Buchwald, Angela Roberts, Joseph Keshet

Computer Speech & Language, Volume 90, 101715, March 2025

Predicting relative intelligibility from inter-talker distances in a perceptual similarity space for speech

Seung-Eun Kim, Bronya R Chernyak, Joseph Keshet, Matthew Goldrick, Ann R Bradlow

Psychonomic Bulletin & Review, 2025

HebDB: a Weakly Supervised Dataset for Hebrew Speech Processing

Arnon Turetzky, Or Tal, Yael Segal-Feldman, Yehoshua Dissen, Ella Zeldes, Amit Roth, Eyal Cohen, Yosi Shrem, Bronya R. Chernyak, Olga Seleznova, Joseph Keshet, Yossi Adi

The 25th Annual Conference of the International Speech Communication Association (Interspeech), 2024

HebDB contains natural dialogues of spontaneous speech. It comprises testimonies from World War II survivors and five podcasts covering a wide range of subjects and speakers. While the testimonies provide firsthand accounts of historical events, the majority of the dataset consists of podcasts covering diverse topics such as economy, politics, sports, culture, science, history, and music. We provide two versions of the dataset: raw and pre-processed.

Enhanced ASR Robustness to Packet Loss with a Front-End Adaptation Network

Yehoshua Dissen, Shiry Yonash, Israel Cohen, Joseph Keshet

The 25th Annual Conference of the International Speech Communication Association (Interspeech), 2024

Tradition or Innovation: A Comparison of Modern ASR Methods for Forced Alignment

Rotem Rousso, Eyal Cohen, Joseph Keshet, Eleanor Chodroff

The 25th Annual Conference of the International Speech Communication Association (Interspeech), 2024

Keyword-Guided Adaptation of Automatic Speech Recognition

Aviv Shamsian, Aviv Navon, Neta Glazer, Gill Hetz, Joseph Keshet

The 25th Annual Conference of the International Speech Communication Association (Interspeech), 2024

A Perceptual Similarity Space For Speech Based On Self-Supervised Speech Representations

Bronya R Chernyak, Ann R Bradlow, Joseph Keshet, Matthew Goldrick

Journal of the Acoustical Society of America, Vol. 155, Issue 6, pp. 3915–3929, 2024

This study introduces a perceptual similarity space for speech, developed using self-supervised learning, which captures distinctions between speech samples without relying on pre-defined acoustic features or alignment to text.

Open Vocabulary Keyword-Spotting with Adaptive Instance Normalization

Aviv Navon, Aviv Shamsian, Neta Glazer, Gill Hetz, Joseph Keshet

IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 11656-11660, 2024

Open vocabulary keyword spotting is a crucial and challenging task in automatic speech recognition (ASR) that focuses on detecting user-defined keywords within a spoken utterance.

Automatic Recognition of Second Language Speech-in-Noise

Seung-Eun Kim, Bronya R. Chernyak, Olga Seleznova, Joseph Keshet, Matthew Goldrick, Ann R. Bradlow

JASA Express Letters, Volume 4, Issue 2, 2024

DiffAR: Denoising Diffusion Autoregressive Model for Raw Speech Waveform Generation

Roi Benita, Michael Elad, Joseph Keshet

International Conference on Learning Representations (ICLR), 2024

Speech Characteristics Yield Important Clues About Motor Function: Speech Variability in Individuals at Clinical High-Risk for Psychosis

Kasia Hitczenko, Yael Segal, Joseph Keshet, Matthew Goldrick, Vijay A Mittal

Nature Schizophrenia, Volume 9, Article number: 60, 2023

Using Automatic Acoustic Analysis to Reveal Disruptions to Speech Articulation in Individuals at Risk for Psychosis

Kasia Hitczenko, Yael Segal, Joseph Keshet, Vijay Mittal, Matthew Goldrick

Journal of the Acoustical Society of America, Volume 153, A290, 2023

A Baseline for Detecting Out-of-Distribution Examples in Image Captioning

Gal Shalev, Gabi Shalev, Joseph Keshet

The 30th ACM International Conference on Multimedia, pp. 4175–4184, 2022