Title:
NaturalL2S: End-to-end high-quality multispeaker lip-to-speech synthesis with differential digital signal processing.
Authors:
Liang Y; Institute of Acoustics, Chinese Academy of Sciences, Beijing, 100190, China; University of Chinese Academy of Sciences, Beijing, 100049, China.
Liu F; Institute of Acoustics, Chinese Academy of Sciences, Beijing, 100190, China; University of Chinese Academy of Sciences, Beijing, 100049, China.
Li A; Institute of Acoustics, Chinese Academy of Sciences, Beijing, 100190, China; University of Chinese Academy of Sciences, Beijing, 100049, China.
Li X; Institute of Acoustics, Chinese Academy of Sciences, Beijing, 100190, China; University of Chinese Academy of Sciences, Beijing, 100049, China.
Lei C; Wuhan Second Ship Design and Research Institute, Wuhan, 430205, China.
Zheng C; Institute of Acoustics, Chinese Academy of Sciences, Beijing, 100190, China; University of Chinese Academy of Sciences, Beijing, 100049, China. Electronic address: cszheng@mail.ioa.ac.cn.
Source:
Neural networks : the official journal of the International Neural Network Society [Neural Netw] 2026 Feb; Vol. 194, pp. 108163. Date of Electronic Publication: 2025 Oct 01.
Publication Type:
Journal Article
Language:
English
Journal Info:
Publisher: Pergamon Press; Country of Publication: United States; NLM ID: 8805018; Publication Model: Print-Electronic; Cited Medium: Internet; ISSN: 1879-2782 (Electronic); Linking ISSN: 0893-6080; NLM ISO Abbreviation: Neural Netw; Subsets: MEDLINE
Imprint Name(s):
Original Publication: New York : Pergamon Press, [c1988-
Contributed Indexing:
Keywords: Differentiable digital signal processing; End-to-end training; Lip-to-speech; Speech reconstruction
Entry Date(s):
Date Created: 20251014 Date Completed: 20251216 Latest Revision: 20251216
Update Code:
20260130
DOI:
10.1016/j.neunet.2025.108163
PMID:
41086797
Database:
MEDLINE

Further Information

Recent advancements in visual speech recognition (VSR) have promoted progress in lip-to-speech synthesis, where pre-trained VSR models enhance the intelligibility of synthesized speech by providing valuable semantic information. The success achieved by cascade frameworks, which combine pseudo-VSR with pseudo-text-to-speech (TTS) or implicitly utilize the transcribed text, highlights the benefits of leveraging VSR models. However, these methods typically rely on mel-spectrograms as an intermediate representation, which may introduce a key bottleneck: the domain gap between synthetic mel-spectrograms, generated from inherently error-prone lip-to-speech mappings, and real mel-spectrograms used to train vocoders. This mismatch inevitably degrades synthesis quality. To bridge this gap, we propose Natural Lip-to-Speech (NaturalL2S), an end-to-end framework that jointly trains the vocoder with the acoustic inductive priors. Specifically, our architecture introduces a fundamental frequency (F0) predictor to explicitly model prosodic variations, where the predicted F0 contour drives a differentiable digital signal processing (DDSP) synthesizer to provide acoustic priors for subsequent refinement. Notably, the proposed system achieves satisfactory performance on speaker similarity without requiring explicit speaker embeddings. Both objective metrics and subjective listening tests demonstrate that NaturalL2S significantly enhances synthesized speech quality compared to existing state-of-the-art methods. Audio samples are available on our demonstration page: https://yifan-liang.github.io/NaturalL2S/.
(Copyright © 2025 Elsevier Ltd. All rights reserved.)
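
The sketch below illustrates the kind of DDSP-style harmonic-plus-noise synthesis the abstract describes, in which a predicted F0 contour drives a differentiable synthesizer to render an acoustic prior. All names and details here (the functions ddsp_prior and upsample, the 16 kHz sample rate, the hop size, and the tensor shapes) are assumptions made for illustration; the record does not include the paper's actual implementation.

```python
# Minimal sketch of a DDSP-style harmonic-plus-noise synthesizer driven by an
# F0 contour, in the spirit of the acoustic prior described in the abstract.
# All names, shapes, and constants are illustrative assumptions, not the
# paper's API.
import torch


def upsample(frames: torch.Tensor, hop: int) -> torch.Tensor:
    """Linearly interpolate frame-rate controls (B, T_frames, C) to sample rate."""
    x = frames.transpose(1, 2)                       # (B, C, T_frames)
    x = torch.nn.functional.interpolate(
        x, scale_factor=hop, mode="linear", align_corners=False)
    return x.transpose(1, 2)                         # (B, T_frames * hop, C)


def ddsp_prior(f0: torch.Tensor, harm_amps: torch.Tensor, noise_gain: torch.Tensor,
               sr: int = 16000, hop: int = 200) -> torch.Tensor:
    """Render a crude speech-like prior from frame-rate controls.

    f0:         (B, T, 1)  predicted fundamental frequency in Hz (0 = unvoiced)
    harm_amps:  (B, T, K)  per-harmonic amplitudes
    noise_gain: (B, T, 1)  broadband noise level
    """
    f0_s = upsample(f0, hop)                         # (B, N, 1) sample-rate F0
    amps_s = upsample(harm_amps, hop)                # (B, N, K)
    noise_s = upsample(noise_gain, hop)              # (B, N, 1)

    # Cumulative phase: phi[n] = 2*pi * sum(f0/sr); harmonic k runs at k * F0.
    k = torch.arange(1, amps_s.shape[-1] + 1, device=f0.device)  # (K,)
    phase = 2 * torch.pi * torch.cumsum(f0_s / sr, dim=1)        # (B, N, 1)
    # Zero out harmonics above Nyquist to avoid aliasing.
    alias_mask = (f0_s * k) < (sr / 2)                           # (B, N, K)
    harmonics = torch.sin(phase * k) * amps_s * alias_mask       # (B, N, K)
    voiced = harmonics.sum(dim=-1)                               # (B, N)

    # Unvoiced component: gain-shaped white noise, a stand-in for a
    # filtered-noise branch whose exact form is not given in this record.
    noise = torch.randn_like(voiced) * noise_s.squeeze(-1)
    return voiced + noise


if __name__ == "__main__":
    B, T, K = 1, 50, 40
    f0 = torch.full((B, T, 1), 120.0)                # flat 120 Hz contour
    amps = torch.softmax(torch.randn(B, T, K), -1)   # normalized harmonic weights
    gain = torch.full((B, T, 1), 0.01)
    audio = ddsp_prior(f0, amps, gain)               # (1, 10000) ~ 0.625 s @ 16 kHz
    print(audio.shape)
```

Because every operation above is differentiable, gradients from a downstream refinement network or waveform loss could in principle flow back into the F0 predictor, which is the property that motivates using a DDSP synthesizer as the prior-generating stage.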

Declaration of competing interest: The authors declare that they have no conflict of interest with respect to this work, and no commercial or associative interest that represents a conflict of interest in connection with the work submitted.