Unified-Modal Speech-Text Pre-Training for Spoken Language Processing
Unified-modal speech-text pre-training for spoken language processing:
SpeechT5 (ACL 2022): SpeechT5: Unified-Modal Encoder-Decoder Pre-training for Spoken Language Processing
Speech2C (INTERSPEECH 2022): Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data
YiTrans (IWSLT 2022): The YiTrans End-to-End Speech Translation System for IWSLT 2022 Offline Shared Task
SpeechUT (EMNLP 2022): SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training
SpeechLM (IEEE/ACM TASLP): SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data
Speech2S (ICASSP 2023): Joint Pre-Training with Speech and Bilingual Text for Direct Speech to Speech Translation
Prosody-SpeechT5 (ICASSP 2023): Prosody-aware SpeechT5 for Expressive Neural TTS
VATLM (IEEE Transactions on Multimedia): VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning
VALL-E X (arXiv 2023): Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling
VioLA (arXiv 2023): VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation
WavLLM (arXiv 2024): WavLLM: Towards Robust and Adaptive Speech Large Language Model
Motivated by the success of T5 (Text-To-Text Transfer Transformer) in pre-trained natural language processing models, we propose SpeechT5, a unified-modal framework that explores encoder-decoder pre-training for self-supervised speech/text representation learning.
The SpeechT5 framework consists of a shared encoder-decoder network and six modal-specific (speech/text) pre/post-nets.
After preprocessing the input speech/text through the pre-nets, the shared encoder-decoder network models the sequence-to-sequence transformation, and then the post-nets generate the output in the speech/text modality based on the output of the decoder.
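To make the routing concrete, here is a minimal sketch of how a task could select its pre-nets and post-net around the shared encoder-decoder. The net names, the `Route` type, and the task-to-modality mapping are illustrative assumptions, not identifiers from the SpeechT5 code base; the sketch only enumerates which modal-specific nets a task would touch and does no modeling.

```python
from dataclasses import dataclass

MODALITIES = ("speech", "text")
ROLES = ("encoder_prenet", "decoder_prenet", "decoder_postnet")

# Assumed input/output modality per task (hypothetical mapping for illustration).
TASK_MODALITIES = {
    "asr": ("speech", "text"),    # automatic speech recognition
    "tts": ("text", "speech"),    # text to speech
    "st": ("speech", "text"),     # speech-to-text translation
    "vc": ("speech", "speech"),   # voice conversion
    "se": ("speech", "speech"),   # speech enhancement
}


@dataclass(frozen=True)
class Route:
    """Names of the modal-specific nets placed around the shared encoder-decoder."""
    encoder_prenet: str
    decoder_prenet: str
    decoder_postnet: str


def all_nets():
    """Enumerate the six modal-specific nets: three roles for each of two modalities."""
    return sorted(f"{m}_{role}" for m in MODALITIES for role in ROLES)


def route(task):
    """Pick the nets for a task from its source and target modality."""
    src, tgt = TASK_MODALITIES[task]
    return Route(
        encoder_prenet=f"{src}_encoder_prenet",
        decoder_prenet=f"{tgt}_decoder_prenet",
        decoder_postnet=f"{tgt}_decoder_postnet",
    )


if __name__ == "__main__":
    print(all_nets())
    for task in TASK_MODALITIES:
        print(task, route(task))
```

Under these assumptions, every task shares the same encoder-decoder weights and differs only in which of the six nets wrap it.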
Leveraging large-scale unlabeled speech and text data, we pre-train SpeechT5 to learn a unified-modal representation, hoping to improve the modeling capability for both speech and text.
To align the textual and speech information into this unified semantic space, we propose a cross-modal vector quantization approach that randomly mixes up speech/text states with latent units as the interface between encoder and decoder.
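The mixing idea can be sketched in a few lines. This is a simplified stand-in, not the method as implemented in SpeechT5: it uses a plain nearest-neighbor lookup in a fixed codebook, and the function names, the `mix_prob` parameter, and the toy codebook are assumptions made for illustration. Each encoder state is replaced by its closest latent unit with probability `mix_prob`, so the decoder sees a mixture of contextual states and shared latent units.

```python
import random


def nearest_code(state, codebook):
    """Return the codebook vector closest to `state` in squared L2 distance."""
    return min(
        codebook,
        key=lambda code: sum((s - c) ** 2 for s, c in zip(state, code)),
    )


def mix_with_latent_units(states, codebook, mix_prob, rng):
    """Replace each state by its nearest latent unit with probability `mix_prob`.

    `states` is a sequence of encoder output vectors (from speech or text);
    the same codebook is used for both modalities so the units are shared.
    """
    mixed = []
    for state in states:
        if rng.random() < mix_prob:
            mixed.append(list(nearest_code(state, codebook)))
        else:
            mixed.append(list(state))
    return mixed


if __name__ == "__main__":
    codebook = [[0.0, 0.0], [1.0, 1.0]]            # toy shared codebook
    states = [[0.1, -0.2], [0.9, 1.2], [0.4, 0.3]]  # toy encoder states
    print(mix_with_latent_units(states, codebook, mix_prob=0.5, rng=random.Random(0)))
```

Because speech and text states are quantized against the same codebook, the replaced positions carry modality-independent units, which is the sense in which the latent units act as an interface between encoder and decoder.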
Extensive evaluations show the superiority of the proposed SpeechT5 framework on a wide variety of spoken language processing tasks, including automatic speech recognition, speech synthesis, speech translation, voice conversion, speech enhancement, and speaker identification.
The tables below report results on each of these tasks in turn: automatic speech recognition (ASR), text to speech (TTS), speech-to-text translation (ST), voice conversion (VC), speech enhancement (SE), and speaker identification (SID).
Automatic speech recognition: evaluation on LibriSpeech (WER, %)
Model | LM | dev-clean | dev-other | test-clean | test-other |
---|---|---|---|---|---|
wav2vec 2.0 Base | - | 6.1 | 13.5 | 6.1 | 13.3 |
HuBERT Base | - | 5.5 | 13.1 | 5.8 | 13.3 |
Baseline (w/o CTC) | - | 5.8 | 12.3 | 6.2 | 12.3 |
Baseline | - | 4.9 | 11.7 | 5.0 | 11.9 |
SpeechT5 (w/o CTC) | - | 5.4 | 10.7 | 5.8 | 10.7 |
SpeechT5 | - | 4.3 | 10.3 | 4.4 | 10.4 |
DiscreteBERT | 4-gram | 4.0 | 10.9 | 4.5 | 12.1 |
wav2vec 2.0 Base | 4-gram | 2.7 | 7.9 | 3.4 | 8.0 |
HuBERT Base | 4-gram | 2.7 | 7.8 | 3.4 | 8.1 |
wav2vec 2.0 Base | Transf. | 2.2 | 6.3 | 2.6 | 6.3 |
Baseline | Transf. | 2.3 | 6.3 | 2.5 | 6.3 |
SpeechT5 | Transf. | 2.1 | 5.5 | 2.4 | 5.8 |
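For readers who want to reproduce numbers of this kind, the following is a minimal sketch of corpus-level word error rate, assuming the table reports WER in percent as is standard for LibriSpeech. The function names are hypothetical, and the sketch applies no text normalization; actual evaluation scripts typically normalize casing and punctuation first.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists.

    Counts the minimum number of substitutions, insertions, and deletions
    needed to turn `hyp` into `ref`.
    """
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(references, hypotheses):
    """Corpus-level WER in percent: total word errors over total reference words."""
    errors = 0
    words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tokens = ref.split()
        errors += edit_distance(ref_tokens, hyp.split())
        words += len(ref_tokens)
    return 100.0 * errors / words


if __name__ == "__main__":
    refs = ["the cat sat on the mat", "hello world"]
    hyps = ["the cat sat on a mat", "hello word"]
    print(f"WER: {wer(refs, hyps):.1f}%")
```

Summing errors over the whole corpus before dividing, rather than averaging per-utterance rates, keeps short utterances from dominating the score.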
Text to speech: evaluation on LibriTTS
Model | Naturalness | MOS | CMOS |
---|---|---|---|
Ground Truth | - | 3.87 | - |
Baseline | 2.76 | 3.56 | 0 |
SpeechT5 | 2.91 | 3.65 | +0.290 |
Speech-to-text translation: evaluation on MuST-C v1 (BLEU)
Model | EN-DE | EN-FR |
---|---|---|
Fairseq ST | 22.70 | 32.90 |
ESPnet ST | 22.91 | 32.69 |
Adapter Tuning | 24.63 | 34.98 |
Baseline | 23.43 | 33.76 |
SpeechT5 (w/o initializing decoder) | 24.44 | 34.5 |
SpeechT5 | 25.18 | 35.30 |
Voice conversion: evaluation on CMU Arctic
Model | WER (bdl to slt) | WER (clb to slt) | MCD (bdl to slt) | MCD (clb to slt) |
---|---|---|---|---|
VTN w/ ASR | 11.1 | 10.9 | 6.5 | 6.11 |
VTN w/ TTS | 7.6 | 9.1 | 6.33 | 13.3 |
Many-to-many VTN | - | - | 6.13 | 5.97 |
Baseline | 21.5 | 10.8 | 6.26 | 6.16 |
SpeechT5 | 7.8 | 6.4 | 5.93 | 5.87 |
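The MCD columns presumably refer to mel-cepstral distortion in dB, where lower is better. A minimal sketch of the usual formula follows, under two assumptions not stated in the source: the target and converted frames are already time-aligned (for example by dynamic time warping), and the caller has already dropped the energy coefficient if desired. The function name is hypothetical.

```python
import math


def mel_cepstral_distortion(target_frames, converted_frames):
    """Mean mel-cepstral distortion in dB over time-aligned frames.

    Per frame: (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2),
    then averaged over all frames.
    """
    if not target_frames or len(target_frames) != len(converted_frames):
        raise ValueError("frame sequences must be non-empty and equally long")
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for target, converted in zip(target_frames, converted_frames):
        squared_diff = sum((t - c) ** 2 for t, c in zip(target, converted))
        total += scale * math.sqrt(2.0 * squared_diff)
    return total / len(target_frames)


if __name__ == "__main__":
    target = [[1.0, 0.5, 0.2], [0.9, 0.4, 0.1]]
    converted = [[1.1, 0.4, 0.2], [0.8, 0.5, 0.1]]
    print(f"MCD: {mel_cepstral_distortion(target, converted):.2f} dB")
```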
Speech enhancement: evaluation on WSJ0 Hipster Ambient Mixtures (WHAM!)
Model | WER |
---|---|
Ground Truth Speech | 3.2 |
Noisy Speech | 76.1 |
Baseline | 10.9 |
SpeechT5 | 8.9 |
Speaker identification: evaluation on VoxCeleb1
Model | Acc |
---|---|
SUPERB, wav2vec 2.0 Base | 75.18% |
SUPERB, HuBERT Base | 81.42% |
SUPERB, HuBERT Large | 90.33% |
SpeechNet, single task | 86.00% |
SpeechNet, multi-task with TTS | 87.90% |
Thin ResNet-34 | 89.00% |
Baseline | 91.92% |
SpeechT5 | 96.49% |
This project is licensed under the license found in the LICENSE file in the root directory of this source tree.
Portions of the source code are based on the FAIRSEQ and ESPnet projects.
This project has adopted the Microsoft Open Source Code of Conduct.
If you find our work useful in your research, please cite the following papers:
@article{Ao2021SpeechT5,
title = {SpeechT5: Unified-Modal Encoder-Decoder Pre-training for Spoken Language Processing},
author = {Junyi Ao and Rui Wang and Long Zhou and Chengyi Wang and Shuo Ren and Yu Wu and Shujie Liu and Tom Ko and Qing Li and Yu Zhang and Zhihua Wei and Yao Qian and Jinyu Li and Furu Wei},
eprint={2110.07205},
archivePrefix={arXiv},
primaryClass={eess.AS},
year={2021}
}
@article{Ao2022Speech2C,
title = {Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data},
author = {Junyi Ao and Ziqiang Zhang and Long Zhou and Shujie Liu and Haizhou Li and Tom Ko and Lirong Dai and Jinyu Li and Yao Qian and Furu Wei},
eprint={2203.17113},
archivePrefix={arXiv},
primaryClass={cs.SD},
year={2022}
}
@article{Zhang2022Yitrans,
title = {The YiTrans End-to-End Speech Translation System for IWSLT 2022 Offline Shared Task},
author = {Zhang, Ziqiang and Ao, Junyi and Zhou, Long and Liu, Shujie and Wei, Furu and Li, Jinyu},
eprint={2206.05777},
archivePrefix={arXiv},
primaryClass={cs.CL},
year={2022}
}
@article{zhang2022speechut,
title = {SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training},
author = {Zhang, Ziqiang and Zhou, Long and Ao, Junyi and Liu, Shujie and Dai, Lirong and Li, Jinyu and Wei, Furu},
eprint={2210.03730},
archivePrefix={arXiv},
primaryClass={cs.CL},
year={2022}
}
@article{zhang2022speechlm,
title = {SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data},
author = {Zhang, Ziqiang and Chen, Sanyuan and Zhou, Long and Wu, Yu and Ren, Shuo and Liu, Shujie and Yao, Zhuoyuan and Gong, Xun and Dai, Lirong and Li, Jinyu and Wei, Furu},
eprint={2209.15329},
archivePrefix={arXiv},
primaryClass={cs.CL},
year={2022}
}
For help or issues using SpeechT5 models, please submit a GitHub issue.
For other communications related to SpeechT5, please contact Long Zhou ([email protected]).