A Fundamental End-to-End Speech Recognition Toolkit and Open Source SOTA Pretrained Models, Supporting Speech Recognition, Voice Activity Detection, Text Post-processing etc.
FunASR aims to bridge academic research and industrial applications of speech recognition. By supporting the training and fine-tuning of industrial-grade speech recognition models, it lets researchers and developers carry out research and production more conveniently, and it promotes the growth of the speech recognition ecosystem. ASR for Fun!
Highlights | News | Installation | Quick Start | Tutorial | Runtime | Model Zoo | Contact
python>=3.8
torch>=1.13
torchaudio
pip3 install -U funasr
git clone https://github.com/alibaba/FunASR.git && cd FunASR
pip3 install -e ./
pip3 install -U modelscope huggingface_hub
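Before installing, it can help to confirm that the interpreter and any already-installed PyTorch meet the minimum versions listed above. The following is a minimal sketch using only the standard library. The helper names `parse_version`, `meets_minimum`, and `check_environment` are illustrative and not part of FunASR, and the version parsing is deliberately simple (it ignores suffixes such as `+cu118`).

```python
import re
import sys
from importlib import metadata


def parse_version(text):
    """Return the leading numeric components of a version string as a tuple."""
    # "2.1.0+cu118" -> (2, 1, 0); "1.13" -> (1, 13)
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"no numeric version found in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed, minimum):
    """Compare two version strings component by component."""
    return parse_version(installed) >= parse_version(minimum)


def check_environment():
    """Report whether Python and (if installed) torch satisfy the minimums."""
    report = {"python>=3.8": sys.version_info >= (3, 8)}
    try:
        report["torch>=1.13"] = meets_minimum(metadata.version("torch"), "1.13")
    except metadata.PackageNotFoundError:
        # torch is not installed in this environment
        report["torch>=1.13"] = False
    return report


if __name__ == "__main__":
    for requirement, ok in check_environment().items():
        print(f"{requirement}: {'ok' if ok else 'NOT satisfied'}")
```

A failed torch check only means that a suitable PyTorch build still needs to be installed before running the `pip3 install` commands.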
FunASR has open-sourced a large number of models pre-trained on industrial data. You are free to use, copy, modify, and share FunASR models under the Model License Agreement. Below are some representative models; for more, please refer to the Model Zoo.
(Note: ⭐ represents the ModelScope model zoo, 🤗 represents the Huggingface model zoo, 🍀 represents the OpenAI model zoo)
Model Name | Task Details | Training Data | Parameters |
---|---|---|---|
SenseVoiceSmall (⭐ 🤗) | multiple speech understanding capabilities, including ASR, ITN, LID, SER, and AED; supports languages such as zh, yue, en, ja, ko | 300000 hours | 234M |
paraformer-zh (⭐ 🤗) | speech recognition, with timestamps, non-streaming | 60000 hours, Mandarin | 220M |
paraformer-zh-streaming (⭐ 🤗) | speech recognition, streaming | 60000 hours, Mandarin | 220M |
paraformer-en (⭐ 🤗) | speech recognition, without timestamps, non-streaming | 50000 hours, English | 220M |
conformer-en (⭐ 🤗) | speech recognition, non-streaming | 50000 hours, English | 220M |
ct-punc (⭐ 🤗) | punctuation restoration | 100M, Mandarin and English | 290M |
fsmn-vad (⭐ 🤗) | voice activity detection | 5000 hours, Mandarin and English | 0.4M |
fsmn-kws (⭐) | keyword spotting, streaming | 5000 hours, Mandarin | 0.7M |
fa-zh (⭐ 🤗) | timestamp prediction | 5000 hours, Mandarin | 38M |
cam++ (⭐ 🤗) | speaker verification/diarization | 5000 hours | 7.2M |
Whisper-large-v3 (⭐ 🍀) | speech recognition, with timestamps, non-streaming | multilingual | 1550M |
Whisper-large-v3-turbo (⭐ 🍀) | speech recognition, with timestamps, non-streaming | multilingual | 809M |
Qwen-Audio (⭐ 🤗) | audio-text multimodal models (pretraining) | multilingual | 8B |
Qwen-Audio-Chat (⭐ 🤗) | audio-text multimodal models (chat) | multilingual | 8B |
emotion2vec+large (⭐ 🤗) | speech emotion recognition | 40000 hours | 300M |
Below is a quick start tutorial. Test audio files (Mandarin, English).
funasr ++model=paraformer-zh ++vad_model="fsmn-vad" ++punc_model="ct-punc" ++input=asr_example_zh.wav
Note: The input can be a single audio file or a file list in Kaldi-style wav.scp format, where each line has the form: wav_id wav_path
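A Kaldi-style wav.scp list is a plain text file with one utterance per line: an ID, whitespace, then the audio path. A minimal sketch for writing and reading such a file follows. The helper names and the example IDs and paths are hypothetical, and only the standard library is used.

```python
import tempfile
from pathlib import Path


def write_wav_scp(entries, scp_path):
    """Write (wav_id, wav_path) pairs as 'wav_id wav_path' lines."""
    lines = [f"{wav_id} {wav_path}" for wav_id, wav_path in entries]
    Path(scp_path).write_text("\n".join(lines) + "\n", encoding="utf-8")


def read_wav_scp(scp_path):
    """Parse a wav.scp file into a {wav_id: wav_path} dict."""
    table = {}
    for line in Path(scp_path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first run of whitespace only, so paths may contain spaces.
        wav_id, wav_path = line.split(maxsplit=1)
        table[wav_id] = wav_path
    return table


if __name__ == "__main__":
    # Hypothetical IDs and paths, written to a temporary directory for illustration.
    with tempfile.TemporaryDirectory() as tmp_dir:
        scp_file = Path(tmp_dir) / "wav.scp"
        write_wav_scp(
            [("utt1", "/data/audio/utt1.wav"), ("utt2", "/data/audio/utt2.wav")],
            scp_file,
        )
        print(read_wav_scp(scp_file))
```

A file produced this way could then be given to the command above through `++input` in place of a single audio file.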
from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess
model_dir = "iic/SenseVoiceSmall"
model = AutoModel(
    model=model_dir,
    vad_model="fsmn-vad",
    vad_kwargs={"max_single_segment_time": 30000},
    device="cuda:0",
)
# en
res = model.generate(
    input=f"{model.model_path}/example/en.mp3",
    cache={},
    language="auto",  # "zh", "en", "yue", "ja", "ko", "nospeech"
    use_itn=True,
    batch_size_s=60,
    merge_vad=True,
    merge_length_s=15,
)
text = rich_transcription_postprocess(res[0]["text"])
print(text)
Parameter Description:
- model_dir: The name of the model, or the path to the model on the local disk.
- vad_model: This indicates the activation of VAD (Voice Activity Detection). The purpose of VAD is to split long audio into shorter clips. In this case, the inference time includes both VAD and SenseVoice total consumption, and represents the end-to-end latency. If you wish to test the SenseVoice model's inference time separately, the VAD model can be disabled.
- vad_kwargs: Specifies the configurations for the VAD model. max_single_segment_time denotes the maximum duration for audio segmentation by the vad_model, with the unit being milliseconds (ms).
- use_itn: Whether the output result includes punctuation and inverse text normalization.
- batch_size_s: Indicates the use of dynamic batching, where the total duration of audio in the batch is measured in seconds (s).
- merge_vad: Whether to merge short audio fragments segmented by the VAD model, with the merged length being merge_length_s, in seconds (s).
- ban_emo_unk: Whether to ban the output of the emo_unk token.

from funasr import AutoModel
# paraformer-zh is a multi-functional asr model
# use vad, punc, spk or not as you need
model = AutoModel(model="paraformer-zh", vad_model="fsmn-vad", punc_model="ct-punc",
# spk_model="cam++",
)
res = model.generate(input=f"{model.model_path}/example/asr_example.wav",
batch_size_s=300,
hotword='魔搭')
print(res)
Note: hub represents the model repository: ms selects download from ModelScope, and hf selects download from Hugging Face.
from funasr import AutoModel
chunk_size = [0, 10, 5] #[0, 10, 5] 600ms, [0, 8, 4] 480ms
encoder_chunk_look_back = 4 #number of chunks to lookback for encoder self-attention
decoder_chunk_look_back = 1 #number of encoder chunks to lookback for decoder cross-attention
model = AutoModel(model="paraformer-zh-streaming")
import soundfile
import os
wav_file = os.path.join(model.model_path, "example/asr_example.wav")
speech, sample_rate = soundfile.read(wav_file)
chunk_stride = chunk_size[1] * 960  # samples per chunk: 10 * 960 = 9600 (600 ms at 16 kHz)
cache = {}
total_chunk_num = int((len(speech) - 1) / chunk_stride + 1)
for i in range(total_chunk_num):
    speech_chunk = speech[i * chunk_stride:(i + 1) * chunk_stride]
    is_final = i == total_chunk_num - 1
    res = model.generate(input=speech_chunk, cache=cache, is_final=is_final, chunk_size=chunk_size, encoder_chunk_look_back=encoder_chunk_look_back, decoder_chunk_look_back=decoder_chunk_look_back)
    print(res)
Note: chunk_size is the configuration for streaming latency. [0,10,5] indicates that the real-time display granularity is 10*60=600ms and the lookahead information is 5*60=300ms. Each inference input is 600ms (16000*0.6=9600 sample points), and the output is the corresponding text. For the last speech segment, is_final=True must be set to force the output of the last word.
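The latency arithmetic can be checked with a short sketch. It assumes a 16 kHz sample rate and 60 ms per chunk_size unit, so a 600 ms chunk is 16000 * 0.6 = 9600 samples. The function name `streaming_chunk_plan` is illustrative and not part of FunASR.

```python
import math

SAMPLE_RATE = 16000  # Hz; assumed, matching the 16 kHz audio used in the example
UNIT_MS = 60         # each chunk_size unit corresponds to 60 ms


def streaming_chunk_plan(chunk_size, num_samples):
    """Derive latency figures and the number of inference calls.

    chunk_size is a three-element list such as [0, 10, 5]; only the second
    (chunk length) and third (lookahead) entries are used here.
    """
    _, chunk_units, lookahead_units = chunk_size
    chunk_ms = chunk_units * UNIT_MS
    lookahead_ms = lookahead_units * UNIT_MS
    chunk_stride = SAMPLE_RATE * chunk_ms // 1000  # samples fed per call
    total_chunks = math.ceil(num_samples / chunk_stride)
    return {
        "chunk_ms": chunk_ms,
        "lookahead_ms": lookahead_ms,
        "chunk_stride": chunk_stride,
        "total_chunks": total_chunks,
    }


if __name__ == "__main__":
    five_seconds = 5 * SAMPLE_RATE
    for config in ([0, 10, 5], [0, 8, 4]):
        print(config, streaming_chunk_plan(config, five_seconds))
```

Smaller chunk_size values lower the display latency at the cost of more inference calls for the same audio.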
from funasr import AutoModel
model = AutoModel(model="fsmn-vad")
wav_file = f"{model.model_path}/example/vad_example.wav"
res = model.generate(input=wav_file)
print(res)
Note: The output format of the VAD model is [[beg1, end1], [beg2, end2], ..., [begN, endN]], where begN/endN indicates the starting/ending point of the N-th valid audio segment, measured in milliseconds.
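Because the segments are expressed in milliseconds, they have to be converted to sample indices before a waveform can be sliced. The following is a minimal sketch, assuming the segment list has already been extracted from the result and the audio is a 1-D sequence with a known sample rate. The helper names and the example values are hypothetical.

```python
def slice_segments(samples, sample_rate, segments_ms):
    """Cut a 1-D sample sequence into the [beg, end] millisecond ranges given."""
    clips = []
    for beg_ms, end_ms in segments_ms:
        beg = beg_ms * sample_rate // 1000
        # Clamp the end so a segment never runs past the available audio.
        end = min(end_ms * sample_rate // 1000, len(samples))
        clips.append(samples[beg:end])
    return clips


def total_speech_ms(segments_ms):
    """Sum the durations of all segments, in milliseconds."""
    return sum(end_ms - beg_ms for beg_ms, end_ms in segments_ms)


if __name__ == "__main__":
    sample_rate = 16000
    samples = list(range(sample_rate))    # one second of placeholder audio
    segments = [[100, 250], [500, 1000]]  # hypothetical VAD output
    clips = slice_segments(samples, sample_rate, segments)
    print([len(clip) for clip in clips], total_speech_ms(segments))
```

The same slicing works on a NumPy array such as the one returned by `soundfile.read`.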
from funasr import AutoModel
chunk_size = 200 # ms
model = AutoModel(model="fsmn-vad")
import soundfile
wav_file = f"{model.model_path}/example/vad_example.wav"
speech, sample_rate = soundfile.read(wav_file)
chunk_stride = int(chunk_size * sample_rate / 1000)
cache = {}
total_chunk_num = int((len(speech) - 1) / chunk_stride + 1)
for i in range(total_chunk_num):
    speech_chunk = speech[i * chunk_stride:(i + 1) * chunk_stride]
    is_final = i == total_chunk_num - 1
    res = model.generate(input=speech_chunk, cache=cache, is_final=is_final, chunk_size=chunk_size)
    if len(res[0]["value"]):
        print(res)
Note: The output format for the streaming VAD model can be one of four scenarios:
- [[beg1, end1], [beg2, end2], .., [begN, endN]]: The same as the offline VAD output result mentioned above.
- [[beg, -1]]: Indicates that only a starting point has been detected.
- [[-1, end]]: Indicates that only an ending point has been detected.
- []: Indicates that neither a starting point nor an ending point has been detected.

The output is measured in milliseconds and represents the absolute time from the starting point.
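A consumer of the streaming output usually has to pair a start seen in one chunk with an end seen in a later chunk. The sketch below shows one way to do that, assuming each chunk yields a list in one of the four shapes described above (for example, the `res[0]["value"]` entry from the loop). `SegmentAssembler` is a hypothetical helper and not part of FunASR.

```python
class SegmentAssembler:
    """Stitch per-chunk streaming VAD outputs into complete [beg, end] segments."""

    def __init__(self):
        self.pending_start = None  # start time (ms) still waiting for its end
        self.segments = []         # all completed segments so far

    def update(self, chunk_output):
        """Consume one chunk's output and return segments completed by it."""
        completed = []
        for beg, end in chunk_output:
            if beg == -1 and end == -1:
                continue  # carries no information; ignore defensively
            if beg != -1 and end != -1:
                completed.append([beg, end])    # whole segment in one chunk
            elif end == -1:
                self.pending_start = beg        # start only: remember it
            elif self.pending_start is not None:
                completed.append([self.pending_start, end])  # end closes the start
                self.pending_start = None
        self.segments.extend(completed)
        return completed


if __name__ == "__main__":
    # Hypothetical per-chunk outputs, in milliseconds.
    stream = [[], [[70, -1]], [], [[-1, 2340]], [[2620, 4000]], [[4500, -1]]]
    assembler = SegmentAssembler()
    for chunk_output in stream:
        for segment in assembler.update(chunk_output):
            print("completed:", segment)
    print("still open from:", assembler.pending_start)
```

An end point that arrives without a remembered start is dropped here; a real consumer might prefer to log it instead.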
from funasr import AutoModel
model = AutoModel(model="ct-punc")
res = model.generate(input="那今天的会就到这里吧 happy new year 明年见")
print(res)
from funasr import AutoModel
model = AutoModel(model="fa-zh")
wav_file = f"{model.model_path}/example/asr_example.wav"
text_file = f"{model.model_path}/example/text.txt"
res = model.generate(input=(wav_file, text_file), data_type=("sound", "text"))
print(res)
from funasr import AutoModel
model = AutoModel(model="emotion2vec_plus_large")
wav_file = f"{model.model_path}/example/test.wav"
res = model.generate(wav_file, output_dir="./outputs", granularity="utterance", extract_embedding=False)
print(res)
funasr-export ++model=paraformer ++quantize=false ++device=cpu
from funasr import AutoModel
model = AutoModel(model="paraformer", device="cpu")
res = model.export(quantize=False)
# pip3 install -U funasr-onnx
from funasr_onnx import Paraformer
model_dir = "damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch"
model = Paraformer(model_dir, batch_size=1, quantize=True)
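import os
# Expand "~" explicitly; Python does not expand it in plain path strings.
wav_path = [os.path.expanduser('~/.cache/modelscope/hub/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/example/asr_example.wav')]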
result = model(wav_path)
print(result)
For more examples, refer to the demo.
FunASR supports deploying pre-trained or further fine-tuned models as a service. For details on the supported deployment types, please refer to the service deployment documentation.
If you encounter problems while using FunASR, you can raise an issue directly on the GitHub page.
You can also join the DingTalk community group for communication and discussion.
The contributors can be found in the contributors list.
This project is licensed under The MIT License. FunASR also contains various third-party components and some code modified from other repos under other open source licenses.
Use of the pretrained models is subject to the model license.
@inproceedings{gao2023funasr,
author={Zhifu Gao and Zerui Li and Jiaming Wang and Haoneng Luo and Xian Shi and Mengzhe Chen and Yabin Li and Lingyun Zuo and Zhihao Du and Zhangyu Xiao and Shiliang Zhang},
title={FunASR: A Fundamental End-to-End Speech Recognition Toolkit},
year={2023},
booktitle={INTERSPEECH},
}
@inproceedings{An2023bat,
author={Keyu An and Xian Shi and Shiliang Zhang},
title={BAT: Boundary aware transducer for memory-efficient and low-latency ASR},
year={2023},
booktitle={INTERSPEECH},
}
@inproceedings{gao22b_interspeech,
author={Zhifu Gao and ShiLiang Zhang and Ian McLoughlin and Zhijie Yan},
title={Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition},
year=2022,
booktitle={Proc. Interspeech 2022},
pages={2063--2067},
doi={10.21437/Interspeech.2022-9996}
}
@inproceedings{shi2023seaco,
author={Xian Shi and Yexin Yang and Zerui Li and Yanni Chen and Zhifu Gao and Shiliang Zhang},
title={SeACo-Paraformer: A Non-Autoregressive ASR System with Flexible and Effective Hotword Customization Ability},
year={2023},
booktitle={ICASSP2024}
}