Chinese text normalization for speech processing
Search for “Text Normalization” (TN) on Google and GitHub, and you can hardly find open-source projects that are “ready-to-use” for text normalization tasks. Instead, you find a bunch of NLP toolkits or frameworks that support TN functionality. There is quite some work between “supporting text normalization” and “doing text normalization”.
TN is language-dependent, more or less.
Some TN processing methods are shared across languages, but a good TN module always involves language-specific knowledge and treatment.
TN is task-specific.
Even for the same language, different applications require quite different TN.
TN is “dirty”
Constructing and maintaining a set of TN rewrite rules is painful, whatever toolkits and frameworks you choose. The subtle, intrinsic complexity lies in the TN task itself, not in the tools or frameworks.
A mature TN module is an asset
Since constructing and maintaining TN is hard, a mature TN module is an asset for commercial companies, so you are unlikely to find a product-level TN in the open-source community (correct me if you find any).
TN is a lower-priority topic for both academia and industry.
This project sets up a ready-to-use TN module for Chinese. Since my background is speech processing, this project should be able to handle the most common TN tasks in Chinese ASR text processing pipelines.
supported NSW (Non-Standard-Word) Normalization
NSW type | raw | normalized |
---|---|---|
cardinal | 这块黄金重达324.75克 | 这块黄金重达三百二十四点七五克 |
date | 她出生于86年8月18日,她弟弟出生于1995年3月1日 | 她出生于八六年八月十八日 她弟弟出生于一九九五年三月一日 |
digit | 电影中梁朝伟扮演的陈永仁的编号27149 | 电影中梁朝伟扮演的陈永仁的编号二七一四九 |
fraction | 现场有7/12的观众投出了赞成票 | 现场有十二分之七的观众投出了赞成票 |
money | 随便来几个价格12块5,34.5元,20.1万 | 随便来几个价格十二块五 三十四点五元 二十点一万 |
percentage | 明天有62%的概率降雨 | 明天有百分之六十二的概率降雨 |
telephone | 这是固话0421-33441122 这是手机+86 18544139121 | 这是固话零四二一三三四四一一二二 这是手机八六一八五四四一三九一二一 |
Acknowledgement: the NSW normalization code is based on Zhiyang Zhou’s work.
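To make the table more concrete, here is a minimal sketch of how a regex-based normalizer could produce the cardinal, digit, and percentage rows. It is not the project's actual code: the names `read_digits`, `read_integer`, `read_number`, and `normalize` are hypothetical, and the choice to read integers of 10^8 or more digit by digit is an assumption made only to keep the example short.

```python
import re

DIGITS = "零一二三四五六七八九"
UNITS = ["", "十", "百", "千"]


def read_digits(s):
    """Read a digit string one digit at a time, e.g. IDs and phone numbers."""
    return "".join(DIGITS[int(c)] for c in s)


def _section(n):
    """Read 0 < n < 10000 with 千/百/十, inserting 零 for internal gaps."""
    out, zero_pending = "", False
    for power in (3, 2, 1, 0):
        d = n // 10 ** power % 10
        if d == 0:
            zero_pending = bool(out)  # a gap only counts after a non-zero digit
        else:
            if zero_pending:
                out += "零"
                zero_pending = False
            out += DIGITS[d] + UNITS[power]
    return out


def read_integer(n):
    """Read a non-negative integer as a Chinese cardinal."""
    if n == 0:
        return "零"
    if n >= 10 ** 8:
        # Assumption for this sketch: fall back to digit-by-digit reading.
        return read_digits(str(n))
    high, low = divmod(n, 10000)
    out = ""
    if high:
        out += _section(high) + "万"
        if 0 < low < 1000:
            out += "零"
    if low:
        out += _section(low)
    # 10..19 are read 十, 十一, ... rather than 一十, 一十一, ...
    return out[1:] if out.startswith("一十") else out


def read_number(s):
    """Read an integer or decimal string; the fraction is read digit by digit."""
    integer, _, fraction = s.partition(".")
    out = read_integer(int(integer))
    if fraction:
        out += "点" + read_digits(fraction)
    return out


PERCENT = re.compile(r"(\d+(?:\.\d+)?)%")
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def normalize(text):
    # More specific patterns run first so the generic cardinal rule
    # does not consume their digits.
    text = PERCENT.sub(lambda m: "百分之" + read_number(m.group(1)), text)
    return NUMBER.sub(lambda m: read_number(m.group(0)), text)


print(normalize("明天有62%的概率降雨"))
print(normalize("这块黄金重达324.75克"))
```

The ordering in `normalize` matters: running the generic number rule first would leave a stray `%` behind, which is one reason rule sets like this are tedious to maintain.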
punctuation removal
For Chinese, it removes the punctuation collected in the Zhon project, which consists of:
'"#$%&'()*+,-/:;<=>@[\]^_`{|}~⦅⦆「」、、〃》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏'
'!?。。'
For English, it removes Python’s string.punctuation
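The removal step can be sketched with a translation table. This is only an illustration under assumptions: `remove_punctuation` is a hypothetical name, and `CN_PUNCT_SAMPLE` is a small hand-picked subset of Chinese punctuation, not the full Zhon list.

```python
import string

# A small illustrative subset of Chinese punctuation (NOT the full Zhon list).
# The escapes are full-width comma, exclamation mark, question mark,
# colon, and semicolon.
CN_PUNCT_SAMPLE = "\uff0c\uff01\uff1f\uff1a\uff1b" + "。、「」『』【】“”‘’…—"


def remove_punctuation(text, extra=CN_PUNCT_SAMPLE):
    """Delete ASCII punctuation plus the given extra characters."""
    table = str.maketrans("", "", string.punctuation + extra)
    return text.translate(table)


print(remove_punctuation("Hello, world!"))
print(remove_punctuation("你好\uff0c世界。"))
```

Note that deleting punctuation outright joins the neighboring characters. The NSW table above shows spaces where commas used to be, so a real pipeline may prefer to map some marks to a space instead.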
multilingual English word upper/lower case conversion
Since ASR/TTS lexicons usually unify English entries to either uppercase or lowercase, the TN module should convert English words to match the lexicon.
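A minimal sketch of the case conversion, assuming a hypothetical helper `unify_case` (the project's own option names are not shown here). Chinese characters have no case, so Python's `str.upper` and `str.lower` leave them unchanged and only the embedded English words are affected.

```python
def unify_case(text, to_upper=True):
    """Convert English letters to one case; Chinese characters are unaffected."""
    return text.upper() if to_upper else text.lower()


print(unify_case("打开 wifi 连接"))
print(unify_case("iPhone 手机", to_upper=False))
```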
plain text, one sentence per line(.txt)
今天早饭吃了没
没吃回家吃去吧
...
Plain text is the default format.
Kaldi’s archive format(.ark)
KALDI_KEY_UTT001 今天早饭吃了没
KALDI_KEY_UTT002 没吃回家吃去吧
...
TN skips the key in the first column and normalizes the transcription text that follows. Pass the --format ark option to switch to Kaldi ark format.
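The key-skipping behavior can be sketched as follows. This is an illustration, not the project's implementation: `normalize_ark_line` is a hypothetical name, and `digits_to_chinese` is a stand-in normalizer used only to show that the key is left untouched.

```python
def digits_to_chinese(text):
    """Stand-in normalizer for the demo: read every digit one by one."""
    return text.translate(str.maketrans("0123456789", "零一二三四五六七八九"))


def normalize_ark_line(line, normalize):
    """Keep the first whitespace-separated column (the key) as-is and
    normalize only the transcription that follows it."""
    parts = line.strip().split(maxsplit=1)
    if not parts:
        return ""
    if len(parts) == 1:  # a key with an empty transcription
        return parts[0]
    key, text = parts
    return f"{key} {normalize(text)}"


print(normalize_ark_line("KALDI_KEY_UTT001 今晚8点整\n", digits_to_chinese))
```

Without this split, the digits in a key such as `KALDI_KEY_UTT001` would be rewritten too, and the transcription could no longer be matched to its audio.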
table format(.tsv)
ID AUDIO TEXT
UTT01 audio/UTT01.wav 今晚8点整中央5播出2020年总决赛
...
Pass the --format tsv option; normalization will apply to the TEXT field only.
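Restricting normalization to one column can be sketched like this, assuming a tab-separated file whose first line is the header. The names `normalize_tsv` and `digits_to_chinese` are hypothetical, and the stand-in normalizer exists only for the demo.

```python
def digits_to_chinese(text):
    """Stand-in normalizer for the demo: read every digit one by one."""
    return text.translate(str.maketrans("0123456789", "零一二三四五六七八九"))


def normalize_tsv(lines, normalize, field="TEXT"):
    """Yield TSV lines with only the named column normalized."""
    lines = iter(lines)
    header_line = next(lines, None)
    if header_line is None:  # empty input
        return
    header = header_line.rstrip("\n").split("\t")
    col = header.index(field)  # raises ValueError if the column is missing
    yield "\t".join(header)
    for line in lines:
        cells = line.rstrip("\n").split("\t")
        cells[col] = normalize(cells[col])
        yield "\t".join(cells)


rows = [
    "ID\tAUDIO\tTEXT\n",
    "UTT01\taudio/UTT01.wav\t今晚8点整\n",
]
for out in normalize_tsv(rows, digits_to_chinese):
    print(out)
```

The ID and AUDIO columns pass through unchanged, so digits in utterance IDs and file paths are never rewritten.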
note: All input text should be UTF-8 encoded.
Make sure you have Python 3; Python 2.x won’t work correctly. Run sh run.sh in the TN dir, and compare the raw text with the normalized text.
Make sure you have Thrax installed and that the Thrax binaries can be found on your PATH. Run sh run.sh in the ITN dir. Check the Makefile for grammar dependencies.
Since TN is a typical “done is better than perfect” module in the context of ASR, and the current state is sufficient for my purpose, I probably won’t update this repo frequently.
There are indeed some things that need to be improved:
For TN, the NSW normalizers in the TN dir are based on regular expressions. I’ve found some unintended matches, so those patterns need to be refined for more precise TN coverage.
For ITN, the Thrax rewriting grammars need to be extended to cover more scenarios.
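As an illustration of the kind of unintended match a regex-based normalizer can produce, consider a fraction rule. Both patterns below are hypothetical and are not taken from this repo: the naive one fires inside a slash-separated date, while lookarounds that reject neighboring digits or slashes keep it confined to standalone fractions.

```python
import re

# Hypothetical patterns for illustration only.
NAIVE_FRACTION = re.compile(r"(\d+)/(\d+)")
STRICT_FRACTION = re.compile(r"(?<![\d/])(\d+)/(\d+)(?![\d/])")

date_like = "2020/12/25"
fraction = "现场有7/12的观众投出了赞成票"

# The naive rule wrongly treats the start of the date as a fraction.
print(NAIVE_FRACTION.findall(date_like))
# The stricter rule ignores the date but still finds the real fraction.
print(STRICT_FRACTION.findall(date_like))
print(STRICT_FRACTION.findall(fraction))
```

Each refinement of this sort narrows one pattern, but interactions between patterns still have to be checked against real data.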
Furthermore, commercial systems nowadays are starting to introduce RNN-like models into TN, and a hybrid of rule-based and model-based components is the state of the art. For further reading, look for Richard Sproat and Kyle Gorman’s work at Google.