Near-Realtime audio transcription using self-hosted Whisper and WebSocket in Python/JS
VoiceStreamAI is a Python 3-based server and JavaScript client solution that
enables near-realtime audio streaming and transcription over WebSocket. The
system employs Hugging Face's Voice Activity Detection (VAD) and OpenAI's Whisper
model (faster-whisper being the default) for accurate speech recognition and processing.
Demo video: https://github.com/alesaccoia/VoiceStreamAI/assets/1385023/9b5f2602-fe0b-4c9d-af9e-4662e42e23df
This guide does not cover in detail how to use CUDA in Docker; see, for
example, the NVIDIA Container Toolkit installation guide.
Still, these are the commands for Linux:
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
You can build the container image with:
sudo docker build -t voicestreamai .
After getting your VAD token (see next sections) run:
sudo docker volume create huggingface_models
sudo docker run --gpus all -p 8765:8765 -v huggingface_models:/root/.cache/huggingface -e PYANNOTE_AUTH_TOKEN='VAD_TOKEN_HERE' voicestreamai
The volume mount lets you avoid re-downloading the Hugging Face models each
time you re-run the container. If you don't need this, just use:
sudo docker run --gpus all -p 8765:8765 -e PYANNOTE_AUTH_TOKEN='VAD_TOKEN_HERE' voicestreamai
To set up the VoiceStreamAI server, you need Python 3.8 or later and the
following packages:
transformers
pyannote.core
pyannote.audio
websockets
asyncio
sentence-transformers
faster-whisper
Install these packages using pip:
pip install -r requirements.txt
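As an optional sanity check (a suggestion, not a required step), you can verify that PyTorch, which is installed as a dependency of pyannote.audio, can see your GPU before starting the server:
# Optional check that CUDA is visible to the Python environment.
import torch
print("CUDA available:", torch.cuda.is_available(), "- devices:", torch.cuda.device_count())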
For the client-side, you need a modern web browser with JavaScript support.
The VoiceStreamAI server can be customized through command line arguments,
allowing you to specify components, host, and port settings according to your
needs.
--vad-type: Specifies the type of Voice Activity Detection (VAD) pipeline to use (default: pyannote).
--vad-args: A JSON string containing additional arguments for the VAD pipeline (required for pyannote: '{"auth_token": "VAD_AUTH_HERE"}').
--asr-type: Specifies the type of Automatic Speech Recognition (ASR) pipeline to use (default: faster_whisper).
--asr-args: A JSON string containing additional arguments for the ASR pipeline (for example, model_name for whisper).
--host: Sets the host address for the WebSocket server (default: 127.0.0.1).
--port: Sets the port on which the server listens (default: 8765).
--certfile: The path to the SSL certificate (cert file) if using secure websockets (default: None).
--keyfile: The path to the SSL key file if using secure websockets (default: None).
For running the server with the standard configuration:
python3 -m src.main --vad-args '{"auth_token": "vad token here"}'
You can see all the command line options with the command:
python3 -m src.main --help
To use the client, open the client/index.html file in a web browser and set
the WebSocket server address (default: ws://localhost:8765).
Both the VAD and the ASR components can be easily extended to integrate new
techniques and use models with a different interface than HuggingFace
pipelines. New processing/chunking strategies can be added in server.py and
used by specific clients by setting the "processing_strategy" key in the
config.
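As a purely illustrative sketch (the class and method names below are hypothetical and are not taken from the repository; check the existing VAD/ASR classes under src/ for the actual interface), a new ASR backend conceptually boils down to a small class that takes a saved audio chunk and returns text:
# Hypothetical sketch only: CustomASR and transcribe() are placeholder names,
# not the real interface of this project.
class CustomASR:
    def __init__(self, model_name="example-model"):
        self.model_name = model_name  # load your model/weights here

    def transcribe(self, audio_file_path, language=None):
        # Run inference on the saved audio chunk and return the recognized text.
        return "recognized text"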
Voice Activity Detection (VAD) in VoiceStreamAI enables the system to
distinguish between speech and non-speech segments within an audio stream. The
primary purpose of implementing VAD is to enhance the efficiency and accuracy of
the speech-to-text process:
VoiceStreamAI uses a Huggingface VAD model to ensure reliable detection of
speech in diverse audio conditions.
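For reference, this is a minimal standalone sketch of how a pyannote voice activity detection pipeline is typically used; the exact model name and the way VoiceStreamAI wires it in may differ:
# Minimal pyannote VAD example; requires a Hugging Face token with access to
# the pyannote models (the same token passed via PYANNOTE_AUTH_TOKEN).
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/voice-activity-detection",
    use_auth_token="VAD_AUTH_TOKEN_HERE",
)
annotation = pipeline("audio_chunk.wav")
for speech in annotation.get_timeline().support():
    print(f"speech from {speech.start:.2f}s to {speech.end:.2f}s")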
The buffering strategy is designed to balance near-real-time processing with
complete and accurate capture of speech segments.
In VoiceStreamAI, each client can have a unique configuration that tailors the
transcription process to their specific needs. This personalized setup is
achieved through a messaging system where the JavaScript client sends
configuration details to the Python server. This section explains how these
configurations are structured and transmitted.
The client configuration can include various parameters such as language
preference, chunk length, and chunk offset. For instance:
language: Specifies the language for transcription. If set to anything other than multilanguage, transcription is constrained to that language.
processing_strategy: Specifies the type of processing for this client (which processing/chunking strategy the server should use).
chunk_length_seconds: Defines the length of each audio chunk to be processed.
chunk_offset_seconds: Determines the silence time at the end of each chunk before the server processes the audio.
Initialization: When a client initializes a connection with the server,
it can optionally send a configuration message. This message is a JSON object
containing key-value pairs representing the client’s preferred settings.
JavaScript Client Setup: On the demo client, the configuration is
gathered from the user interface elements (like dropdowns and input fields).
Once the audio starts flowing, a JSON object is created and sent to the
server via WebSocket. For example:
function sendAudioConfig() {
const audioConfig = {
type: "config",
data: {
chunk_length_seconds: 5,
chunk_offset_seconds: 1,
processing_strategy: 1,
language: language,
},
};
websocket.send(JSON.stringify(audioConfig));
}
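The same handshake can also be exercised from Python using the websockets package already listed in the requirements. This is only a sketch: the assumption that raw audio is streamed as binary WebSocket frames after the config message should be verified against the code in client/.
# Hypothetical Python test client; audio framing and sample format are
# assumptions, so check the JavaScript client before relying on this.
import asyncio
import json
import websockets

async def main():
    async with websockets.connect("ws://localhost:8765") as ws:
        config = {
            "type": "config",
            "data": {
                "chunk_length_seconds": 5,
                "chunk_offset_seconds": 1,
                "processing_strategy": 1,
                "language": "en",
            },
        }
        await ws.send(json.dumps(config))
        with open("sample.raw", "rb") as f:  # raw PCM audio, placeholder file
            await ws.send(f.read())          # binary frame with the audio bytes
        print(await ws.recv())               # transcription message from the server

asyncio.run(main())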
When implementing a new ASR, VAD, or buffering strategy, you can test it with:
pip install -r requirements-dev.txt
export PYANNOTE_AUTH_TOKEN=<VAD_TOKEN_HERE>
ASR_TYPE=faster_whisper python -m unittest test.server.test_server
Please make sure that the environment variables are in place, for example for
the VAD auth token. Several other tests are in place, for example for the
standalone ASR.
Currently, VoiceStreamAI processes audio by saving chunks to files and then
running these files through the models.
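As a rough illustration of that flow (an example, not the server's exact code), a saved chunk can be pushed through faster-whisper directly:
# Sketch of the save-then-transcribe flow with faster-whisper; model size,
# device and compute type are example choices, not the server defaults.
from faster_whisper import WhisperModel

model = WhisperModel("small", device="cuda", compute_type="float16")
segments, info = model.transcribe("chunk_0001.wav")
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")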
Fork and clone this repository. Install dependencies and related tools.
pip install -r requirements.txt
pip install -r requirements-dev.txt
npm install -g jshint eslint
Add your modifications to the repository and run code style checks manually,
or integrate them into your IDE/editor.
# For Python
flake8 src/ test/
black --line-length 79 src/ test/
isort src/ test/
# For JavaScript
jshint client/*.js
eslint client/*.js
Finally, push and create a pull request.
This project is open for contributions. Feel free to fork the repository and
submit pull requests.