This project is a simple implementation of live captioning and translation using Azure Cognitive Services. The project uses the Speech SDK to capture audio from the microphone, send it to the Azure Speech Service for live transcription and translation, and then display the transcribed and translated text in the browser.
Parts of the code are adapted from the sample repository for the Microsoft Cognitive Services Speech SDK.
Demo Video:
Azure.Live.Speech.Translation.Demo.mov
Requirements:

- Azure Speech Service subscription key and region
- Python 3.12 or later
- uv
Features:

- Live captioning
- Live translation
- Auto-detect the speaker's language with continuous language identification
- OBS integration
- Mobile mode
- TV mode
Installation:

1. Clone the repository.

   If you already know how to clone a repository, you can skip to the next step.

2. Install uv.

   Use uv to set up the required version of Python.

   For macOS / Linux users:

   ```sh
   curl -LsSf https://astral.sh/uv/install.sh | sh
   ```

   For Windows users:

   ```powershell
   powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
   ```

3. Set up Python and install the dependencies:

   ```sh
   uv sync
   ```
4. Set the Azure Speech Service subscription key and region in the `.env` file:

   ```
   AZURE_SPEECH_KEY=<your-subscription-key>
   AZURE_SPEECH_REGION=<your-region>
   ```
5. Set the list of candidate languages and target languages for translation in the `main.py` file:

   ```python
   config = {
       ...
       "detect_languages": ["en-US", "zh-TW", "ja-JP"],
       "target_languages": ["zh-Hant", "en"],
       ...
   }
   ```

   Please note that the target languages must be among the languages supported by the Azure Speech Service. You can find the list of supported languages here.

   Note: If you want to transcribe speech only, set `"target_languages"` to `[]`.

   A sketch of how these settings typically map onto the Speech SDK is shown below.
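For orientation only, here is a minimal, self-contained sketch of how candidate languages, target languages, and continuous language identification typically map onto the Azure Speech SDK. This is not this project's code: only the environment variable names and the example language lists come from this README; the rest assumes the `azure-cognitiveservices-speech` package and its translation APIs.

```python
import os
import azure.cognitiveservices.speech as speechsdk

# Credentials come from the same variables configured in the .env file above.
translation_config = speechsdk.translation.SpeechTranslationConfig(
    subscription=os.environ["AZURE_SPEECH_KEY"],
    region=os.environ["AZURE_SPEECH_REGION"],
)

# Roughly what "target_languages" controls: the languages to translate into.
for lang in ["zh-Hant", "en"]:
    translation_config.add_target_language(lang)

# Roughly what "detect_languages" controls: the candidate source languages.
auto_detect = speechsdk.languageconfig.AutoDetectSourceLanguageConfig(
    languages=["en-US", "zh-TW", "ja-JP"]
)

# Request continuous (rather than at-start) language identification.
# Per the Azure docs, continuous LID may additionally require the v2 endpoint.
translation_config.set_property(
    speechsdk.PropertyId.SpeechServiceConnection_LanguageIdMode, "Continuous"
)

recognizer = speechsdk.translation.TranslationRecognizer(
    translation_config=translation_config,
    auto_detect_source_language_config=auto_detect,
    audio_config=speechsdk.audio.AudioConfig(use_default_microphone=True),
)

# Print each final recognition result together with its translations.
recognizer.recognized.connect(
    lambda evt: print(evt.result.text, dict(evt.result.translations))
)

recognizer.start_continuous_recognition()
input("Listening... press Enter to stop\n")
recognizer.stop_continuous_recognition()
```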
Run the following command to start the application:

```sh
uv run --env-file=.env python main.py
```

Then open your browser and go to http://127.0.0.1:3000/ to see the live caption and translation. You can also open the page as a browser source in a broadcasting application such as OBS to show the live caption and translation in your live stream.
You can select the displayed languages by adding a query string such as ?language=original,en to the URL. The first language is shown at the bottom and the second language at the top.
Note that the application picks up your system's default microphone.
You can also use the mobile mode by accessing http://127.0.0.1:3000/mobile. You can select the speaker's original language or a target language from the dropdown in the top-left corner.
You can also use the TV mode by accessing http://127.0.0.1:3000/tv. In this mode, the text is displayed in a larger font size on a black background, and all selected languages are shown in the same block. As with the other views, you can select languages with a query string such as ?language=original,en in the URL.
You can also run the application in a client-server architecture. In this case, the translation service runs on your local machine, the client pages are hosted on an external server, and the service sends the live caption and translation to the clients through an external Socket.IO server, as sketched after the diagram below.
```mermaid
graph LR
    B[Live Translation]
    C[External Socket.IO/Web Server]
    B --> |emit| C
    subgraph Clients
        D[TV]
        E[Mobile]
        F[OBS]
    end
    C --> |broadcast| D
    C --> |broadcast| E
    C --> |broadcast| F
```
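To make the emit arrow above concrete, here is a hedged sketch of how a translation service could push captions to the external Socket.IO server. It uses the `python-socketio` client package; the `caption` event name and the payload shape are illustrative assumptions, not this project's actual protocol.

```python
import socketio

# Connect to the external Socket.IO server that the web clients also use.
sio = socketio.Client()
sio.connect("http://127.0.0.1:3000", socketio_path="/socket.io")

def publish_caption(original: str, translations: dict) -> None:
    # Hypothetical "caption" event: forward the latest recognition result
    # to the room that the TV / Mobile / OBS pages are listening in.
    sio.emit("caption", {
        "room": "9d2b8c9b-6ae9-45e9-81be-8f3d4d549fdd",  # matches "roomid" in main.py
        "original": original,
        "translations": translations,
    })

publish_caption("Hello everyone", {"zh-Hant": "大家好"})
```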
1. Set the remote Socket.IO server endpoint and path in the `main.py` file, and configure the room ID that will receive the caption and translation from the translation service (a sketch of a client listening in this room follows after these steps):

   ```python
   config = {
       ...
       "socketio": {"endpoint": "http://127.0.0.1:3000", "path": "/socket.io"},
       "roomid": "9d2b8c9b-6ae9-45e9-81be-8f3d4d549fdd",
       ...
   }
   ```
2. Build the client with the server URL and host it on the external server. The files will be generated in the `build` folder:

   ```sh
   uv run --env-file=.env python main.py --build
   ```
3. Start the translation service without the local web server:

   ```sh
   uv run --env-file=.env python main.py --disable-server
   ```
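For illustration of the broadcast arrows in the diagram, a custom consumer of the captions could look like the sketch below. This is again a hedged example using `python-socketio`; the `join` and `caption` event names are assumptions rather than the project's documented protocol, and the real consumers are the TV, Mobile, and OBS pages built above.

```python
import socketio

sio = socketio.Client()

@sio.event
def connect():
    # Hypothetical "join" event: subscribe to the room configured as
    # "roomid" in main.py so the server broadcasts captions to us.
    sio.emit("join", {"room": "9d2b8c9b-6ae9-45e9-81be-8f3d4d549fdd"})

@sio.on("caption")
def on_caption(data):
    # Hypothetical "caption" event carrying the original text and translations.
    print(data)

# Use the same endpoint and path configured under "socketio" in main.py.
sio.connect("http://127.0.0.1:3000", socketio_path="/socket.io")
sio.wait()
```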


