Getting Started with Live Transcriptions using Deepgram Speech Recognition API

Learn how to implement live transcriptions in the browser using Deepgram's Speech Recognition API. This step-by-step tutorial covers requesting access to the user's microphone, establishing a connection with Deepgram, and displaying live transcriptions in real-time.

October 19, 2024 at 10:59


Live transcriptions have become an essential feature in many applications, providing users with real-time text representations of spoken audio. In this tutorial, we will explore how to get started with live transcriptions using Deepgram's Speech Recognition API in the browser.

Step 1: Request Access and Get Data from User's Microphone

To begin, we need to request access to the user's media device, specifically an audio device like a microphone. We can achieve this with the browser's built-in navigator.mediaDevices.getUserMedia() API, which returns a promise that resolves to a MediaStream object. By console logging the result, we can see what it contains.
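A minimal sketch of this step might look like the following (meant to run in a browser page, since navigator is a browser global):

```javascript
// Ask the browser for an audio-only media stream.
// getUserMedia() returns a Promise that resolves to a MediaStream,
// and the browser prompts the user for permission the first time.
async function getMicrophone() {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  console.log(stream); // inspect the MediaStream object in the console
  return stream;
}
```

The `{ audio: true }` constraint asks only for audio; requesting video as well would trigger a camera permission prompt the app doesn't need.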

Step 2: Create a Persistent Two-Way Connection with Deepgram

Next, we need to establish a real-time connection with Deepgram that allows for sending and receiving data. This connection enables the exchange of audio data from the user's microphone to Deepgram's speech recognition API.

Step 3: Send Audio Data to Deepgram and Receive Live Transcriptions

Once the connection is established, we can send the audio data from the user's microphone to Deepgram as soon as it becomes available. Deepgram analyzes the audio data in real-time and returns live transcriptions of the audio.

Step 4: Display Live Transcriptions in the Browser Console

Finally, we need to listen for live transcriptions being returned from Deepgram and display them in the browser console for the user to see. This allows the user to see the live transcription of the audio in real-time as it is being spoken.

Media Stream and Deepgram Integration

Step 1: Requesting Access to the Microphone

The browser handles requesting access to the microphone for us. Once allowed, the media stream is logged.

Step 2: Plugging in the Media Stream to a Media Recorder

We create a new MediaRecorder instance and plug in the media stream to the MediaRecorder. We specify the output format that we desire.
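As a sketch, wiring the stream into a recorder could look like this (the "audio/webm" mime type is an assumption that holds in Chromium-based browsers; check MediaRecorder.isTypeSupported() for your targets):

```javascript
// Wrap the MediaStream in a MediaRecorder and specify the output format.
// "audio/webm" is an assumption — verify support in your target browsers
// with MediaRecorder.isTypeSupported('audio/webm').
function createRecorder(stream) {
  return new MediaRecorder(stream, { mimeType: 'audio/webm' });
}
```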

Step 3: Establishing a Persistent Two-Way Connection with Deepgram

We create a new WebSocket instance and connect directly to Deepgram's Live Transcription Endpoint. We provide authentication details, such as an API key, directly.
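Sketched in code, the connection might be opened like this. Browsers cannot attach custom headers to a WebSocket handshake, so the API key is passed through the subprotocol argument instead:

```javascript
// Open a persistent two-way connection to Deepgram's live endpoint.
// The ['token', apiKey] subprotocol carries the authentication details,
// since custom headers are not available in the browser WebSocket API.
function connectToDeepgram(apiKey) {
  return new WebSocket('wss://api.deepgram.com/v1/listen', ['token', apiKey]);
}
```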

Step 4: Preparing and Sending Data from the Mic

We hook into the socket.onopen event and prepare and send data from the microphone as soon as the connection is opened.

Setting up the Media Recorder

We need to add an event listener to the media recorder to capture the audio data from the microphone. The event listener should listen for the "dataavailable" event, which will return the data from the microphone. We need to start the media recorder to make the data available. We set the time slice to 250 milliseconds, which means the data will be packaged up and made available every quarter of a second.

Sending Data to Deepgram

We need to send the audio data to Deepgram for processing. The media recorder packages up the data and surfaces it through the "dataavailable" event; we then send each chunk over the open socket to Deepgram.
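The last three steps — waiting for the socket to open, listening for "dataavailable", and forwarding each chunk — can be sketched together:

```javascript
// Once the socket opens, start the recorder with a 250 ms timeslice so
// audio is packaged up every quarter of a second, then forward each
// chunk of audio data to Deepgram as soon as it becomes available.
function startStreaming(socket, recorder) {
  socket.onopen = () => {
    recorder.addEventListener('dataavailable', (event) => {
      if (event.data.size > 0 && socket.readyState === WebSocket.OPEN) {
        socket.send(event.data);
      }
    });
    recorder.start(250); // timeslice: emit "dataavailable" every 250 ms
  };
}
```

Guarding on `socket.readyState` avoids sending into a socket that has already closed, and skipping zero-byte chunks keeps empty frames off the wire.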

Listening for Messages from Deepgram

We need to listen for messages that are being sent from Deepgram to us. We assign a handler to the socket's onmessage event. The returned payload contains useful data, which we can extract and use. We extract the transcript from the returned payload and console log it.
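A sketch of the receiving side, split into a pure parsing helper and the socket wiring. The channel.alternatives[0].transcript path matches the shape of Deepgram's live transcription responses:

```javascript
// Extract the transcript string from a raw JSON payload returned
// by Deepgram's live transcription endpoint.
function extractTranscript(rawMessage) {
  const data = JSON.parse(rawMessage);
  return data.channel.alternatives[0].transcript;
}

// Wire it up: console log every non-empty transcript as it arrives.
function listenForTranscripts(socket) {
  socket.onmessage = (message) => {
    const transcript = extractTranscript(message.data);
    if (transcript) console.log(transcript);
  };
}
```

Checking for a non-empty transcript matters because Deepgram also sends payloads with empty transcripts during silence.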

Browser Live Transcription

That's all you need to do for live transcription in the browser; from here you can show the transcripts to users or do anything else with them. Grant access to your microphone and you should see transcripts appearing in your console within moments. You'll notice multiple phrases coming back for everything you say: an additional is_final property in the returned payload indicates when a given phrase is in its final form.
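If you only want each phrase once it has settled, a small helper can filter on that is_final flag (a sketch, using the same payload shape as above):

```javascript
// Return the transcript only when Deepgram marks the phrase as final;
// interim (still-changing) phrases come back as null.
function finalTranscriptOrNull(rawMessage) {
  const data = JSON.parse(rawMessage);
  return data.is_final ? data.channel.alternatives[0].transcript : null;
}
```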

Best Practices for Handling API Key

We recently published a blog post on best practices for handling API keys; check the description for a link. Whatever approach you take, make sure you protect your API key from being accessible to users and from having overly broad permissions.
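One common pattern is to never ship a long-lived key to the browser at all, and instead have your own backend mint a short-lived key the page fetches just before connecting. This is a sketch only — the /api/deepgram-token endpoint is hypothetical and would need to be implemented on your server:

```javascript
// Fetch a short-lived Deepgram key from your own backend.
// The /api/deepgram-token endpoint is hypothetical: your server would
// call Deepgram to mint a scoped, expiring key and return it here.
async function fetchTemporaryKey() {
  const response = await fetch('/api/deepgram-token');
  const { key } = await response.json();
  return key;
}
```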

Future Development

If you have any questions at all, feel free to reach out to us. We love to help people and to see what they build with our speech recognition API. Have a wonderful day!