Building a Conversational AI with the Fastest LLM and Text-to-Speech Model
In this blog post, we'll explore how to create a conversational AI by combining the fastest Large Language Model (LLM) with the fastest Text-to-Speech (TTS) model. The TTS model comes from Deepgram, which sponsored this project.
The Pieces Needed for Conversational AI
To build a conversational AI, we need four essential pieces:
- Audio: The audio input from the speaker's mouth into the computer's microphone.
- Speech-to-Text (STT) Model: Transcribes the audio into a string.
- Language Model (LLM): Processes the string and generates a response.
- Text-to-Speech Model (TTS): Converts the response back into audio.
The STT Model
For this project, we'll use Deepgram's Nova-2 model, which is the fastest and most accurate option for our use case. Deepgram offers multiple Nova variants, each trained for different scenarios such as meetings, phone calls, and conversational AI. Nova-2 also supports streaming and endpointing, which detects natural breaks in speech.
Endpointing
Endpointing sets a flag when speech is finalized, indicating the end of an utterance. This flag is essential for driving the conversation forward, as it tells the app when to hand the transcript off to the next step.
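Here's a minimal sketch of that configuration, assuming the Deepgram Python SDK v3; the option names and the 300 ms endpointing value are illustrative and may differ across SDK versions.

```python
from deepgram import DeepgramClient, LiveOptions

# Minimal sketch assuming the Deepgram Python SDK v3; names may vary by version.
deepgram = DeepgramClient("YOUR_DEEPGRAM_API_KEY")
connection = deepgram.listen.live.v("1")

options = LiveOptions(
    model="nova-2",      # streaming STT model
    language="en-US",
    smart_format=True,
    endpointing=300,     # mark speech_final after ~300 ms of silence (illustrative value)
)
connection.start(options)
```

Transcription results then come back over the live connection through a callback, which is the part we walk through in the demo below.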
Demo
In the demo, we'll walk through the part that transcribes our voice using the STT model. The key piece is an async function called onMicActivityResult, which handles the transcription results.
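As a hypothetical sketch of what such a handler can look like (the field names follow Deepgram's live-transcription payload, and transcript_parts is a simple stand-in for the transcript collector covered later):

```python
transcript_parts: list[str] = []  # stand-in for the transcript collector described below

async def on_mic_activity_result(self, result, **kwargs):
    sentence = result.channel.alternatives[0].transcript
    if not sentence:
        return                              # ignore empty interim results
    transcript_parts.append(sentence)       # accumulate partial chunks
    if result.speech_final:                 # endpointing says the utterance is done
        full_sentence = " ".join(transcript_parts)
        print(f"Human: {full_sentence}")
        transcript_parts.clear()            # reset for the next utterance
```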
The Looping Process
The conversational AI process doesn't stop at one command; it loops back around until an exit word is spoken. The loop includes:
- Audio input
- STT model transcription
- LLM processing
- TTS conversion
- Looping back to step 1 until the exit word is spoken (sketched in code below).
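In code, the skeleton of that loop might look like this, with get_transcription, llm_respond, and speak standing in as placeholders for the STT, LLM, and TTS steps:

```python
# Skeleton of the loop; get_transcription, llm_respond, and speak are placeholders
# wrapping the STT, LLM, and TTS steps described in this post.
EXIT_WORD = "goodbye"

def run_assistant():
    while True:
        text = get_transcription()     # 1-2: capture audio and transcribe it
        if EXIT_WORD in text.lower():
            break                      # stop when the exit word is spoken
        reply = llm_respond(text)      # 3: generate a response with the LLM
        speak(reply)                   # 4: convert the response back to audio
```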
Transcript Collector and LLM Overview
The Transcript Collector is a small tool that breaks long conversations down into individual sentences, which the rest of the pipeline can then work with. The LLM is the language model that takes each sentence and generates the conversational response we later convert to speech.
Transcript Collector Walkthrough
The Transcript Collector uses a chunk-based approach: the user's speech arrives as a series of smaller transcript chunks. Natural breaks in the conversation, such as pauses or sentence endings, determine when an utterance is finished. Once it is, the transcript collector combines the accumulated chunks into a complete transcript.
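A minimal sketch of such a collector, with illustrative class and method names, could be:

```python
class TranscriptCollector:
    """Accumulates transcript chunks until an utterance is complete."""

    def __init__(self):
        self.parts = []

    def add_part(self, part: str):
        # store each partial chunk as it arrives from the STT stream
        self.parts.append(part)

    def get_full_transcript(self) -> str:
        # join the accumulated chunks into one complete sentence
        return " ".join(self.parts)

    def reset(self):
        # clear the buffer once the full utterance has been handled
        self.parts = []
```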
LLM Walkthrough
The LLM is served by Groq, a company that specializes in custom-designed chips for inference. In this project it handles the natural language processing step, turning each transcribed sentence into a conversational response. Groq can process text very quickly, generating up to 526 tokens per second.
API Walkthrough
The LLM API supports two modes: batch and streaming. Batch processes a whole prompt in a single request and returns the complete response at once, while streaming returns tokens as they are generated, which is what we want for a real-time conversation.
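Here's a hedged sketch of both modes using Groq's OpenAI-compatible Python client; the model name is illustrative:

```python
from groq import Groq  # assumes the groq Python package

client = Groq(api_key="YOUR_GROQ_API_KEY")
messages = [{"role": "user", "content": "Say hello in one short sentence."}]

# Batch: one request, the whole response comes back at once.
batch = client.chat.completions.create(model="llama3-8b-8192", messages=messages)
print(batch.choices[0].message.content)

# Streaming: tokens arrive as they are generated.
stream = client.chat.completions.create(model="llama3-8b-8192", messages=messages, stream=True)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```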
Text-to-Speech Conversion with Deepgram's Aura Streaming
Deepgram's Aura is a new streaming model that converts text to speech. It's trained on a large dataset of audio recordings, which allows it to generate high-quality speech, and here it's what turns the LLM's response back into audio the user can listen to.
Deepgram Streaming
To reduce time to first data (TTFD), we send the text in chunks, one at a time. We measure the gap between sending a request and receiving the first chunk of audio back (time to first byte, TTFB), and we start playing audio as soon as that first chunk arrives. Deepgram's models are quick, generating one second of audio in less than a second.
Streaming Request
To make a streaming request, we send a POST request to the Deepgram text-to-speech URL with stream set to true, then read the response chunk by chunk using iter_content, processing each chunk of audio as it's received.
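Putting that together, a minimal sketch of the streaming request might look like the following; the URL and model name follow Deepgram's speak endpoint but should be checked against the current docs, and handle_audio_chunk is the playback helper sketched in the next section:

```python
import time
import requests

DEEPGRAM_URL = "https://api.deepgram.com/v1/speak?model=aura-asteria-en"  # check current docs
headers = {
    "Authorization": "Token YOUR_DEEPGRAM_API_KEY",
    "Content-Type": "application/json",
}
payload = {"text": "Hello there! Streaming keeps the time to first byte low."}

start = time.time()
with requests.post(DEEPGRAM_URL, headers=headers, json=payload, stream=True) as response:
    first_chunk = True
    for chunk in response.iter_content(chunk_size=1024):
        if not chunk:
            continue
        if first_chunk:
            print(f"TTFB: {time.time() - start:.3f}s")  # time to first byte of audio
            first_chunk = False
        handle_audio_chunk(chunk)  # hand the audio to the player (sketched below)
```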
Processing Chunks
We write each chunk of data to FFmpeg, which plays the audio as it arrives. We also measure performance as we go; the returns from streaming are better than linear, since playback starts long before the full response has been generated.
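A minimal sketch of that playback step, assuming ffplay (part of FFmpeg) is installed and can auto-detect the audio format arriving on stdin:

```python
import subprocess

# Launch ffplay reading audio from stdin, with no video window.
player = subprocess.Popen(
    ["ffplay", "-autoexit", "-nodisp", "-"],
    stdin=subprocess.PIPE,
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL,
)

def handle_audio_chunk(chunk: bytes):
    # write each chunk to ffplay as soon as it comes back from Deepgram
    player.stdin.write(chunk)
```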
Conversation Manager
We create a language model processor to keep track of chat messages and add a little memory to the conversation via LangChain. An exit word ("goodbye") ends the program. On each turn, we take the transcription response, hand it to the LLM, pass the LLM's response to the text-to-speech model, and reset the transcription response after processing.
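Here's a rough sketch of that manager; a plain message list stands in for LangChain's conversation memory, and llm_respond and speak are placeholders for the Groq and Aura calls described above:

```python
EXIT_WORD = "goodbye"

class ConversationManager:
    def __init__(self):
        self.history = []                  # stand-in for LangChain conversation memory
        self.transcription_response = ""

    def handle_utterance(self, text: str) -> bool:
        """Process one transcribed utterance; return False when the exit word is heard."""
        if EXIT_WORD in text.lower():
            return False                   # "goodbye" ends the program
        self.history.append({"role": "user", "content": text})
        reply = llm_respond(self.history)  # placeholder for the Groq LLM call
        self.history.append({"role": "assistant", "content": reply})
        speak(reply)                       # placeholder for the Aura TTS call
        self.transcription_response = ""   # reset after processing
        return True
```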
Running a Language Model Forever
To run the assistant indefinitely, we use a loop that continues until the user stops it, chatting with the user and responding to their queries. End to end, the latency is around 1-2 seconds, which feels relatively slow in a live conversation.
Language Model Latency
That figure includes network latency, which can be affected by Wi-Fi speed, so Deepgram's actual processing time is likely faster than the reported number. The perceived latency can also be masked by inserting filler words or silence to buy more time.
Interrupting the AI
Interrupting the AI is difficult and requires more complex code. It's a software engineering problem rather than an AI problem.
Streaming Speech into the Model
Instead of waiting for a full utterance, the user's speech could be streamed into the model as they speak. A model trained to predict speech in real time could then anticipate the rest of the user's sentence and start forming a response early.
Cost of Tokens and Intelligence
The cost of tokens, and of intelligence more broadly, is trending toward zero, which makes this kind of speculative computation affordable. Implementing real-time speech prediction could be a valuable feature in conversational AI applications.
Conclusion
The language model's latency can be improved by optimizing the pipeline or masked with filler words. Interrupting the AI is a software engineering problem rather than an AI problem, and streaming speech into the model as the user talks could be a valuable next step. By combining the fastest LLM with the fastest TTS model, we can create a conversational AI that chats with users in near real-time.