Let's build a speech-to-speech chatbot
Apply speech-to-text (STT), LLMs and text-to-speech (TTS) to replicate ChatGPT's voice assistant feature.
What exactly is speech-to-speech when it comes to the GenAI world?
It’s the closest we’ve gotten to a “human-like” interaction. It’s what we tried with Google Home or Alexa a few years back, but those assistants mostly ended up as voice-operated alarm schedulers.
Nowadays, speech-to-speech is an audio-to-audio interface for LLMs. Like ChatGPT’s voice assistant, instead of texting questions and reading the answer, you directly interact with the chatbot through voice.
Surprisingly, limited information is available on how to create such an application, and comprehensive guides are difficult to find.
Thus, in this article, we’ll guide you through building a simple speech-to-speech application, where you will learn how to:
combine speech-to-text (STT), LLM, and text-to-speech (TTS) models
expose the speech-to-speech feature as a real-time endpoint
use WebSockets to handle real-time data transfer efficiently
Understanding the Core Components of a Speech-to-Speech Application
The backbone of a speech-to-speech application consists of three essential components:
1. Speech-to-Text (STT): This component transcribes spoken words into written text, allowing the application to understand user input. Popular solutions include OpenAI’s Whisper model and Amazon Transcribe. These tools convert audio into accurate text transcripts.
2. Large Language Model (LLM): Serving as the brain of the application, an LLM generates meaningful responses based on the transcribed text. When choosing an LLM, consider one that can interact with external tools or APIs, as this capability enhances the application’s functionality by enabling more complex operations like retrieving real-time data or performing calculations.
3. Text-to-Speech (TTS): This engine transforms the LLM’s textual responses back into spoken words, completing the conversational loop. Options include Amazon Polly and various open-source models. A good TTS engine produces natural-sounding speech, making interactions more engaging and accessible.
Note: You’re not limited to paid versions of these components. Open-source alternatives, such as models from Hugging Face, offer robust STT, LLM, and TTS capabilities. Utilizing open-source tools can reduce costs and provide greater customization for your application.
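For example, here is a minimal sketch of what a local, open-source STT step could look like with the Hugging Face transformers pipeline. This is an illustration only, not part of the project we build below; the model checkpoint and audio file are placeholders, and you’ll need torch and ffmpeg installed.
# Illustrative only: a local, open-source alternative to a hosted STT API.
# Assumes: pip install transformers torch, plus ffmpeg for audio decoding.
from transformers import pipeline

asr = pipeline(
    task="automatic-speech-recognition",
    model="openai/whisper-small",  # placeholder checkpoint; pick any Whisper size you can run
)

result = asr("sample_audio.m4a")  # placeholder file path
print(result["text"])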
Hands-On
Now that we have a clear picture of what we want to build, let’s get right into coding. In this example, I’ll be using uv to manage Python dependencies, but you can use your environment manager of choice (Poetry, Conda, virtual environments, etc.).
Note: For detailed steps on how to scaffold and configure your project with uv, please consult the GitHub repository’s README. This will give you all the commands and instructions for setting up the environment described in this tutorial.
Once your environment is ready, we’ll place our modules in the src/app folder and begin building the main application components.
Component Providers
You’re free to choose any providers you prefer for the three main components of this application. In this tutorial, I will use the following:
STT: whisper-large-v3-turbo from Groq
LLM: llama-3.3-70b-versatile from Groq
TTS: tts-1 from OpenAI
Note: These are examples, and you can replace them with models you have access to or prefer. Just ensure you have the necessary API keys or resources for these models.
Environment Variables
To communicate with the providers mentioned, we’ll need API keys. Additionally, we’ll need an API key to interact with the Weatherstack API. You can find instructions for obtaining each key on the Groq, OpenAI, and Weatherstack websites.
Once you’ve collected the keys, we can create the .env file. It should look something like this:
GROQ_API_KEY=<your-key-here>
OPENAI_API_KEY=<your-key-here>
WEATHERSTACK_API_KEY=<your-key-here>
Note: Don’t forget to add your .env file to your .gitignore; you don't want your credentials to be available in a public repository.
Now, we’ll use pydantic-settings to access and validate our API keys. In our src/app folder, we'll create a new Python file called settings.py:
# src/app/settings.py
import os
from functools import lru_cache

from pydantic_settings import BaseSettings


class Settings(BaseSettings):
    groq_api_key: str = os.environ["GROQ_API_KEY"]
    openai_api_key: str = os.environ["OPENAI_API_KEY"]
    weatherstack_api_key: str = os.environ["WEATHERSTACK_API_KEY"]


@lru_cache
def get_settings() -> Settings:
    return Settings()
We can now access our keys programmatically and securely. For example:
# Usage example
settings = get_settings()
print(settings.openai_api_key)
Building the Speech-to-Text (STT) Component
Let’s start by building the Speech-to-Text (STT) component of our application.
If you refer to the Groq documentation on implementing speech-to-text functionality, you’ll find an example like this:
import os

from groq import Groq

# Initialize the Groq client
client = Groq()

# Specify the path to the audio file
filename = os.path.dirname(__file__) + "/sample_audio.m4a"  # Replace with your audio file!

# Open the audio file
with open(filename, "rb") as file:
    # Create a transcription of the audio file
    transcription = client.audio.transcriptions.create(
        file=(filename, file.read()),  # Required audio file
        model="whisper-large-v3-turbo",  # Required model to use for transcription
        prompt="Specify context or spelling",  # Optional
        response_format="json",  # Optional
        language="en",  # Optional
        temperature=0.0,  # Optional
    )
    # Print the transcription text
    print(transcription.text)
Looking closely, you’ll notice that we simply need an audio file to send to the Groq client with the desired parameters, and we’ll receive the transcription in return. However, since we don’t want to store all the user inputs on disk, which can be inefficient and pose security concerns, we can utilize BytesIO to handle the audio data in memory, making the process more efficient:
# src/app/stt.py
from io import BytesIO

from groq import AsyncGroq


async def transcribe_audio_data(
    audio_data: bytes,
    api_client: AsyncGroq,
    model_name: str = "whisper-large-v3-turbo",
    temperature: float = 0.0,
    language: str = "en",
) -> str:
    with BytesIO(initial_bytes=audio_data) as audio_stream:
        audio_stream.name = "audio.wav"
        response = await api_client.audio.transcriptions.create(
            model=model_name,
            file=audio_stream,
            temperature=temperature,
            language=language,
        )
    text = response.text.strip()
    return text
In this code snippet, we collect the audio bytes from the user, assign a filename to mimic an on-disk file, and send it directly to the Groq API to obtain the transcription.
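If you want to sanity-check this function on its own before wiring it into the API, a quick standalone script could look like the sketch below. The audio file name is a placeholder for any short recording you have on disk.
# Standalone sketch for trying out transcribe_audio_data (not part of the final app)
import asyncio
from pathlib import Path

from groq import AsyncGroq

from app.settings import get_settings
from app.stt import transcribe_audio_data


async def main() -> None:
    client = AsyncGroq(api_key=get_settings().groq_api_key)
    audio_bytes = Path("sample_audio.wav").read_bytes()  # placeholder recording
    print(await transcribe_audio_data(audio_data=audio_bytes, api_client=client))


asyncio.run(main())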
Building the Large Language Model (LLM) Component
Now, let’s build the Large Language Model (LLM) component of our application.
For this component, I will use PydanticAI because it’s a lightweight framework that automates common operations when interacting with LLMs, such as handling messages, tools, and structured outputs. However, feel free to use your favorite framework or even pure Python to interact directly with the API clients. If you have the time, I highly recommend the latter approach, as it will give you a deeper understanding of LLM APIs.
Setting Up PydanticAI
When using PydanticAI, we need to set up the following components:
Model (Required): Specify the LLM you plan to use. In our case, we’ll use llama-3.3-70b-versatile.
Dependencies (Optional): Think of dependencies as a way to inject external resources into the agent, similar to how FastAPI injects dependencies into route functions. For example, passing an HTTP client to the agent allows it to make API requests.
System Prompt (Optional): This is the initial prompt that sets the behavior of the LLM, such as the classic “You are a helpful assistant.”
Tools (Optional): Tools extend the capabilities of the LLM, allowing it to perform specific tasks or access external data.
Creating the Agent
We’ll create a Python module that brings all these components together:
# src/app/llm.py
from dataclasses import dataclass
from typing import Literal

import aiohttp
from groq import AsyncGroq
from pydantic_ai import Agent, RunContext, Tool
from pydantic_ai.models.groq import GroqModel

from app.settings import Settings

type AvailableCities = Literal["Paris", "Madrid", "London"]


# 1. Dependencies: We can define a dataclass to hold the dependencies that our agent can use in its tools or prompts.
@dataclass
class Dependencies:
    settings: Settings
    session: aiohttp.ClientSession


# 2. Tool: We can define functions that can be used by the agent to perform specific tasks.
async def get_weather(ctx: RunContext[Dependencies], city: AvailableCities) -> str:
    url = "http://api.weatherstack.com/current"
    params = {
        "access_key": ctx.deps.settings.weatherstack_api_key,
        "query": city,
    }
    async with ctx.deps.session.get(url=url, params=params) as response:
        data = await response.json()

    observation_time = data.get("current").get("observation_time")
    temperature = data.get("current").get("temperature")
    weather_descriptions = data.get("current").get("weather_descriptions")
    return f"At {observation_time}, the temperature in {city} is {temperature}°C. The weather is {weather_descriptions[0].lower()}"


# 3. Model
def create_groq_model(
    groq_client: AsyncGroq,
) -> GroqModel:
    return GroqModel(
        model_name="llama-3.3-70b-versatile",
        groq_client=groq_client,
    )


# 4. Agent
def create_groq_agent(
    groq_model: GroqModel,
    tools: list[Tool[Dependencies]],
    system_prompt: str,  # 5. System Prompt
) -> Agent[Dependencies]:
    return Agent(
        model=groq_model,
        deps_type=Dependencies,
        system_prompt=system_prompt,
        tools=tools,
    )
Some highlights:
Data Classes and Dependencies: We define a data class that specifies the objects stored in the agent’s context. In this case, we want the agent to have access to an HTTP client for making requests and an instance of the Settings class that contains our credentials. These dependencies are not provided to the LLM in the prompts but are used externally to support the agent's functionality.
Tool Definition: In the tool definition, we expect the model to retrieve accurate information based on the city provided. The context is passed using the dependencies. In this example, we specify that the model can only fetch information for specific cities. This is just a sample implementation; you are free to code the tool function as you like.
Model Initialization: When creating the model, we can initialize it directly with the API key instead of the client (PydanticAI handles the creation of the client internally). However, providing a client instance can avoid multiple initializations, making your code more efficient.
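To see how these pieces fit together before we wrap them in an API, here is a rough standalone sketch of running the agent with its dependencies. The prompt is arbitrary, and depending on your PydanticAI version the result attribute may be named .output instead of .data.
# Standalone sketch: build the agent and ask it something (not part of the final app)
import asyncio

import aiohttp
from groq import AsyncGroq
from pydantic_ai import Tool

from app.llm import Dependencies, create_groq_agent, create_groq_model, get_weather
from app.settings import get_settings


async def main() -> None:
    settings = get_settings()
    async with aiohttp.ClientSession() as session:
        model = create_groq_model(groq_client=AsyncGroq(api_key=settings.groq_api_key))
        agent = create_groq_agent(
            groq_model=model,
            tools=[Tool(function=get_weather, takes_ctx=True)],
            system_prompt="You are a helpful assistant.",
        )
        result = await agent.run(
            "What's the weather like in Paris?",  # arbitrary test prompt
            deps=Dependencies(settings=settings, session=session),
        )
        print(result.data)  # may be result.output in newer PydanticAI releases


asyncio.run(main())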
If any of this is unclear, I highly recommend checking out the PydanticAI documentation. It’s a valuable resource that can enhance your understanding and help you make the most of this framework in your applications.
Building the Text-to-Speech (TTS) Component
Finally, we will build what is probably the most challenging component: the Text-to-Speech (TTS) system.
If we aim for real-time interactions, we need to produce audio as soon as the LLM (Large Language Model) generates tokens. Otherwise, we’ll have to wait for the full chat completion before generating the audio. There’s nothing inherently wrong with the latter approach, but the user experience will be “near” real-time, meaning users will need to wait longer before hearing the response.
Note: OpenAI offers a Realtime API that integrates both the LLM and TTS components of our application. You could use it to save some coding effort; however, only the gpt-4o and gpt-4o-mini models are available.
If we were willing to wait for the full generation, we could use something like this:
from openai import OpenAI
client = OpenAI()
response = client.audio.speech.create(
model="tts-1",
voice="alloy",
input="Hello world! This is a streaming test.",
)
response.stream_to_file("output.mp3")
However, since we want to stream audio as soon as we receive tokens from the LLM, we need to take a different approach. We have to be very careful here because the streaming of tokens and the streaming of audio are interrelated, and we might end up with tightly coupled code. We want to avoid this, and one way to achieve that is by using a context manager.
# src/app/tts.py
from collections.abc import AsyncIterator
from types import TracebackType
from typing import Literal

from openai import AsyncOpenAI

# Literal aliases defined here for convenience; adjust them to the voices/formats you use
Voice = Literal["alloy", "echo", "fable", "onyx", "nova", "shimmer"]
ResponseFormat = Literal["mp3", "opus", "aac", "flac", "wav", "pcm"]


class TextToSpeech:
    def __init__(
        self,
        client: AsyncOpenAI,
        model_name: str,
        voice: Voice = "echo",
        response_format: ResponseFormat = "aac",
        speed: float = 1.00,
        buffer_size: int = 64,
        sentence_endings: tuple[str, ...] = (".", "?", "!", ";", ":", "\n"),
        chunk_size: int = 1024 * 5,
    ) -> None:
        self.client = client
        self.model_name = model_name
        self.voice: Voice = voice
        self.response_format: ResponseFormat = response_format
        self.speed = speed
        self.buffer_size = buffer_size
        self.sentence_endings = sentence_endings
        self.chunk_size = chunk_size
        self._buffer = ""

    async def __aenter__(self) -> "TextToSpeech":
        """
        Enters the asynchronous context manager.

        Returns:
            The TextToSpeech instance.
        """
        return self

    async def feed(self, text: str) -> AsyncIterator[bytes]:
        """
        Feeds text into the buffer and yields audio bytes if the buffer reaches the buffer size
        or ends with a sentence-ending character.

        Args:
            text: The text to add to the buffer.

        Yields:
            Audio bytes generated from the buffered text.
        """
        self._buffer += text
        if len(self._buffer) >= self.buffer_size or any(
            self._buffer.endswith(se) for se in self.sentence_endings
        ):
            async for chunk in self.flush():
                yield chunk

    async def flush(self) -> AsyncIterator[bytes]:
        """
        Flushes the buffered text and yields the resulting audio bytes.

        Yields:
            Audio bytes generated from the buffered text.
        """
        if self._buffer:
            async for chunk in self._send_audio(self._buffer):
                yield chunk
            self._buffer = ""

    async def _send_audio(self, text: str) -> AsyncIterator[bytes]:
        """
        Sends text to the TTS API and yields audio chunks.

        Args:
            text: The text to convert to speech.

        Yields:
            Chunks of audio bytes generated from the input text.
        """
        async with self.client.audio.speech.with_streaming_response.create(
            model=self.model_name,
            input=text,
            voice=self.voice,
            response_format=self.response_format,
            speed=self.speed,
        ) as audio_stream:
            async for audio_chunk in audio_stream.iter_bytes(chunk_size=self.chunk_size):
                yield audio_chunk

    async def __aexit__(
        self,
        exc_type: type[BaseException] | None,
        exc_value: BaseException | None,
        exc_tb: TracebackType | None,
    ) -> None:
        """
        Exits the asynchronous context manager.
        """
        pass
The idea is to create a buffer that stores the tokens generated by the LLM until a certain point — either when the buffer is full or it ends with a specific character. Once we’ve accumulated enough tokens, the buffer is flushed and sent to the OpenAI Text-to-Speech client to produce audio bytes.
The great thing about this approach is that we haven’t coupled the text generation streaming process with the audio generation streaming process. As you can see, the TextToSpeech component doesn't need to know anything about the LLM; it just expects some strings, regardless of their source.
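As a quick illustration, here is a rough sketch of driving the class with hard-coded text fragments, exactly as the LLM tokens will drive it later. The fragments are placeholders.
# Standalone sketch: feed arbitrary text fragments into TextToSpeech (not part of the final app)
import asyncio

from openai import AsyncOpenAI

from app.settings import get_settings
from app.tts import TextToSpeech


async def main() -> None:
    client = AsyncOpenAI(api_key=get_settings().openai_api_key)
    audio = bytearray()
    async with TextToSpeech(client=client, model_name="tts-1") as tts:
        for fragment in ["Hello there! ", "This text could come ", "from any source."]:
            async for chunk in tts.feed(text=fragment):  # yields only when the buffer is ready
                audio.extend(chunk)
        async for chunk in tts.flush():  # don't forget whatever is left in the buffer
            audio.extend(chunk)
    print(f"Received {len(audio)} bytes of AAC audio")


asyncio.run(main())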
Bringing It All Together: Building the API and Testing the Application
Now that we have the code for our three components, our src/app folder should look something like this:
src/app/
├── __init__.py
├── llm.py
├── settings.py
├── stt.py
└── tts.py
With the components in place, we need to create our FastAPI-powered application.
Application Lifespan
Recall our earlier discussion about reusing the Groq client across the application. We’re going to extend that approach by reusing all the clients in our application. This strategy allows us to manage our resources efficiently and avoid creating multiple unnecessary connections.
FastAPI’s lifespan events enable us to execute certain operations at startup and shutdown, making them the ideal place to establish connections to our clients and close them appropriately. Here’s how we can implement this:
# src/app/lifespan.py
from contextlib import asynccontextmanager
from typing import AsyncIterator, TypedDict

import aiohttp
from fastapi import FastAPI
from groq import AsyncGroq
from loguru import logger
from openai import AsyncOpenAI
from pydantic_ai import Agent, Tool
from pydantic_ai.models.groq import GroqModel

from app.llm import Dependencies, create_groq_agent, get_weather
from app.settings import Settings, get_settings


def create_aiohttp_session() -> aiohttp.ClientSession:
    return aiohttp.ClientSession()


def create_groq_client(
    settings: Settings,
) -> AsyncGroq:
    return AsyncGroq(api_key=settings.groq_api_key)


def create_openai_client(
    settings: Settings,
) -> AsyncOpenAI:
    return AsyncOpenAI(api_key=settings.openai_api_key)


def create_groq_model(
    groq_client: AsyncGroq,
) -> GroqModel:
    return GroqModel(
        model_name="llama-3.3-70b-versatile",
        groq_client=groq_client,
    )


class State(TypedDict):
    aiohttp_session: aiohttp.ClientSession
    groq_client: AsyncGroq
    openai_client: AsyncOpenAI
    groq_agent: Agent[Dependencies]


@asynccontextmanager
async def app_lifespan(app: FastAPI) -> AsyncIterator[State]:
    settings = get_settings()
    aiohttp_session = create_aiohttp_session()
    openai_client = create_openai_client(settings=settings)
    groq_client = create_groq_client(settings=settings)
    _groq_model = create_groq_model(groq_client=groq_client)
    groq_agent = create_groq_agent(
        groq_model=_groq_model,
        tools=[Tool(function=get_weather, takes_ctx=True)],
        system_prompt=(
            "You are a helpful assistant. "
            "You interact with the user in a natural way. "
            "You should use `get_weather` ONLY to provide weather information."
        ),
    )

    yield {
        "aiohttp_session": aiohttp_session,
        "openai_client": openai_client,
        "groq_client": groq_client,
        "groq_agent": groq_agent,
    }

    logger.info("Closing aiohttp session")
    await aiohttp_session.close()
    logger.info("Closing OpenAI client")
    await openai_client.close()
    logger.info("Closing Groq client")
    await groq_client.close()
In this setup, everything inside the app_lifespan function before the yield executes when we start the application. Conversely, everything after the yield runs when the application is shutting down. This approach allows us to reuse all the artifacts of our application, including settings and the agent, throughout its lifecycle. Additionally, by storing these artifacts in the lifespan state, we can access them anywhere within the app.
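As a quick illustration (this route is hypothetical and not part of the project), any endpoint in server.py can read those shared objects straight from the request or WebSocket state:
# Hypothetical example: reading the lifespan state from a regular HTTP route
from fastapi import Request


@app.get("/health")
async def health(request: Request) -> dict[str, str]:
    # The objects yielded by app_lifespan are exposed on request.state
    groq_client = request.state.groq_client  # reuse the shared client instead of creating a new one
    return {"status": "ok", "groq_client": type(groq_client).__name__}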
FastAPI and WebSocket
Next, we need to build the FastAPI application and the WebSocket to interact with the components we’ve created. We’ll put all this logic in a file called server.py at the root of the project.
# server.py
from fastapi import Depends, FastAPI, WebSocket
from groq import AsyncGroq
from pydantic_ai import Agent

from app.lifespan import app_lifespan as lifespan
from app.llm import Dependencies
from app.settings import get_settings
from app.stt import transcribe_audio_data
from app.tts import TextToSpeech

app = FastAPI(title="Voice to Voice Demo", lifespan=lifespan)  # 1. Lifespan


# ------------- 2. FastAPI Dependency Injection -------------
async def get_agent_dependencies(websocket: WebSocket) -> Dependencies:
    return Dependencies(
        settings=get_settings(),
        session=websocket.state.aiohttp_session,
    )


async def get_groq_client(websocket: WebSocket) -> AsyncGroq:
    return websocket.state.groq_client


async def get_agent(websocket: WebSocket) -> Agent:
    return websocket.state.groq_agent


async def get_tts_handler(websocket: WebSocket) -> TextToSpeech:
    return TextToSpeech(
        client=websocket.state.openai_client,
        model_name="tts-1",
        response_format="aac",
    )


# ------------- 3. WebSocket -------------
@app.websocket("/voice_stream")
async def voice_to_voice(
    websocket: WebSocket,
    groq_client: AsyncGroq = Depends(get_groq_client),
    agent: Agent[Dependencies] = Depends(get_agent),
    agent_deps: Dependencies = Depends(get_agent_dependencies),
    tts_handler: TextToSpeech = Depends(get_tts_handler),
):
    """
    WebSocket endpoint for voice-to-voice communication.

    It receives audio bytes from the client, transcribes the audio to text,
    generates a response using the language model agent, converts the response
    text to speech, and streams the audio bytes back to the client.

    Args:
        websocket: WebSocket connection.
        groq_client: Groq API client for transcription (dependency).
        agent: Language model agent for generating responses (dependency).
        agent_deps: Dependencies for the agent (dependency).
        tts_handler: Text-to-Speech handler for converting text to audio (dependency).
    """
    await websocket.accept()

    async for incoming_audio_bytes in websocket.iter_bytes():
        # Step 1: Transcribe the incoming audio
        transcription = await transcribe_audio_data(
            audio_data=incoming_audio_bytes,
            api_client=groq_client,
        )

        # Step 2: Generate the agent's response
        generation = ""
        async with tts_handler:
            async with agent.run_stream(
                user_prompt=transcription,
                deps=agent_deps,
            ) as result:
                async for message in result.stream_text(delta=True):
                    generation += message
                    # Stream the audio back to the client
                    async for audio_chunk in tts_handler.feed(text=message):
                        await websocket.send_bytes(data=audio_chunk)

            # Flush any remaining audio chunks
            async for audio_chunk in tts_handler.flush():
                await websocket.send_bytes(data=audio_chunk)
The code above accomplishes the following:
Creating the FastAPI Application: We initialize the FastAPI app and pass in the lifespan function we created.
Defining Dependency Injection Functions: These functions allow us to access the objects stored in the lifespan state within the WebSocket.
Implementing the WebSocket Endpoint: This WebSocket orchestrates all the components and performs the following steps:
Receiving Audio Bytes: It receives audio bytes from the client.
Transcribing Audio to Text: It uses the STT component to transcribe the audio.
Generating a Response: It generates a response using the LLM Agent.
Converting Text to Speech: It converts the response text to speech using the TTS component.
Streaming Audio Back: It streams the audio bytes back to the client.
Testing the Application
To test our WebSocket, we can build a simple user interface (UI) using HTML and JavaScript. I’m not a JavaScript developer, so I used ChatGPT to generate simple code for the UI (the code is available in the GitHub repository). In real-world scenarios, this part might be handled by another team. I stored the code in an HTML file (sample_ui.html) at the root folder of the project.
Note: You can also test the WebSocket using Python, but it may require additional setup with PyAudio.
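If you'd rather avoid PyAudio altogether, a rough sketch of a Python test client using the websockets package can send a pre-recorded file and save the reply to disk. The file names and timeout below are placeholders.
# test_ws_client.py -- rough sketch of a Python test client (no microphone or PyAudio needed)
# Assumes the server is running locally and you have a short recording on disk.
import asyncio
from pathlib import Path

import websockets  # pip install websockets


async def main() -> None:
    async with websockets.connect("ws://localhost:8000/voice_stream") as ws:
        await ws.send(Path("question.wav").read_bytes())  # send one audio message
        audio = bytearray()
        try:
            while True:  # collect audio chunks until the server goes quiet
                audio.extend(await asyncio.wait_for(ws.recv(), timeout=5.0))
        except asyncio.TimeoutError:
            pass
    Path("answer.aac").write_bytes(audio)  # play it back with any media player


asyncio.run(main())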
To serve the HTML file, we can add a new endpoint to our FastAPI application that reads from the file and generates a simple UI:
# server.py
from pathlib import Path

from fastapi import Depends, FastAPI, WebSocket
from fastapi.responses import HTMLResponse
from groq import AsyncGroq
from pydantic_ai import Agent

from app.lifespan import app_lifespan as lifespan
from app.llm import Dependencies
from app.settings import get_settings
from app.stt import transcribe_audio_data
from app.tts import TextToSpeech

app = FastAPI(title="Voice to Voice Demo", lifespan=lifespan)  # 1. Lifespan


@app.get("/")
async def get():
    with Path("sample_ui.html").open("r") as file:
        return HTMLResponse(file.read())  # This will render the HTML
...
With this setup, we’re ready to serve our application. As a final check, your project folder structure should look like this:
├── README.md
├── pyproject.toml
├── sample_ui.html
├── server.py
├── src
│ └── app
│ ├── __init__.py
│ ├── lifespan.py
│ ├── llm.py
│ ├── settings.py
│ ├── stt.py
│ └── tts.py
└── uv.lock
Finally, we need to launch the application and interact with it:
uv run uvicorn server:app \
--host 0.0.0.0 \
--port 8000 \
--reload
If everything is working correctly, you’ll see the UI at http://localhost:8000.
Wrap-Up
Congratulations! You now have your own speech-to-speech application. However, this is a very lightweight implementation. As you may have noticed, there are several areas where this application can be improved, including but not limited to:
Better Separation of Concerns: Organizing each Python module for clarity and maintainability.
Complete Docstrings: Adding comprehensive documentation to your code.
Database Integration: Implementing a database to store data and allow for multi-turn conversations.
Containerization: Using tools like Docker to containerize the application for easier deployment.
Unit Tests: Writing tests to ensure your code functions as expected.
Fine-Tuning Buffer Size: Adjusting the buffer size for more natural conversations.
Prompt Optimization: Refining prompts for better LLM responses.
Improved UI: Enhancing the user interface for a better user experience.
If you’re interested in a more robust version of this application, I recommend visiting my GitHub repository, where I’ve implemented the first four of these improvements.
This concludes our tutorial on building a simple speech-to-speech application. I hope this guide has been helpful, and I encourage you to explore and expand upon this foundation to create even more advanced applications.
About the Author
I’m Juan Ovalle, an AI Engineer and Data Scientist passionate about building production-grade applications. My current work focuses on Generative AI applications, and I love exploring new technologies that make our lives as developers easier.
You can find more about my projects on GitHub and LinkedIn.
Whenever you’re ready, there are 3 ways we can help you:
Perks: Exclusive discounts on our recommended learning resources
(live courses, self-paced courses, learning platforms and books).
The LLM Engineer’s Handbook: Our bestseller book on mastering the art of engineering Large Language Models (LLMs) systems from concept to production.
Free open-source courses: Master production AI with our end-to-end open-source courses, which reflect real-world AI projects, covering everything from system architecture to data collection and deployment.
Images
If not otherwise stated, all images are created by the author.