Reachy Mini goes fully local

After building your Reachy Mini, you'll install the conversation app and start talking to it. Until now, you had to send your audio to a server. But not anymore. Today we'll walk you through running the whole stack locally.

This stack is powered by speech-to-speech, our cascaded VAD → STT → LLM → TTS pipeline that exposes a Realtime API-compatible /v1/realtime WebSocket. Once you launch the backend, point the robot at it from the UI.

Cascades are the most flexible option in the open-source landscape today, and with the right pieces they're also the fastest. We'll recommend the components we like best, but the whole point of a cascade is that you can swap them. New models drop every week.
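To make the "swap any stage" idea concrete, here is a minimal toy sketch of a cascade in Python. It is not the speech-to-speech library's API: the `Cascade` class, its field names, and the stand-in lambda stages are all hypothetical, and real stages would stream audio rather than pass whole buffers.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional


@dataclass(frozen=True)
class Cascade:
    """Hypothetical four-stage voice cascade; each stage is a plain callable."""
    vad: Callable[[bytes], bool]   # does this audio contain speech?
    stt: Callable[[bytes], str]    # audio -> transcript
    llm: Callable[[str], str]      # transcript -> reply text
    tts: Callable[[str], bytes]    # reply text -> audio

    def turn(self, audio: bytes) -> Optional[bytes]:
        """Run one conversational turn, or return None if no speech was detected."""
        if not self.vad(audio):
            return None
        transcript = self.stt(audio)
        reply = self.llm(transcript)
        return self.tts(reply)


# Stand-in stages so the sketch runs without any model downloads.
pipeline = Cascade(
    vad=lambda audio: len(audio) > 0,
    stt=lambda audio: audio.decode("utf-8"),
    llm=lambda text: f"You said: {text}",
    tts=lambda text: text.encode("utf-8"),
)

# Swapping one stage leaves the other three untouched.
shouting = replace(pipeline, llm=str.upper)
```

Because each stage only depends on the type it receives and returns, replacing the LLM (or any other stage) does not require touching the rest of the loop, which is the property the cascade design relies on.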

TL;DR

Deploy a local speech backend for your Reachy Mini.

We use our speech-to-speech library, a cascade approach.

Recommended: llama.cpp with Gemma 4, Silero VAD, Parakeet-TDT 0.6B v3 STT, Qwen3-TTS.

Quick start

This blog walks you through running conversations with Reachy Mini fully locally. No cloud, no API keys, no data leaving your machine. Here's a video showing this live:

Locally serving the LLM

To serve the LLM, we'll use Hugging Face's llama.cpp. If you need to install it, the simplest way is brew install llama.cpp or winget install llama.cpp; for more help, check the docs. First, we'll run:

llama-server -hf ggml-org/gemma-4-E4B-it-GGUF -np 2 -c 65536 -fa on --swa-full

And done! The first launch downloads the model; subsequent launches are fast.

What do those flags do?

-hf ggml-org/gemma-4-E4B-it-GGUF — pulls the model straight from the Hub. First run downloads it, subsequent runs use the cache.
-np 2 — two parallel slots. Lets the server handle a second request (e.g. a quick interruption) without blocking on the first.
-c 65536 — 64k context window, shared across slots. Plenty of headroom for long conversations.
-fa on — flash attention. Faster and lower memory, basically free on modern hardware.
--swa-full — keeps the full sliding-window attention cache instead of recomputing it. Trades a bit of RAM for noticeably faster prompt processing on Gemma.
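Since the first launch spends a while downloading and loading the model, it can help to check that the server is actually ready before starting the voice loop. The sketch below polls llama-server's /health endpoint using only the standard library. The helper name `server_ready` and the 2-second timeout are illustrative choices, and the port assumes the default 8080 used elsewhere in this post.

```python
import json
import urllib.error
import urllib.request


def server_ready(base_url: str = "http://127.0.0.1:8080", timeout: float = 2.0) -> bool:
    """Return True if GET {base_url}/health answers 200 with {"status": "ok"}.

    Any connection error, timeout, non-200 status (for example while the
    model is still loading), or non-JSON body is treated as "not ready".
    """
    url = f"{base_url.rstrip('/')}/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = json.load(resp)
            return resp.status == 200 and body.get("status") == "ok"
    except (urllib.error.URLError, OSError, ValueError):
        return False
```

Calling `server_ready()` in a short retry loop before launching speech-to-speech avoids a confusing first turn where the LLM is not yet reachable.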

Setting up speech-to-speech

We'll begin by installing the library:

uv pip install speech-to-speech

Then, while we are serving the LLM in another terminal, we can simply run:

speech-to-speech --responses_api_base_url "http://127.0.0.1:8080" --responses_api_api_key "" --mode local

And you can start talking to the model through your terminal! The first launch needs to download Parakeet-TDT 0.6B v3 and Qwen3-TTS, but subsequent launches are fast.

Here's a video showing the local conversation mode:

Now that you've tried it in --mode local, you can run the command again without that option to serve speech-to-speech to the robot.

Connecting Reachy Mini to speech-to-speech

Once you have llama.cpp and speech-to-speech running, you can start the robot with the desktop app and launch the conversation app. In the conversation app's UI, choose the local mode by clicking "edit connection" in the HF backend. Here's a video showing how to do it:

And you're done. You can start talking to your robot. Every stage of the pipeline is a trade-off: there are faster TTS models with lower quality and slower STT models with higher quality. We optimized for multilingual use; you might want to optimize for a single language. The rest of the blog covers how to customize.

Going deeper

Why run your own Speech-to-Speech server?

Hosted realtime backends are convenient, but running your own engine unlocks three things:

Privacy. Audio never leaves your network; the entire pipeline runs on hardware you control.
No API costs. No per-minute or per-token fees.
Full control over the pipeline. Swap any piece (VAD, STT, LLM, TTS) whenever something better lands on the Hub 🤗.

The speech-to-speech repo gives you all of that in a single CLI. It boots a WebSocket server at /v1/realtime that speaks the same protocol Reachy Mini already knows how to talk to.

Our opinionated defaults: VAD, STT, TTS

A cascaded voice pipeline has four stages: VAD, STT, LLM, and TTS. For three of them, we pick solid defaults so you can focus on the LLM:

Stage	Choice	Why
VAD	Silero VAD v5	Tiny, accurate, runs on CPU. The de-facto default in the open-source voice-agent world.
STT	Parakeet-TDT 0.6B v3	Streaming-friendly, very fast, great quality on English.
TTS	Qwen3-TTS	Expressive, low-latency, multilingual, supports custom voices.

We are opinionated about these choices, but feel free to swap them out for your own if you have a preference.

Choosing your LLM

The LLM is the layer with the most impact on the latency and overall performance of the system. We support two options: run a model locally (llama.cpp, MLX, Transformers, vLLM), or use a server with a Responses API (OpenAI, Gemini, HF Inference Endpoints, llama.cpp, vLLM, etc.).

The Responses API: decouple the brain from the voice loop

The main bottleneck in the system is LLM inference latency. To address that, we support external inference engines exposed through the Responses API protocol.

The speech-to-speech engine therefore supports a second mode in which the LLM lives in a separate process, as long as that process speaks the Responses API protocol. You launch your model server in one terminal, you launch the voice loop in another terminal, and the two talk over HTTP.
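To see what "talking over HTTP" looks like, here is a rough sketch of a single non-streaming Responses API call made by hand. It assumes the server accepts POST requests at {base_url}/responses, following the OpenAI convention, and returns the usual `output` list of message items. The helper names `build_request`, `output_text`, and `ask` are made up for this example, and this is not how speech-to-speech itself issues requests (it streams).

```python
import json
import urllib.request


def build_request(base_url: str, model: str, user_text: str, api_key: str = ""):
    """Build a POST to {base_url}/responses with a minimal JSON payload."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/responses",
        data=json.dumps({"model": model, "input": user_text}).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def output_text(response: dict) -> str:
    """Concatenate every output_text part found in the message items."""
    parts = []
    for item in response.get("output", []):
        if item.get("type") != "message":
            continue  # skip reasoning items, tool calls, and so on
        for part in item.get("content", []):
            if part.get("type") == "output_text":
                parts.append(part.get("text", ""))
    return "".join(parts)


def ask(base_url: str, model: str, user_text: str, api_key: str = "") -> str:
    """Send one request to a running server and return the reply text."""
    request = build_request(base_url, model, user_text, api_key)
    with urllib.request.urlopen(request, timeout=60) as resp:
        return output_text(json.load(resp))
```

With a server running, something like `ask("http://127.0.0.1:8080/v1", "ggml-org/gemma-4-E4B-it-GGUF", "Say hi in five words.")` is a quick way to confirm the LLM side works before adding audio on top.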

Option 1: llama.cpp in one terminal, speech-to-speech in the other

Terminal 1: llama.cpp server:

llama-server -hf ggml-org/gemma-4-E4B-it-GGUF -np 2 -c 65536 -fa on --swa-full

Terminal 2: speech-to-speech client:

speech-to-speech \
  --mode realtime \
  --stt parakeet-tdt \
  --tts qwen3 \
  --llm_backend responses-api \
  --model_name "unsloth/Qwen3-4B-Instruct-2507-GGUF" \
  --responses_api_base_url "http://127.0.0.1:8080/v1"

Option 2: vLLM in one terminal, speech-to-speech in the other

Requires vLLM ≥ 0.21.0. Full support for the Responses API protocol, including tool-call streaming used by the speech-to-speech backend, landed in vLLM 0.21.0. Older versions will boot but trip up as soon as the assistant tries to call a tool.
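If you want to fail fast instead of discovering the problem on the first tool call, a small version check along these lines can run before launching the server. This is a sketch using only the standard library: the helper names are hypothetical, and the parser deliberately ignores pre-release and post-release suffixes, so treat it as a coarse guard rather than a full version comparison.

```python
import re
from importlib import metadata

MIN_VLLM = (0, 21, 0)  # minimum version stated above


def parse_version(text: str) -> tuple:
    """Extract the leading numeric components, e.g. '0.21.1.post1' -> (0, 21, 1)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    return tuple(int(group or 0) for group in match.groups())


def vllm_is_new_enough(installed=None) -> bool:
    """Compare the installed vLLM version (or a given string) against MIN_VLLM."""
    if installed is None:
        try:
            installed = metadata.version("vllm")
        except metadata.PackageNotFoundError:
            return False  # vLLM is not installed in this environment
    return parse_version(installed) >= MIN_VLLM
```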

When serving a model through vLLM for this pipeline, three flags are effectively required:

--enable-auto-tool-choice
--tool-call-parser <tool_parser_name>: picks the per-family parser that turns the model's raw output into structured tool calls (e.g. qwen3_coder for Qwen3 instruct models, llama3_json for Llama 3, hermes for Hermes-style models, ...).
--default-chat-template-kwargs '{"enable_thinking":false}': disables the <think> reasoning channel for models that support it. For harder agentic tasks you can flip this to true and let the model reason, but for a natural-feeling conversation we strongly recommend keeping it off: every thinking token is latency the user hears as silence before the robot starts speaking.
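A back-of-the-envelope calculation shows why thinking tokens hurt in a voice loop. The function below is only illustrative arithmetic: the token counts and decode speeds are made-up example numbers, not measurements from this setup, and it ignores prompt processing and TTS start-up time.

```python
def silence_before_speech(thinking_tokens: int, tokens_per_second: float) -> float:
    """Seconds spent generating hidden reasoning before the first spoken token."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return thinking_tokens / tokens_per_second


# Illustrative only: 200 reasoning tokens at 50 tokens/s is 4 seconds of silence.
example_delay = silence_before_speech(200, 50)
```

Even a modest reasoning trace adds several seconds of dead air under these assumptions, which is far more noticeable in speech than in a chat window.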

Terminal 1: vLLM inference server (Qwen/Qwen3-4B-Instruct-2507):

vllm serve Qwen/Qwen3-4B-Instruct-2507 \
  --port 8000 \
  --host 127.0.0.1 \
  --max-model-len 32768 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --default-chat-template-kwargs '{"enable_thinking":false}' \
  --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":1}'

The --speculative-config line enables Multi-Token Prediction (MTP). It is optional, but it has a great impact on end-to-end latency. Leave it on whenever the model supports it.

Terminal 2: speech-to-speech client:

speech-to-speech \
  --mode realtime \
  --stt parakeet-tdt \
  --tts qwen3 \
  --llm_backend responses-api \
  --model_name "Qwen/Qwen3-4B-Instruct-2507" \
  --responses_api_base_url "http://127.0.0.1:8000/v1"

Option 3: Hugging Face Inference Endpoints

Same protocol, but the model runs on a managed GPU on Hugging Face. Deploy any chat model as an Inference Endpoint, then point the voice loop at the endpoint URL:

speech-to-speech \
  --mode realtime \
  --stt parakeet-tdt \
  --tts qwen3 \
  --llm_backend responses-api \
  --model_name "Qwen/Qwen3-4B-Instruct-2507" \
  --responses_api_base_url "https://<your-endpoint>.endpoints.huggingface.cloud/v1" \
  --responses_api_api_key "$HF_TOKEN"

Option 4: Hugging Face Inference Providers

If you don't want to manage your own endpoint, use an Inference Provider. Hugging Face routes your request to a third-party backend (e.g. Together, Fireworks, Replicate) with a single URL:

speech-to-speech \
  --mode realtime \
  --stt parakeet-tdt \
  --tts qwen3 \
  --llm_backend responses-api \
  --model_name "Qwen/Qwen3.6-35B-A3B:deepinfra" \
  --responses_api_base_url "https://router.huggingface.co/v1" \
  --responses_api_api_key "$HF_TOKEN"

Option 5: OpenAI (or any OpenAI-compatible provider)

When you want to test against a frontier model with zero infra, point the same flags at OpenAI:

speech-to-speech \
  --mode realtime \
  --stt parakeet-tdt \
  --tts qwen3 \
  --llm_backend responses-api \
  --model_name "gpt-5.4" \
  --responses_api_api_key "$OPENAI_API_KEY"

The --responses_api_* flags work the same for any provider that implements the protocol (OpenRouter, Together, Fireworks, …). Swap the base URL and the API key, keep the rest of the pipeline identical.
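One way to keep "swap the base URL and the API key" tidy is a small launcher that holds the per-provider settings in a table and builds the command from it. The sketch below only uses flags and URLs that appear in this post. The `PROVIDERS` table, its key names, and `build_command` are hypothetical conveniences, not part of speech-to-speech.

```python
import os
import shlex

# provider name -> (base URL, environment variable holding the API key or None)
PROVIDERS = {
    "llama.cpp": ("http://127.0.0.1:8080/v1", None),
    "vllm": ("http://127.0.0.1:8000/v1", None),
    "hf-router": ("https://router.huggingface.co/v1", "HF_TOKEN"),
}


def build_command(provider: str, model_name: str, env=os.environ) -> list:
    """Return the speech-to-speech argv for one provider; the rest stays identical."""
    base_url, key_var = PROVIDERS[provider]
    command = [
        "speech-to-speech",
        "--mode", "realtime",
        "--stt", "parakeet-tdt",
        "--tts", "qwen3",
        "--llm_backend", "responses-api",
        "--model_name", model_name,
        "--responses_api_base_url", base_url,
    ]
    if key_var is not None:
        # Raises KeyError if the token is missing, rather than sending an empty key.
        command += ["--responses_api_api_key", env[key_var]]
    return command
```

`shlex.join(build_command("vllm", "Qwen/Qwen3-4B-Instruct-2507"))` prints a copy-pasteable command line, and the same list can be handed to `subprocess.run` directly.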

Running the LLM in-process

Option 1: Local LLM on MLX (Apple Silicon)

If you are on a Mac, MLX is the lowest-friction way to run a real model with sane latency. We recommend Qwen3-4B-Instruct-2507, which is small enough to feel instant on M-series chips and capable enough to hold a conversation.

speech-to-speech \
  --llm_backend mlx-lm \
  --model_name "mlx-community/Qwen3-4B-Instruct-2507-bf16"

The server listens on ws://127.0.0.1:8765/v1/realtime by default. Leave it running, connect the conversation app to the local backend, and you are talking to your robot.

Option 2: Local LLM on Transformers (CUDA / CPU / MPS)

Same idea, but using vanilla transformers. Use this if you are on a CUDA box, on Linux, or if you want to swap models freely without re-converting weights for MLX.

speech-to-speech \
  --llm_backend transformers \
  --model_name "Qwen/Qwen3-4B-Instruct-2507"

Tip. Qwen3-4B-Instruct-2507 is another good option for the LLM because it gives a good speed/quality balance on a single consumer GPU. You can point --model_name at any HF model the backend supports, for example a larger Gemma, Qwen, or Mistral model.

Running the engine on your laptop, the app on the robot

If you are running the voice engine on your laptop and the conversation app on a Reachy Mini Wireless, the only thing that changes is the URL. Make sure the engine binds to an address reachable from the LAN (not just 127.0.0.1) and, on the robot side, enter the laptop's IP when you select the IP in the UI.

If you don't know your IP, here's how to find it:

macOS

ipconfig getifaddr en0
ipconfig getifaddr en1  # try this one if en0 prints nothing

Linux

hostname -I

Windows

ipconfig

Look for "IPv4 Address" under your active adapter.

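You want the 192.168.x.x or 10.x.x.x one (some networks use 172.16.x.x through 172.31.x.x instead). If you see 169.254.x.x, that is a self-assigned link-local address: the machine did not get an address from the router, so you're not actually on the network.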
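The same rule of thumb can be checked programmatically with Python's standard `ipaddress` module. In the sketch below, the `classify` and `realtime_url` helpers are hypothetical, and the URL shape simply reuses the default port 8765 and the /v1/realtime path mentioned earlier. Whether the UI wants a bare IP or a full URL depends on the app, so treat the URL as an illustration.

```python
import ipaddress


def classify(ip: str) -> str:
    """Label an address as 'loopback', 'link-local', 'lan', or 'public'."""
    addr = ipaddress.ip_address(ip)
    if addr.is_loopback:
        return "loopback"      # 127.x.x.x: only reachable from the same machine
    if addr.is_link_local:
        return "link-local"    # 169.254.x.x: self-assigned, no router lease
    if addr.is_private:
        return "lan"           # 10.x, 172.16-31.x, 192.168.x
    return "public"


def realtime_url(ip: str, port: int = 8765) -> str:
    """Build the WebSocket URL for the engine, refusing non-LAN addresses."""
    kind = classify(ip)
    if kind != "lan":
        raise ValueError(f"{ip} is a {kind} address; use the laptop's LAN address")
    return f"ws://{ip}:{port}/v1/realtime"
```

The loopback and link-local checks come first because `is_private` is also true for those ranges, and they are exactly the addresses the robot cannot use to reach the laptop.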

Wrap up

You now have a fully local voice loop:

A robot listening with Silero,
transcribing with Parakeet-TDT 0.6B v3,
thinking with whichever LLM you picked, whether that's local MLX, local Transformers, a vLLM or llama.cpp server next door, or a hosted Responses API endpoint,
and answering with Qwen3-TTS.

Star huggingface/speech-to-speech and pollen-robotics/reachy_mini_conversation_app, and come tell us in the discussions which open-source cascade you ended up running on your robot.
