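Building a Voice Assistant with MCP

What Is MCP

MCP (Model Context Protocol) is an open protocol proposed by Anthropic that standardizes how large language models interact with external tools and data sources. In short, it defines a specification that lets an LLM:

Discover the available tools and services
Understand each tool's input and output format
Call those tools during a conversation

OpenAI's Function Calling offers similar functionality, but MCP has the advantage of being an open standard that is not tied to a specific vendor.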
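To make discovery and invocation concrete, here is a minimal sketch of the JSON-RPC 2.0 requests that MCP clients send for them. The method names tools/list and tools/call come from the MCP specification; the helper functions and the example arguments are illustrative only and are not part of any SDK.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC request ids; any unique value works


def list_tools_request() -> dict:
    """Build a request asking the server which tools it offers."""
    return {"jsonrpc": "2.0", "id": next(_ids), "method": "tools/list"}


def call_tool_request(name: str, arguments: dict) -> dict:
    """Build a request that invokes one tool with JSON arguments."""
    return {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


if __name__ == "__main__":
    print(json.dumps(list_tools_request()))
    print(json.dumps(call_tool_request("get_weather", {"city": "Beijing"})))
```

In practice an MCP SDK builds and sends these messages for you; the sketch only shows what travels over the wire.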

Voice Assistant Architecture

The voice assistant I built, hachimi, uses a modular multi-process architecture:

┌─────────────────────────────────────────────────────────────┐
│                    Main control process                     │
│  ┌───────────────┐  ┌──────────────────┐  ┌──────────────┐  │
│  │ Voice wake-up │  │ Speech synthesis │  │ LLM dialogue │  │
│  │     (VAD)     │  │      (TTS)       │  │              │  │
│  └───────┬───────┘  └────────┬─────────┘  └──────┬───────┘  │
└──────────┼───────────────────┼───────────────────┼──────────┘
           │                   │                   │
           └───────────────────┴───────────────────┘
                               │
                      ┌────────┴─────────┐
                      │ MCP tool manager │
                      └────────┬─────────┘
                               │
           ┌───────────────────┼───────────────────┐
           │                   │                   │
   ┌───────┴────────┐  ┌───────┴────────┐  ┌───────┴─────────┐
   │ Weather lookup │  │ Device control │  │ File operations │
   └────────────────┘  └────────────────┘  └─────────────────┘

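Why Multiple Processes

A voice assistant has to handle several real-time tasks at once:

Continuous wake-word listening (CPU-intensive and latency-sensitive)
Streaming speech recognition (holds a long-lived connection to a cloud service)
LLM response generation (can take a long time)
TTS speech synthesis (may need to play audio while it is still being generated)

Running these tasks in separate processes lets them execute in parallel without blocking one another. Python's GIL allows only one thread at a time to execute Python bytecode, which limits how much CPU-bound work threads can do in parallel, so I chose a multi-process design.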
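As a rough illustration of this design, the sketch below chains two stand-in stages through multiprocessing queues. It is not hachimi's actual code: the worker names, the STOP sentinel, and the string "processing" are all assumptions made for the example.

```python
import multiprocessing as mp

STOP = None  # sentinel that tells a worker to shut down


def recognizer_worker(audio_q, text_q) -> None:
    """Stand-in for speech recognition: turns audio chunks into text."""
    while True:
        chunk = audio_q.get()
        if chunk is STOP:
            text_q.put(STOP)  # propagate shutdown downstream
            break
        text_q.put(f"text({chunk})")


def dialogue_worker(text_q, reply_q) -> None:
    """Stand-in for the LLM stage: turns recognized text into a reply."""
    while True:
        text = text_q.get()
        if text is STOP:
            reply_q.put(STOP)
            break
        reply_q.put(f"reply({text})")


def run_pipeline(chunks: list) -> list:
    """Run both stages in separate processes and collect the replies."""
    audio_q, text_q, reply_q = mp.Queue(), mp.Queue(), mp.Queue()
    workers = [
        mp.Process(target=recognizer_worker, args=(audio_q, text_q)),
        mp.Process(target=dialogue_worker, args=(text_q, reply_q)),
    ]
    for worker in workers:
        worker.start()
    for chunk in chunks:
        audio_q.put(chunk)
    audio_q.put(STOP)
    replies = []
    while (item := reply_q.get()) is not STOP:
        replies.append(item)
    for worker in workers:
        worker.join()
    return replies
```

Call run_pipeline(["chunk-1", "chunk-2"]) from under an if __name__ == "__main__": guard, which multiprocessing needs on platforms that spawn fresh interpreters. Because each stage blocks only on its own queue, a slow dialogue stage does not stall the recognizer.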

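Implementing MCP Tools

For the LLM to call external tools, the following steps are needed:

1. Define the tool descriptions

Each tool needs a description whose parameters are given as JSON Schema, so the LLM knows what the tool does and which arguments it takes: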

weather_tool = {
    "name": "get_weather",
    "description": "Get the current weather for the given city",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {
                "type": "string",
                "description": "City name, such as \"Beijing\" or \"Shanghai\""
            }
        },
        "required": ["city"]
    }
}

light_tool = {
    "name": "control_light",
    "description": "Control the lights in a room",
    "parameters": {
        "type": "object",
        "properties": {
            "room": {
                "type": "string",
                "enum": ["living room", "bedroom", "kitchen"]
            },
            "action": {
                "type": "string",
                "enum": ["on", "off"]
            }
        },
        "required": ["room", "action"]
    }
}
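Because the schema is machine-readable, the same description can also be used to check arguments before a tool runs. The following is a minimal sketch that covers only the keywords used above (required, type string, enum); the validate_arguments helper and the room names are illustrative, and a real project would more likely use a full JSON Schema validator.

```python
def validate_arguments(schema: dict, arguments: dict) -> list[str]:
    """Return a list of problems; an empty list means the call is valid."""
    errors = []
    properties = schema.get("properties", {})
    # Every required argument must be present.
    for name in schema.get("required", []):
        if name not in arguments:
            errors.append(f"missing required argument: {name}")
    # Check the type and allowed values of each argument the schema describes.
    for name, value in arguments.items():
        spec = properties.get(name)
        if spec is None:
            continue  # JSON Schema allows extra properties by default
        if spec.get("type") == "string" and not isinstance(value, str):
            errors.append(f"{name} must be a string")
        if "enum" in spec and value not in spec["enum"]:
            errors.append(f"{name} must be one of {spec['enum']}")
    return errors


# Same shape as the control_light parameters, with illustrative room names.
light_schema = {
    "type": "object",
    "properties": {
        "room": {"type": "string", "enum": ["living room", "bedroom", "kitchen"]},
        "action": {"type": "string", "enum": ["on", "off"]},
    },
    "required": ["room", "action"],
}

if __name__ == "__main__":
    print(validate_arguments(light_schema, {"room": "kitchen", "action": "on"}))
    print(validate_arguments(light_schema, {"room": "garage"}))
```

Rejecting a malformed call early is useful in a voice setting, where speech recognition errors can lead the model to produce argument values outside the allowed set.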

2. Register the tools with MCP

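from mcp.server import Server
from mcp.types import TextContent, Tool

server = Server("hachimi-voice-assistant")

@server.list_tools()
async def list_tools() -> list[Tool]:
    # Advertise the available tools (descriptions and schemas elided here)
    return [
        Tool(name="get_weather", description="...", inputSchema={...}),
        Tool(name="control_light", description="...", inputSchema={...}),
    ]

@server.call_tool()
async def call_tool(name: str, arguments: dict) -> list[TextContent]:
    # Dispatch by tool name; fetch_weather and toggle_light are defined elsewhere
    if name == "get_weather":
        result = await fetch_weather(arguments["city"])
    elif name == "control_light":
        result = await toggle_light(arguments["room"], arguments["action"])
    else:
        raise ValueError(f"Unknown tool: {name}")
    # MCP tool results are returned as a list of content items
    return [TextContent(type="text", text=str(result))]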
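As the number of tools grows, the if/elif chain can be replaced with a lookup table. This is one possible sketch and not necessarily how hachimi does it: the tool decorator, the TOOL_HANDLERS registry, and the stub handlers are all hypothetical names introduced for this example.

```python
import asyncio
from typing import Awaitable, Callable

ToolHandler = Callable[..., Awaitable[str]]
TOOL_HANDLERS: dict[str, ToolHandler] = {}


def tool(name: str):
    """Register an async function as the handler for a tool name."""
    def decorator(func: ToolHandler) -> ToolHandler:
        TOOL_HANDLERS[name] = func
        return func
    return decorator


@tool("get_weather")
async def get_weather(city: str) -> str:
    # Placeholder: a real handler would call a weather API here.
    return f"Weather for {city}: (stub)"


@tool("control_light")
async def control_light(room: str, action: str) -> str:
    # Placeholder: a real handler would talk to the smart-home hub.
    return f"{room} light turned {action}"


async def dispatch(name: str, arguments: dict) -> str:
    """Look up the handler by name and call it with keyword arguments."""
    handler = TOOL_HANDLERS.get(name)
    if handler is None:
        raise ValueError(f"Unknown tool: {name}")
    return await handler(**arguments)


if __name__ == "__main__":
    print(asyncio.run(dispatch("control_light", {"room": "kitchen", "action": "off"})))
```

With this layout, adding a tool means writing one decorated function; the dispatch code itself stays unchanged.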

3. Integrate with the LLM conversation

When talking to the LLM, pass the tool list in as context:

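messages = [
    {"role": "system", "content": "You are a voice assistant..."},
    {"role": "user", "content": user_input}
]

response = await llm.chat(
    messages=messages,
    tools=available_tools,  # pass in the tool list
    tool_choice="auto"
)

# Check whether the LLM wants to call any tools
if response.tool_calls:
    # Note: OpenAI-style chat APIs also expect the assistant message that
    # requested the tool calls to be appended to messages before the results
    for tool_call in response.tool_calls:
        result = await execute_tool(tool_call)
        # Return the tool result to the LLM
        messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": result
        })
    # Ask the LLM again to get the final reply
    final_response = await llm.chat(messages=messages)
else:
    # No tool needed: the first response is already the final reply
    final_response = response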
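A model may ask for tools more than once before it answers, so the request/execute cycle is often written as a loop with a round limit. The sketch below shows that loop against a fake model so that it runs without any API key. FakeLLM, Reply, ToolCall, run_turn, and the canned "sunny" result are all hypothetical stand-ins, not a real client library.

```python
import asyncio
import json
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    id: str
    name: str
    arguments: dict


@dataclass
class Reply:
    content: str = ""
    tool_calls: list[ToolCall] = field(default_factory=list)


class FakeLLM:
    """Hypothetical chat client: requests one tool, then answers."""

    async def chat(self, messages: list[dict]) -> Reply:
        tool_results = [m for m in messages if m["role"] == "tool"]
        if not tool_results:
            return Reply(tool_calls=[ToolCall("call-1", "get_weather", {"city": "Beijing"})])
        return Reply(content=f"Here is what I found: {tool_results[-1]['content']}")


async def execute_tool(call: ToolCall) -> str:
    # Placeholder result; a real version would forward the call to the MCP server.
    return json.dumps({"tool": call.name, "args": call.arguments, "result": "sunny"})


async def run_turn(llm, user_input: str, max_rounds: int = 5) -> str:
    """Keep calling the model until it answers without requesting tools."""
    messages = [{"role": "user", "content": user_input}]
    for _ in range(max_rounds):
        reply = await llm.chat(messages)
        if not reply.tool_calls:
            return reply.content
        # Record the assistant's tool request, then each tool result.
        messages.append({"role": "assistant", "content": reply.content,
                         "tool_calls": [c.id for c in reply.tool_calls]})
        for call in reply.tool_calls:
            messages.append({"role": "tool", "tool_call_id": call.id,
                             "content": await execute_tool(call)})
    raise RuntimeError("model kept requesting tools; giving up")


if __name__ == "__main__":
    print(asyncio.run(run_turn(FakeLLM(), "What's the weather in Beijing?")))
```

The round limit is a safeguard: without it, a model that keeps requesting tools would keep the user waiting indefinitely.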

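Implementing Barge-in

Barge-in is an important feature for a voice assistant: while the assistant is speaking, the user can issue a new command directly.

The approach:

Keep wake-word detection running while the assistant is speaking
Stop TTS playback as soon as the wake word is detected
Discard the current conversation context and start a new round of interaction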

import asyncio

class VoiceAssistant:
    def __init__(self):
        self.wake_word = WakeWordDetector()
        self.tts = TTSPlayer()
        self.interrupted = asyncio.Event()

    async def speak(self, text):
        """Play speech with barge-in detection; return True if not interrupted."""
        self.interrupted.clear()

        async def check_interrupt():
            # Assumes tts.is_playing is already True once tts.speak() has started
            while self.tts.is_playing:
                if self.wake_word.detected():
                    self.tts.stop()
                    self.interrupted.set()
                    return
                await asyncio.sleep(0.05)

        # Run playback and interrupt detection concurrently
        await asyncio.gather(
            self.tts.speak(text),
            check_interrupt()
        )

        return not self.interrupted.is_set()
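For comparison, here is a self-contained, event-driven variant of the same idea that can be run without audio hardware. Instead of polling every 50 ms, it waits on whichever finishes first: playback or a wake event. FakeTTS, speak_with_barge_in, and the timings are assumptions made for the demo; they do not reflect hachimi's real classes.

```python
import asyncio


class FakeTTS:
    """Hypothetical player: 'speaks' one word every 10 ms until stopped."""

    def __init__(self):
        self.spoken: list[str] = []
        self._stopped = False

    async def speak(self, text: str) -> None:
        self._stopped = False
        for word in text.split():
            if self._stopped:
                break
            self.spoken.append(word)
            await asyncio.sleep(0.01)

    def stop(self) -> None:
        self._stopped = True


async def speak_with_barge_in(tts: FakeTTS, text: str, wake: asyncio.Event) -> bool:
    """Play text; return True if it finished, False if the wake event cut it off."""
    play = asyncio.create_task(tts.speak(text))
    wait_wake = asyncio.create_task(wake.wait())
    done, _ = await asyncio.wait({play, wait_wake}, return_when=asyncio.FIRST_COMPLETED)
    interrupted = play not in done
    if interrupted:
        tts.stop()       # the wake word arrived first: stop playback
    await play           # let the player exit cleanly
    wait_wake.cancel()   # no-op if the wake task already finished
    return not interrupted


async def demo() -> tuple[bool, int]:
    tts, wake = FakeTTS(), asyncio.Event()
    # Simulate the wake word arriving 35 ms into a ten-word sentence.
    asyncio.get_running_loop().call_later(0.035, wake.set)
    finished = await speak_with_barge_in(
        tts, "one two three four five six seven eight nine ten", wake)
    return finished, len(tts.spoken)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The trade-off is that the wake-word detector must be able to set an asyncio event, whereas the polling version only needs a detected() method.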

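Real-World Use Cases

The voice assistant can already:

Control smart devices at home (lights, air conditioner, curtains)
Look up weather, news, stock prices, and similar information
Set reminders and alarms
Hold multi-turn conversations (using the conversation history)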

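Problems Encountered

Latency

Voice interaction is very sensitive to latency. Optimizations:

Use streaming speech recognition, uploading audio while it is still being recorded
Cache the results of frequently used tools locally
Use streaming TTS synthesis, so playback does not wait for the full text to be generated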
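The local caching idea can be sketched as a small time-to-live cache. This is only one possible shape for it: the TTLCache class, its key format, and the 60-second lifetime in the usage note are assumptions, and the clock is injectable purely so the behavior can be tested without waiting.

```python
import time
from typing import Any, Callable


class TTLCache:
    """Cache tool results for a limited time (time-to-live, in seconds)."""

    def __init__(self, ttl: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._store: dict[tuple, tuple[float, Any]] = {}

    def get_or_compute(self, key: tuple, compute: Callable[[], Any]) -> Any:
        """Return a fresh cached value, or call compute() and cache its result."""
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]  # still fresh: skip the slow call
        value = compute()
        self._store[key] = (now, value)
        return value
```

For example, with cache = TTLCache(ttl=60), repeated calls to cache.get_or_compute(("get_weather", "Beijing"), fetch) within a minute invoke fetch only once. A short lifetime suits data such as weather, while device control calls should not be cached at all.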

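False Wake-ups

Background noise causes false wake-ups. Solutions:

Raise the VAD (voice activity detection) threshold
Add a second confirmation after wake-up (such as an "I'm here" response)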
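To show why the threshold matters, here is a toy energy-based VAD. It is a sketch under several assumptions: real detectors are usually more sophisticated than a root-mean-square check, the threshold values are made up, and the requirement of several consecutive loud frames is an extra safeguard added for this example rather than something the project is known to use.

```python
import math


def rms(frame: list[float]) -> float:
    """Root-mean-square energy of one audio frame (samples in [-1, 1])."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def is_speech(frame: list[float], threshold: float) -> bool:
    """Treat a frame as speech when its energy reaches the threshold."""
    return rms(frame) >= threshold


def should_wake(frames: list[list[float]], threshold: float,
                min_consecutive: int = 3) -> bool:
    """Trigger only after several consecutive frames exceed the threshold."""
    run = 0
    for frame in frames:
        run = run + 1 if is_speech(frame, threshold) else 0
        if run >= min_consecutive:
            return True
    return False


if __name__ == "__main__":
    noise = [0.05] * 160   # quiet background hum
    speech = [0.3] * 160   # much louder frame
    print(should_wake([noise] * 5, threshold=0.02))   # low threshold
    print(should_wake([noise] * 5, threshold=0.1))    # raised threshold
    print(should_wake([speech] * 3, threshold=0.1))
```

Raising the threshold trades sensitivity for robustness: quiet speech far from the microphone may also be rejected, which is why a follow-up confirmation step is a useful complement.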

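Summary

The MCP protocol gives LLM applications a standardized way to call tools. With it, the voice assistant can flexibly gain new features without changes to its core code. This project also gave me a deeper understanding of real-time systems and multi-process programming.

Project repository: https://github.com/cyijun/hachimi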